
Removing non-English text from a corpus in R using tm()

I'm doing some basic data mining in R with tm() and wordcloud(), but I'm running into trouble because my dataset contains non-English characters (even though I've tried to filter out other languages based on background variables).

Let's say some lines in my TXT file (saved as UTF-8 in TextWrangler) look like this:

Special
satisfação
Happy
Sad
Potential für

I then read my txt file into R:

words <- Corpus(DirSource("~/temp", encoding = "UTF-8"), readerControl = list(language = "lat"))

This produces a warning message:

Warning message:
In readLines(y, encoding = x$Encoding) :
  incomplete final line found on '/temp/file.txt'

But since it's a warning, not an error, I pushed on.

words <- tm_map(words, stripWhitespace)
words <- tm_map(words, tolower)

which then produces the error:

Error in FUN(X[[1L]], ...) : invalid input 'satisfa��o' in 'utf8towcs'

I'm open to a way of filtering out the non-English characters either in TextWrangler or in R, whichever is most expedient. Thanks for your help!

1 Answer

  • 9

    Here's a way to remove words that contain non-ASCII characters before making the corpus:

    # remove words with non-ASCII characters
    # assuming you read your txt file in as a vector, eg. 
    # dat <- readLines('~/temp/dat.txt')
    dat <- "Special,  satisfação, Happy, Sad, Potential, für"
    # convert string to vector of words
    dat2 <- unlist(strsplit(dat, split=", "))
    # find indices of words with non-ASCII characters
    dat3 <- grep("dat2", iconv(dat2, "latin1", "ASCII", sub="dat2"))
    # subset original vector of words to exclude words with non-ASCII char
    dat4 <- dat2[-dat3]
    # convert vector back to a string
    dat5 <- paste(dat4, collapse = ", ")
    # make corpus
    require(tm)
    words1 <- Corpus(VectorSource(dat5))
    inspect(words1)
    
    A corpus with 1 text document
    
    The metadata consists of 2 tag-value pairs and a data frame
    Available tags are:
      create_date creator 
    Available variables in the data frame are:
      MetaID 
    
    [[1]]
    Special, Happy, Sad, Potential
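
    If you'd rather keep reading the files with DirSource() and clean things up inside the tm pipeline instead, something along these lines may also work. This is only a sketch: it strips the offending non-ASCII characters rather than dropping whole words, and content_transformer() assumes a reasonably recent version of tm.

    # alternative sketch: clean the corpus after reading it, so tolower() stops failing
    require(tm)
    words <- Corpus(DirSource("~/temp", encoding = "UTF-8"),
                    readerControl = list(language = "lat"))
    # replace bytes that have no ASCII representation with "" (i.e. delete them)
    words <- tm_map(words, content_transformer(function(x) iconv(x, "UTF-8", "ASCII", sub = "")))
    words <- tm_map(words, stripWhitespace)
    words <- tm_map(words, content_transformer(tolower))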
    
