TermDocumentMatrix as.matrix uses a large amount of memory

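I'm currently using the tm package to extract terms to cluster on for duplicate detection in a decently sized database of 25k items (30 MB). This runs on my desktop, but when I try to run it on my server it seems to take an ungodly amount of time. On closer inspection, I found that I had blown through 4 GB of swap running the line apply(posts.TmDoc, 1, sum) to calculate the term frequencies. Furthermore, even running as.matrix produces a 3 GB object on my desktop; see http://imgur.com/a/wllXv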

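Is this really necessary just to generate a frequency count for 18k terms across 25k items? Is there any other way to generate the frequency count without coercing the TermDocumentMatrix to a matrix or a vector?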

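I cannot remove terms based on sparseness, because that is how the actual algorithm is implemented. It looks for terms that are common to at least 2 but not more than 50 items and groups on them, calculating a similarity value for each group.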

Here is the code in context, for reference:

library(tm)

min_word_length = 5
max_word_length = Inf
max_term_occurance = 50
min_term_occurance = 2


# Get All The Posts
Posts = db.getAllPosts()
posts.corpus = Corpus(VectorSource(Posts[,"provider_title"]))

# remove things we don't want
posts.corpus = tm_map(posts.corpus,content_transformer(tolower))
posts.corpus = tm_map(posts.corpus, removePunctuation)
posts.corpus = tm_map(posts.corpus, removeNumbers)
posts.corpus = tm_map(posts.corpus, removeWords, stopwords('english'))

# keep only words of at least 5 characters
posts.TmDoc = TermDocumentMatrix(posts.corpus, control=list(wordLengths=c(min_word_length, max_word_length)))

# get the words that occur at least twice, but fewer than 50 times
clustterms = names(which(apply(posts.TmDoc, 1, sum) >= min_term_occurance  & apply(posts.TmDoc, 1, sum) < max_term_occurance))

1 Answer

Since I never actually need the frequency counts themselves, I can use the findFreqTerms command:

setdiff(findFreqTerms(posts.TmDoc, 2), findFreqTerms(posts.TmDoc, 50))

is the same as

names(which(apply(posts.TmDoc, 1, sum) >= min_term_occurance  & apply(posts.TmDoc, 1, sum) < max_term_occurance))

but runs almost instantly.
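
If the counts themselves are needed after all, one option that avoids the dense conversion is to sum the rows of the sparse matrix directly. A dense 18k x 25k matrix of doubles needs roughly 18,000 * 25,000 * 8 bytes, about 3.6 GB, which is in line with the 3 GB seen above. The sketch below is only an illustration under stated assumptions: it uses row_sums from the slam package (the sparse matrix package that tm builds on), and the toy titles, the names toy.corpus, toy.tdm and term_freq, and the bounds 2 and 3 are all made up for the example.

```r
library(tm)    # TermDocumentMatrix, findFreqTerms
library(slam)  # row_sums for sparse simple_triplet_matrix objects

# Toy corpus standing in for Posts[, "provider_title"] (illustrative data only)
titles <- c("alpha beta gamma", "alpha beta", "alpha delta")
toy.corpus <- Corpus(VectorSource(titles))
toy.tdm <- TermDocumentMatrix(toy.corpus)

# Sum each term's row on the sparse representation, without as.matrix()
term_freq <- slam::row_sums(toy.tdm)

# Same filter shape as the apply() line in the question:
# at least 2 occurrences, but fewer than 3 in this toy example
clustterms <- names(term_freq[term_freq >= 2 & term_freq < 3])

# findFreqTerms also accepts an inclusive upper bound as its third argument,
# so the setdiff() can be written as a single call
same_terms <- findFreqTerms(toy.tdm, 2, 2)
```

On the real data, the equivalent calls would be slam::row_sums(posts.TmDoc) and findFreqTerms(posts.TmDoc, min_term_occurance, max_term_occurance - 1), assuming the installed tm version supports the highfreq argument.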

Answered 2024-05-16T07:46:27+08:00
