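I am looking for a better way to do something I can already do: filter a series of n-gram tokens using "stop words", so that the occurrence of any stop word term in an n-gram triggers removal.
I would very much like a single solution that works for both unigrams and n-grams, although two versions would be fine, one with a "fixed" flag and one with a "regex" flag. I am putting the two aspects of the question together because someone may have a solution that takes a different approach and handles both fixed and regular expression stopword patterns.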
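Formats:
- tokens is a list of character vectors, which may be unigrams or n-grams concatenated by a _ (underscore) character.
- stopwords is a character vector. Right now I am content to let this be a fixed string, but being able to implement it with regular expression formatted stopwords would be a nice bonus.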
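Desired Output: A list of character vectors matching the input tokens, but with every token removed that has a component matching a stop word. (This means a unigram match, or a match to one of the terms that the n-gram comprises.)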
Examples, test data, and working code and benchmarks to build on:
tokens1 <- list(text1 = c("this", "is", "a", "test", "text", "with", "a", "few", "words"),
                text2 = c("some", "more", "words", "in", "this", "test", "text"))
tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text", "text_with", "with_a", "a_few", "few_words"),
                text2 = c("some_more", "more_words", "words_in", "in_this", "this_text", "text_text"))
tokens3 <- list(text1 = c("this_is_a", "is_a_test", "a_test_text", "test_text_with", "text_with_a", "with_a_few", "a_few_words"),
                text2 = c("some_more_words", "more_words_in", "words_in_this", "in_this_text", "this_text_text"))
stopwords <- c("is", "a", "in", "this")
# remove any single token that matches a stopword
removeTokensOP1 <- function(w, stopwords) {
    # logical indexing: x[-which(...)] would return an empty vector when nothing matches
    lapply(w, function(x) x[!(x %in% stopwords)])
}
# remove any n-gram in which at least one component word is a stopword
removeTokensOP2 <- function(w, stopwords) {
    matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
    # grepl() instead of -grep(), for the same reason as above
    lapply(w, function(x) x[!grepl(matchPattern, x)])
}
removeTokensOP1(tokens1, stopwords)
## $text1
## [1] "test" "text" "with" "few" "words"
##
## $text2
## [1] "some" "more" "words" "test" "text"
removeTokensOP2(tokens1, stopwords)
## $text1
## [1] "test" "text" "with" "few" "words"
##
## $text2
## [1] "some" "more" "words" "test" "text"
removeTokensOP2(tokens2, stopwords)
## $text1
## [1] "test_text" "text_with" "few_words"
##
## $text2
## [1] "some_more" "more_words" "text_text"
removeTokensOP2(tokens3, stopwords)
## $text1
## [1] "test_text_with"
##
## $text2
## [1] "some_more_words"
# performance benchmarks for answers to build on
library(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, stopwords),
               OP2_1 = removeTokensOP2(tokens1, stopwords),
               OP2_2 = removeTokensOP2(tokens2, stopwords),
               OP2_3 = removeTokensOP2(tokens3, stopwords),
               unit = "relative")
## Unit: relative
##   expr      min       lq     mean   median       uq      max neval
##  OP1_1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
##  OP2_1 5.119066 3.812845 3.438076 3.714492 3.547187 2.838351   100
##  OP2_2 5.230429 3.903135 3.509935 3.790143 3.631305 2.510629   100
##  OP2_3 5.204924 3.884746 3.578178 3.753979 3.553729 8.240244   100
3 Answers
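This is not really an answer, more a comment on the idea of going through all combinations of stopwords. With a longer stopwords list, using something like %in% does not seem to suffer from that dimensionality problem.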
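One way the %in% idea might carry over to n-grams is to split each token on the underscore and test the components. The following is only a sketch under assumptions: removeTokensSplit and its regex argument are hypothetical names that do not appear in the original post, and in regex mode each stopword is assumed to be a pattern that must match a whole component.

```r
# Hypothetical helper (not from the original post): drop every token whose
# underscore-separated components include a stopword.
removeTokensSplit <- function(tokens, stopwords, regex = FALSE) {
  lapply(tokens, function(x) {
    # split each n-gram into its component words
    parts <- strsplit(x, "_", fixed = TRUE)
    if (regex) {
      # assumption: each stopword is a regex that must match a whole component
      pattern <- paste0("^(", paste(stopwords, collapse = "|"), ")$")
      hit <- vapply(parts, function(p) any(grepl(pattern, p)), logical(1))
    } else {
      # fixed matching: a simple set-membership test per component
      hit <- vapply(parts, function(p) any(p %in% stopwords), logical(1))
    }
    x[!hit]
  })
}

tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text",
                          "text_with", "with_a", "a_few", "few_words"),
                text2 = c("some_more", "more_words", "words_in",
                          "in_this", "this_text", "text_text"))
stopwords <- c("is", "a", "in", "this")

res_fixed <- removeTokensSplit(tokens2, stopwords)
# regex stopwords chosen here to match the same four words
res_regex <- removeTokensSplit(tokens2, c("th.*", "i[sn]", "a"), regex = TRUE)
```

Because unigrams contain no underscore, the same function handles them unchanged. Whether the split costs more than it saves on a long stopword list would need to be benchmarked.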
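If your list has many levels, we can improve on lapply by using the parallel package.
Create a list with many levels. We do this because the parallel package has a lot of setup overhead, so simply increasing the number of iterations in microbenchmark would keep incurring that cost. By increasing the size of the list instead, you can see the real improvement.
As the number of levels in the list increases, performance will improve.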
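The idea could be sketched as follows. This is an illustration under assumptions, not the answer's original code: the helper remove_fixed, the repetition factor of 1000, and the choice of two cores are all arbitrary, and mclapply from the parallel package is used because it mirrors lapply. It forks processes, so the sketch falls back to one core on Windows.

```r
library(parallel)

stopwords <- c("is", "a", "in", "this")
tokens1 <- list(text1 = c("this", "is", "a", "test", "text", "with", "a", "few", "words"),
                text2 = c("some", "more", "words", "in", "this", "test", "text"))

# Build a list with many levels by repeating the small example
# (the factor 1000 is an arbitrary choice for illustration).
big_tokens <- rep(tokens1, 1000)

# Hypothetical per-element worker: keep tokens that are not stopwords.
remove_fixed <- function(x, stopwords) x[!(x %in% stopwords)]

# mclapply() relies on forking, which is unavailable on Windows,
# so use a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

res_serial   <- lapply(big_tokens, remove_fixed, stopwords = stopwords)
res_parallel <- mclapply(big_tokens, remove_fixed, stopwords = stopwords,
                         mc.cores = n_cores)
```

Both calls should return the same result; any speed-up depends on the machine, the number of cores, and how large the list is relative to the setup cost.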
You might want to simplify your regex; the ^ and $ anchors are adding overhead.
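One possible way to drop the anchors, sketched here under assumptions, is to pad every token with an underscore on both sides so that each component is always delimited by underscores, then match a single alternation. The function name removeTokensPadded is hypothetical, and the sketch assumes the stopwords contain no regex metacharacters unless they are meant as patterns. Whether this is actually faster would have to be measured.

```r
# Hypothetical variant (not from the original post): avoid ^ and $ by
# padding each token with underscores before matching.
removeTokensPadded <- function(w, stopwords) {
  # a single alternation wrapped in underscores, e.g. "_(is|a|in|this)_"
  pattern <- paste0("_(", paste(stopwords, collapse = "|"), ")_")
  lapply(w, function(x) x[!grepl(pattern, paste0("_", x, "_"))])
}

stopwords <- c("is", "a", "in", "this")
tokens3 <- list(text1 = c("this_is_a", "is_a_test", "a_test_text", "test_text_with",
                          "text_with_a", "with_a_few", "a_few_words"),
                text2 = c("some_more_words", "more_words_in", "words_in_this",
                          "in_this_text", "this_text_text"))

res_padded <- removeTokensPadded(tokens3, stopwords)
# res_padded$text1 is "test_text_with"; res_padded$text2 is "some_more_words"
```

The padding step allocates a new character vector per list element, so part of the saving from the simpler pattern may be spent there.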