首页 文章

如何从R中的ngram标记列表中有效地删除停用词

提问于
浏览
19

这是一种更好的方法,可以做一些我已经无法做到的事情: filter a series of n-gram tokens using "stop words" ,这样n-gram中任何停用词术语的出现都会触发删除 .

我非常希望有一个解决方案适用于unigrams和n-gram,虽然可以有两个版本,一个带有“固定”标志,另一个带有“正则表达式”标志 . 我将这个问题的两个方面放在一起,因为有人可能有一个解决方案尝试一种解决固定和正则表达式停用词模式的不同方法 .

格式:

  • tokens 是一个字符向量列表,可以是unigrams,也可以是由 _ (下划线)字符连接的n-gram .

  • stopwords 是一个字符向量 . 现在我满足于让它成为一个固定的字符串,但是能够使用正则表达式格式化的停用词实现它将是一个很好的奖励 .

Desired Output: 与输入 tokens 匹配的字符列表,但任何组件标记与要删除的停用词匹配 . (这意味着unigram匹配,或与n-gram包含的术语之一匹配 . )

Examples, test data, and working code and benchmarks to build on:

tokens1 <- list(text1 = c("this", "is", "a", "test", "text", "with", "a", "few", "words"), 
                text2 = c("some", "more", "words", "in", "this", "test", "text"))
tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text", "text_with", "with_a", "a_few", "few_words"), 
                text2 = c("some_more", "more_words", "words_in", "in_this", "this_text", "text_text"))
tokens3 <- list(text1 = c("this_is_a", "is_a_test", "a_test_text", "test_text_with", "text_with_a", "with_a_few", "a_few_words"),
                text2 = c("some_more_words", "more_words_in", "words_in_this", "in_this_text", "this_text_text"))
stopwords <- c("is", "a", "in", "this")

# remove any single token that matches a stopword
removeTokensOP1 <- function(w, stopwords) {
    lapply(w, function(x) x[-which(x %in% stopwords)])
}

# remove any word pair where a single word contains a stopword
removeTokensOP2 <- function(w, stopwords) {
    matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
    lapply(w, function(x) x[-grep(matchPattern, x)])
}

removeTokensOP1(tokens1, stopwords)
## $text1
## [1] "test"  "text"  "with"  "few"   "words"
## 
## $text2
## [1] "some"  "more"  "words" "test"  "text" 

removeTokensOP2(tokens1, stopwords)
## $text1
## [1] "test"  "text"  "with"  "few"   "words"
## 
## $text2
## [1] "some"  "more"  "words" "test"  "text" 

removeTokensOP2(tokens2, stopwords)
## $text1
## [1] "test_text" "text_with" "few_words"
## 
## $text2
## [1] "some_more"  "more_words" "text_text" 

removeTokensOP2(tokens3, stopwords)
## $text1
## [1] "test_text_with"
## 
## $text2
## [1] "some_more_words"

# performance benchmarks for answers to build on
require(microbenchmark)
microbenchmark(OP1_1 = removeTokensOP1(tokens1, stopwords),
               OP2_1 = removeTokensOP2(tokens1, stopwords),
               OP2_2 = removeTokensOP2(tokens2, stopwords),
               OP2_3 = removeTokensOP2(tokens3, stopwords),
               unit = "relative")
## Unit: relative
## expr      min       lq     mean   median       uq      max neval
## OP1_1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100
## OP2_1 5.119066 3.812845 3.438076 3.714492 3.547187 2.838351   100
## OP2_2 5.230429 3.903135 3.509935 3.790143 3.631305 2.510629   100
## OP2_3 5.204924 3.884746 3.578178 3.753979 3.553729 8.240244   100

3 回答

  • 1

    这不是贯穿所有停用词组合的评论 . 使用更长的 stopwords 列表,使用类似 %in% 的内容似乎不会遇到该维度问题 .

    library(purrr)
    removetokenstst <- function(tokens, stopwords) 
      map2(tokens, 
           lapply(tokens3, function(x) { 
             unlist(lapply(strsplit(x, "_"), function(y) { 
               any(y %in% stopwords) 
             })) 
           }), 
           ~ .x[!.y])
    
    require(microbenchmark)
    microbenchmark(OP1_1 = removeTokensOP1(tokens1, morestopwords),
               OP2_1 = removeTokensOP2(tokens1, morestopwords),
               OP2_2 = removeTokensOP2(tokens2, morestopwords),
               OP2_3 = removeTokensOP2(tokens3, morestopwords),
               Ak_3 = removetokenstst(tokens3, stopwords),
               Ak_3msw = removetokenstst(tokens3, morestopwords),
               unit = "relative")
    
    Unit: relative
        expr       min        lq       mean    median        uq      max neval
       OP1_1   1.00000   1.00000   1.000000  1.000000  1.000000  1.00000   100
       OP2_1 278.48260 176.22273  96.462854 79.787932 76.904987 38.31767   100
       OP2_2 280.90242 181.22013  98.545148 81.407928 77.637006 64.94842   100
       OP2_3 279.43728 183.11366 114.879904 81.404236 82.614739 72.04741   100
        Ak_3  15.74301  14.83731   9.340444  7.902213  8.164234 11.27133   100
     Ak_3msw  18.57697  14.45574  12.936594  8.513725  8.997922 24.03969   100
    

    Stopwords

    morestopwords = c("a", "about", "above", "after", "again", "against", "all", 
    "am", "an", "and", "any", "are", "arent", "as", "at", "be", "because", 
    "been", "before", "being", "below", "between", "both", "but", 
    "by", "cant", "cannot", "could", "couldnt", "did", "didnt", "do", 
    "does", "doesnt", "doing", "dont", "down", "during", "each", 
    "few", "for", "from", "further", "had", "hadnt", "has", "hasnt", 
    "have", "havent", "having", "he", "hed", "hell", "hes", "her", 
    "here", "heres", "hers", "herself", "him", "himself", "his", 
    "how", "hows", "i", "id", "ill", "im", "ive", "if", "in", "into", 
    "is", "isnt", "it", "its", "its", "itself", "lets", "me", "more", 
    "most", "mustnt", "my", "myself", "no", "nor", "not", "of", "off", 
    "on", "once", "only", "or", "other", "ought", "our", "ours", 
    "ourselves", "out", "over", "own", "same", "shant", "she", "shed", 
    "shell", "shes", "should", "shouldnt", "so", "some", "such", 
    "than", "that", "thats", "the", "their", "theirs", "them", "themselves", 
    "then", "there", "theres", "these", "they", "theyd", "theyll", 
    "theyre", "theyve", "this", "those", "through", "to", "too", 
    "under", "until", "up", "very", "was", "wasnt", "we", "wed", 
    "well", "were", "weve", "were", "werent", "what", "whats", "when", 
    "whens", "where", "wheres", "which", "while", "who", "whos", 
    "whom", "why", "whys", "with", "wont", "would", "wouldnt", "you", 
    "youd", "youll", "youre", "youve", "your", "yours", "yourself", 
    "yourselves", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", 
    "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", 
    "x", "y", "z")
    
  • 5

    如果您使用 parallel 包在列表中有多个级别,我们可以改进 lapply .

    Create many levels

    tokens2 <- list(text1 = c("this_is", "is_a", "a_test", "test_text", "text_with", "with_a", "a_few", "few_words"), 
                    text2 = c("some_more", "more_words", "words_in", "in_this", "this_text", "text_text"))
    tokens2 <- lapply(1:500,function(x) sample(tokens2,1)[[1]])
    

    我们这样做是因为并行包有很多设置开销,所以只增加microbenchmark上的迭代次数将继续产生这种成本 . 通过增加列表的大小,您可以看到真正的改进 .

    library(parallel)
    #Setup
    cl <- detectCores()
    cl <- makeCluster(cl)
    
    #Two functions:
    
    #original
    removeTokensOP2 <- function(w, stopwords) { 
      matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
      lapply(w, function(x) x[-grep(matchPattern, x)])
    }
    
    #new
    removeTokensOPP <- function(w, stopwords) {
      matchPattern <- paste0("(^|_)", paste(stopwords, collapse = "(_|$)|(^|_)"), "(_|$)")
      return(w[-grep(matchPattern, w)])
    }
    
    #compare
    
    microbenchmark(
      OP2_P = parLapply(cl,tokens2,removeTokensOPP,stopwords),
      OP2_2 = removeTokensOP2(tokens2, stopwords),
      unit = 'relative'
    )
    
    Unit: relative
      expr      min       lq     mean   median       uq      max neval
     OP2_P 1.000000 1.000000 1.000000 1.000000 1.000000  1.00000   100
     OP2_2 1.730565 1.653872 1.678781 1.562258 1.471347 10.11306   100
    

    随着列表中级别数的增加,性能将得到改善 .

  • 1

    你认为你想要简化你的正则表达式,^和$正在增加开销

    remove_short <- function(x, stopwords) {
      stopwords_regexp <- paste0('(^|_)(', paste(stopwords, collapse = '|'), ')(_|$)')
      lapply(x, function(x) x[!grepl(stopwords_regexp, x)])
    }
    require(microbenchmark)
    microbenchmark(OP1_1 = removeTokensOP1(tokens1, stopwords),
                   OP2_1 = removeTokensOP2(tokens2, stopwords),
                   OP2_2 = remove_short(tokens2, stopwords),
                   unit = "relative")
    Unit: relative
      expr      min       lq     mean   median       uq      max neval cld
     OP1_1 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000   100 a  
     OP2_1 5.178565 4.768749 4.465138 4.441130 4.262399 4.266905   100   c
     OP2_2 3.452386 3.247279 3.063660 3.068571 2.963794 2.948189   100  b
    

相关问题