
Pandas CountVectorizer: how to quickly filter rows


I have a text column in a pandas DataFrame:

df['TEXT_COL']

Then I apply CountVectorizer to it:

vectorizer = CountVectorizer()
v = vectorizer.fit_transform(df['TEXT_COL'])

and get the set of words/features:

ft = vectorizer.get_feature_names()

and the term-document matrix (TDM):

m = vectorizer.transform(df['TEXT_COL'])

What I need: a slice of df that contains only the rows in which a particular feature from the feature set ft occurs.

How can I do that?

Pandas setup:

import pandas as pd

data = ['Word', 'Word Sea Ocean', 'Tree', 'Forest Tree']

df = pd.DataFrame(data, columns=['TEXT_COL'])

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
v = vectorizer.fit_transform(df['TEXT_COL'])

ft = vectorizer.get_feature_names()
m = vectorizer.transform(df['TEXT_COL'])


for f in ft: ???

1 Answer


    Here is a small demo:

    # execute your setup script ...
    
    In [48]: vectorizer.vocabulary_
    Out[48]: {'forest': 0, 'ocean': 1, 'sea': 2, 'tree': 3, 'word': 4}
    

    m is a sparse matrix:

    In [49]: m
    Out[49]:
    <4x5 sparse matrix of type '<class 'numpy.int64'>'
            with 7 stored elements in Compressed Sparse Row format>
    

    We can convert it to a regular NumPy array:

    In [50]: m.toarray()
    Out[50]:
    array([[0, 0, 0, 0, 1],
           [0, 1, 1, 0, 1],
           [0, 0, 0, 1, 0],
           [1, 0, 0, 1, 0]], dtype=int64)
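
For small inputs it can help to label the dense array with the feature names. The following is a minimal sketch, assuming the same four-row setup as in the question; the name `counts` is only illustrative, and the column labels are taken from `vectorizer.vocabulary_` ordered by column index.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'TEXT_COL': ['Word', 'Word Sea Ocean', 'Tree', 'Forest Tree']})
vectorizer = CountVectorizer()
m = vectorizer.fit_transform(df['TEXT_COL'])

# Order the feature names by their column index in the fitted vocabulary.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Dense, labeled copy of the counts; only sensible when the matrix is small.
counts = pd.DataFrame(m.toarray(), index=df.index, columns=columns)
print(counts)
```

Because `counts` shares its index with `df`, an expression such as `df[counts['tree'] > 0]` selects the matching rows, at the cost of materializing the whole matrix in memory.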
    

    How to pull out the column for a particular feature:

    In [51]: m[:, vectorizer.vocabulary_['sea']].toarray()
    Out[51]:
    array([[0],
           [1],
           [0],
           [0]], dtype=int64)
    

    or using ft:

    In [57]: m[:, ft.index('sea')].toarray()
    Out[57]:
    array([[0],
           [1],
           [0],
           [0]], dtype=int64)
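
The `ft.index(...)` lookup relies on `ft` being a Python list. In newer scikit-learn releases `get_feature_names()` has been replaced by `get_feature_names_out()`, which returns a NumPy array without an `.index` method. The sketch below, assuming the same four texts as in the question, builds `ft` in a way that should work with either method and checks that both lookups point at the same column.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['Word', 'Word Sea Ocean', 'Tree', 'Forest Tree']
vectorizer = CountVectorizer()
m = vectorizer.fit_transform(texts)

# Use whichever accessor this scikit-learn version provides and
# convert to a plain list so that ft.index(...) is available.
if hasattr(vectorizer, 'get_feature_names_out'):
    ft = list(vectorizer.get_feature_names_out())
else:
    ft = vectorizer.get_feature_names()

col = ft.index('sea')
# The vocabulary_ dict maps the same word to the same column index.
assert col == vectorizer.vocabulary_['sea']

# Flatten the (n_rows, 1) column into a 1-D array of counts.
sea_counts = m[:, col].toarray().ravel()
print(sea_counts.tolist())
```

For large vocabularies the dictionary lookup in `vocabulary_` avoids the linear scan that `list.index` performs.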
    
    In [52]: df
    Out[52]:
             TEXT_COL
    0            Word
    1  Word Sea Ocean
    2            Tree
    3     Forest Tree
    

    Let's show all rows that contain the feature 'tree':

    In [71]: idx = m[:, ft.index('tree')] == 1
    
    In [72]: df[idx.toarray()]
    Out[72]:
          TEXT_COL
    2         Tree
    3  Forest Tree
    

    or simply like this:

    In [77]: df[m[:, ft.index('tree')].astype(bool).toarray()]
    Out[77]:
          TEXT_COL
    2         Tree
    3  Forest Tree
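
The same idea can be wrapped in a small helper. This is a sketch rather than a definitive implementation: `rows_with_feature` and `by_feature` are hypothetical names, and the setup again assumes the four-row example from the question. It flattens the column to a 1-D boolean mask and tests `> 0` instead of `== 1`, so rows in which the word occurs more than once are kept as well.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'TEXT_COL': ['Word', 'Word Sea Ocean', 'Tree', 'Forest Tree']})
vectorizer = CountVectorizer()
m = vectorizer.fit_transform(df['TEXT_COL'])


def rows_with_feature(frame, matrix, vocabulary, feature):
    """Return the rows of `frame` whose text contains `feature` at least once."""
    col = vocabulary[feature]
    # (n_rows, 1) sparse column -> dense 1-D boolean mask aligned with the rows.
    mask = matrix[:, col].toarray().ravel() > 0
    return frame[mask]


tree_rows = rows_with_feature(df, m, vectorizer.vocabulary_, 'tree')
print(tree_rows)

# One sub-frame per feature, in the spirit of the `for f in ft` loop
# from the question.
by_feature = {
    f: rows_with_feature(df, m, vectorizer.vocabulary_, f)
    for f in vectorizer.vocabulary_
}
```

Each lookup only densifies a single column, so the full term-document matrix stays sparse.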
    
