
Converting CountVectorizer and TfidfTransformer sparse matrices into separate Pandas DataFrame rows


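Question: What is the best way to convert the sparse matrices produced by sklearn's CountVectorizer and TfidfTransformer into Pandas DataFrame columns, with a separate row for each bigram and its corresponding frequency and tf-idf score?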

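Pipeline: Pull text data from a SQL DB, split the text into bigrams, compute the frequency of each bigram per document and the tf-idf of each bigram per document, then load the results back into the SQL DB.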

Current State:

Two columns of data (number, text) are brought in. The text column is cleaned to produce a third column, cleanText:

number                               text              cleanText
0     123            The farmer plants grain    farmer plants grain
1     234  The farmer and his son go fishing  farmer son go fishing
2     345            The fisher catches tuna    fisher catches tuna
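
For anyone reproducing this, here is a minimal sketch that builds the sample frame shown above. The cleanText values are typed in directly rather than derived, because the cleaning step itself is not shown here; the variable name data matches the code that follows.

```python
import pandas as pd

# Sample rows copied from the table above. cleanText is entered by hand
# (assumption: the real cleaning step lowercases and drops stop words).
data = pd.DataFrame({
    "number": [123, 234, 345],
    "text": [
        "The farmer plants grain",
        "The farmer and his son go fishing",
        "The fisher catches tuna",
    ],
    "cleanText": [
        "farmer plants grain",
        "farmer son go fishing",
        "fisher catches tuna",
    ],
})
print(data)
```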

This DataFrame is fed into sklearn's feature extraction:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(data.cleanText)

tfidf_transformer = TfidfTransformer()
tfidf_mat = tfidf_transformer.fit_transform(dt_mat)

The matrices are then converted to arrays and fed back into the original DataFrame:

data['frequency'] = list(dt_mat.toarray())
data['tfidf_score']=list(tfidf_mat.toarray())

Output:

number                               text              cleanText  \
0     123            The farmer plants grain    farmer plants grain   
1     234  The farmer and his son go fishing  farmer son go fishing   
2     345            The fisher catches tuna    fisher catches tuna   

               frequency                                        tfidf_score  
0  [0, 1, 0, 0, 0, 1, 0]  [0.0, 0.707106781187, 0.0, 0.0, 0.0, 0.7071067...  
1  [0, 0, 1, 0, 1, 0, 1]  [0.0, 0.0, 0.57735026919, 0.0, 0.57735026919, ...  
2  [1, 0, 0, 1, 0, 0, 0]  [0.707106781187, 0.0, 0.0, 0.707106781187, 0.0...

Problems:

  • The feature names (i.e. the bigrams) are not in the DataFrame

  • frequency and tfidf_score are not on a separate row for each bigram

Desired Output:

number                    bigram         frequency      tfidf_score
0     123            farmer plants                 1              0.70  
0     123            plants grain                  1              0.56
1     234            farmer son                    1              0.72
1     234            son go                        1              0.63
1     234            go fishing                    1              0.34
2     345            fisher catches                1              0.43
2     345            catches tuna                  1              0.43

I managed to get one of the numeric columns spread across separate DataFrame rows with the following code:

data.reset_index(inplace=True)
rows = []
_ = data.apply(lambda row: [rows.append([row['number'], nn]) 
                         for nn in row.tfidf_score], axis=1)
df_new = pd.DataFrame(rows, columns=['number', 'tfidf_score'])

Output:

number  tfidf_score
0      123     0.000000
1      123     0.707107
2      123     0.000000
3      123     0.000000
4      123     0.000000
5      123     0.707107
6      123     0.000000
7      234     0.000000
8      234     0.000000
9      234     0.577350
10     234     0.000000
11     234     0.577350
12     234     0.000000
13     234     0.577350
14     345     0.707107
15     345     0.000000
16     345     0.000000
17     345     0.707107
18     345     0.000000
19     345     0.000000
20     345     0.000000

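However, I'm not sure how to do this for both numeric columns, and it doesn't bring in the bigrams (feature names) themselves. This approach also requires an array (which is why I converted the sparse matrices to arrays in the first place), and I'd like to avoid that if possible, both for performance reasons and because I then have to strip out the meaningless rows.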

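Any insight is greatly appreciated! Thank you very much for taking the time to read this question, and I apologize for the length. If there is anything I can do to improve the question or clarify my process, please let me know.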

1 Answer

  • 3

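    You can capture the bigram names with CountVectorizer's get_feature_names() (renamed get_feature_names_out() in scikit-learn 1.0 and later). From there it's just a series of melt and merge operations: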

    print(data)
    
       number                               text              cleanText
    0     123            The farmer plants grain    farmer plants grain
    1     234  The farmer and his son go fishing  farmer son go fishing
    2     345            The fisher catches tuna    fisher catches tuna
    
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    
    cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", stop_words=None, ngram_range=(2,2), analyzer='word')
    dt_mat = cv.fit_transform(data.cleanText)
    
    tfidf_transformer = TfidfTransformer()
    tfidf_mat = tfidf_transformer.fit_transform(dt_mat)
    

    In this case, the CountVectorizer feature names are the bigrams:

    print(cv.get_feature_names())
    
    [u'catches tuna',
     u'farmer plants',
     u'farmer son',
     u'fisher catches',
     u'go fishing',
     u'plants grain',
     u'son go']
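
On scikit-learn 1.0 and later the same names come from get_feature_names_out(), which returns a NumPy array; get_feature_names() was removed in 1.2. A small sketch that works on either side of that change, assuming the three cleaned sample documents from the question:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["farmer plants grain", "farmer son go fishing", "fisher catches tuna"]
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2))
cv.fit(docs)

# Prefer the newer method when it exists, fall back to the older one otherwise.
if hasattr(cv, "get_feature_names_out"):
    names = [str(n) for n in cv.get_feature_names_out()]
else:
    names = list(cv.get_feature_names())
print(names)
```

The order of the names matches the column order of the matrix returned by fit_transform(), which is what makes them usable as column labels.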
    

    CountVectorizer.fit_transform() returns a sparse matrix. We can convert it to a dense representation, wrap it in a DataFrame, and then use the feature names as the column labels:

    bigrams = pd.DataFrame(dt_mat.todense(), index=data.index, columns=cv.get_feature_names())
    bigrams['number'] = data.number
    print(bigrams)
    
       catches tuna  farmer plants  farmer son  fisher catches  go fishing  \
    0             0              1           0               0           0   
    1             0              0           1               0           1   
    2             1              0           0               1           0   
    
       plants grain  son go  number  
    0             1       0     123  
    1             0       1     234  
    2             0       0     345
    

    To convert from wide to long format, use melt().
    Then restrict the results to the bigrams that actually occur (query() is useful here):

    bigrams_long = (pd.melt(bigrams.reset_index(), 
                           id_vars=['index','number'],
                           value_name='bigram_ct')
                     .query('bigram_ct > 0')
                     .sort_values(['index','number']))
    
        index  number        variable  bigram_ct
    3       0     123   farmer plants          1
    15      0     123    plants grain          1
    7       1     234      farmer son          1
    13      1     234      go fishing          1
    19      1     234          son go          1
    2       2     345    catches tuna          1
    11      2     345  fisher catches          1
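
melt() names the former column labels variable by default. If the column should be called bigram, as in the desired output, the var_name parameter sets that directly. A toy sketch on a hypothetical two-document frame (the frame wide and its values are made up for illustration):

```python
import pandas as pd

# Toy wide frame: one row per document, one count column per bigram.
wide = pd.DataFrame({
    "number": [123, 345],
    "farmer plants": [1, 0],
    "catches tuna": [0, 1],
})

long_df = (pd.melt(wide, id_vars=["number"],
                   var_name="bigram", value_name="frequency")
             .query("frequency > 0")      # keep only bigrams present in the document
             .sort_values("number")
             .reset_index(drop=True))
print(long_df)
```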
    

    Now repeat the process for tfidf:

    tfidf = pd.DataFrame(tfidf_mat.todense(), index=data.index, columns=cv.get_feature_names())
    tfidf['number'] = data.number
    
    tfidf_long = pd.melt(tfidf.reset_index(), 
                         id_vars=['index','number'], 
                         value_name='tfidf').query('tfidf > 0')
    

    Finally, merge bigrams and tfidf:

    fulldf = (bigrams_long.merge(tfidf_long, 
                                 on=['index','number','variable'])
                          .set_index('index'))
    
           number        variable  bigram_ct     tfidf
    index                                             
    0         123   farmer plants          1  0.707107
    0         123    plants grain          1  0.707107
    1         234      farmer son          1  0.577350
    1         234      go fishing          1  0.577350
    1         234          son go          1  0.577350
    2         345    catches tuna          1  0.707107
    2         345  fisher catches          1  0.707107
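
The question also asks about avoiding the dense conversion. One possible alternative, offered as a sketch rather than a tested replacement for the steps above, is to read the nonzero entries straight out of the sparse matrices in COO format, which stores a (row, column, value) triple per nonzero entry. The helper to_long and the column names row and col are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

data = pd.DataFrame({
    "number": [123, 234, 345],
    "cleanText": ["farmer plants grain", "farmer son go fishing", "fisher catches tuna"],
})

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(2, 2))
dt_mat = cv.fit_transform(data["cleanText"])
tfidf_mat = TfidfTransformer().fit_transform(dt_mat)

# Feature names, using whichever accessor this scikit-learn version provides.
if hasattr(cv, "get_feature_names_out"):
    names = np.asarray(cv.get_feature_names_out())
else:
    names = np.asarray(cv.get_feature_names())

def to_long(mat, value_name):
    """Hypothetical helper: one row per stored nonzero entry of a sparse matrix."""
    coo = mat.tocoo()
    return pd.DataFrame({"row": coo.row, "col": coo.col, value_name: coo.data})

# Join counts and tf-idf scores on their (document, feature) position.
long_df = to_long(dt_mat, "frequency").merge(
    to_long(tfidf_mat, "tfidf_score"), on=["row", "col"])

# Translate positions back into document numbers and bigram strings.
long_df["number"] = data["number"].to_numpy()[long_df["row"].to_numpy()]
long_df["bigram"] = names[long_df["col"].to_numpy()]

result = (long_df.sort_values(["row", "col"])
                 [["number", "bigram", "frequency", "tfidf_score"]]
                 .reset_index(drop=True))
print(result)
```

Because only nonzero entries are ever materialized, there are no zero rows to filter out afterwards. The resulting frame could then be written back to a database with DataFrame.to_sql(), given a suitable connection.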
    
