Counting the 100 most frequent words in sentences from a Pandas DataFrame

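I have text reviews in one column of a Pandas DataFrame, and I want to find the N most frequent words with their frequency counts (across the whole column, not within a single cell). One approach is to count the words with a Counter by iterating over every row. Is there a better alternative?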

Representative data:

0    a heartening tale of small victories and endu
1    no sophomore slump for director sam mendes  w
2    if you are an actor who can relate to the sea
3    it's this memory-as-identity obviation that g
4    boyd's screenplay ( co-written with guardian

2 Answers

17
```
from collections import Counter
Counter(" ".join(df["text"]).split()).most_common(100)
```
I'm pretty sure this will give you what you want (you may need to remove some non-words from the counter result before calling most_common).
Answered on 2024-05-10T05:38:25+08:00
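The clean-up step mentioned above could look something like this. It is only a sketch: the `top_words` helper, the token pattern, and the tiny stop-word set are all illustrative assumptions, not part of the original answer, and the sample strings are shortened from the question's data.

```python
import re
from collections import Counter

# Assumed token pattern: runs of letters, optionally joined by ' or - (e.g. "boyd's").
TOKEN_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")
# Assumed stop-word set, purely for illustration.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "for"}


def top_words(texts, n=100, stop_words=STOP_WORDS):
    """Return the n most common (word, count) pairs across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase each text and keep only tokens matching TOKEN_RE,
        # which drops bare punctuation such as "(".
        counts.update(TOKEN_RE.findall(text.lower()))
    # Remove unwanted words before asking for the most common ones.
    for word in stop_words:
        counts.pop(word, None)
    return counts.most_common(n)


reviews = [
    "a heartening tale of small victories",
    "no sophomore slump for director sam mendes",
    "boyd's screenplay ( co-written with guardian",
    "a tale of a director",
]
print(top_words(reviews, n=3))
```

Because the helper only iterates over strings, it should also accept the column directly, e.g. `top_words(df["text"].dropna())`; the `dropna()` is there because a missing value has no `.lower()` method.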

In addition to @Joran's solution, you can also use series.value_counts, which works well for large amounts of text or many rows:

```
pd.Series(' '.join(df['text']).lower().split()).value_counts()[:100]
```
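For comparison, here is a variant that stays inside pandas and avoids building one large joined string. This is not from the answer, and I have not benchmarked it; the column name `text` follows the snippets above, and the sample rows are shortened from the question's data.

```python
import pandas as pd

# Small stand-in for the review column.
df = pd.DataFrame({
    "text": [
        "a heartening tale of small victories",
        "no sophomore slump for director sam mendes",
        "a tale of a director",
    ]
})

top = (
    df["text"]
    .str.lower()
    .str.split()      # split each row on whitespace -> list of words per row
    .explode()        # one word per row, across the whole column
    .value_counts()   # counts sorted in descending order
    .head(100)
)
print(top)
```

`Series.explode` requires pandas 0.25 or newer. Rows with missing text end up as NaN after the split and are ignored by `value_counts` by default.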

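From the benchmark, series.value_counts looks noticeably faster than the Counter method: roughly 1.6X on the timings below (44.2 ms vs. 27.1 ms), so somewhat short of twice as fast (2X).

The test used a movie review dataset of 3000 rows, totaling 400K characters and 70K words.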

```
In [448]: %timeit Counter(" ".join(df.text).lower().split()).most_common(100)
10 loops, best of 3: 44.2 ms per loop

In [449]: %timeit pd.Series(' '.join(df.text).lower().split()).value_counts()[:100]
10 loops, best of 3: 27.1 ms per loop
```
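If you want to repeat this comparison outside IPython, a rough sketch with the standard `timeit` module follows. The data here is synthetic (a made-up eight-word vocabulary repeated over 3000 rows), not the movie review dataset, so the absolute numbers and even the ratio may differ on your machine.

```python
import timeit
from collections import Counter

import pandas as pd

# Synthetic stand-in for the review column; the vocabulary is invented.
vocab = ["film", "story", "actor", "director", "plot", "scene", "funny", "dull"]
rows = [
    " ".join(vocab[(i + j) % len(vocab)] for j in range(20))
    for i in range(3000)
]
df = pd.DataFrame({"text": rows})


def with_counter():
    return Counter(" ".join(df["text"]).lower().split()).most_common(100)


def with_value_counts():
    return pd.Series(" ".join(df["text"]).lower().split()).value_counts()[:100]


for fn in (with_counter, with_value_counts):
    # Best of 3 repeats, 10 calls per repeat, reported per call in milliseconds.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1000:.1f} ms per call")
```

A real column with a much larger vocabulary is the more meaningful test, since the number of distinct words affects both approaches.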

Answered on 2024-05-10T05:38:25+08:00
