How do I count the number of sentences, words, and characters in a file?

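I wrote the following code to tokenize the input paragraph from the file samp.txt. Can anyone help me find and print the number of sentences, words, and characters in the file? I am using NLTK in Python.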

>>> import nltk
>>> with open('samp.txt') as f:
...     raw = f.read()
...
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     print(each_sentence)   # prints tokenized sentences from samp.txt
...
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     print(each_word)       # prints tokenized words from samp.txt
...

6 Answers

-3

Try it this way (this program assumes you are working with a single text file in the directory specified by dirpath):

import nltk

# dirpath must be set to the directory that contains the text file
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')

print("The number of sentences =", len(corpusReader.sents()))
print("The number of paragraphs =", len(corpusReader.paras()))
print("The number of words =", len([word for sentence in corpusReader.sents() for word in sentence]))
print("The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word]))

Hope this helps.

Answered 2024-05-05T04:18:14+08:00

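I think this addresses what the question was asking. With the textstat package, counting sentences and characters is very easy. Note that the punctuation at the end of each sentence matters for the sentence count.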

import textstat

your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))

Answered 2024-05-05T04:18:14+08:00

0
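- Characters are easy to count.
- Paragraphs are usually easy to count too. Whenever you see two consecutive newlines, you probably have a paragraph. You might say that an enumeration or an unordered list is a paragraph, even though its entries can each be delimited by two newlines. A heading or a title can also be followed by two newlines, even though it is clearly not a paragraph. Also consider the case of a single paragraph in a file, followed by one newline or none.
- Sentences are tricky. You might settle for a period, exclamation mark, or question mark followed by whitespace or end-of-file. It's tricky because sometimes a colon marks the end of a sentence and sometimes it doesn't. Typically, when it does, the next non-whitespace character is a capital letter, in the case of English. But sometimes not; for example, if it's a digit. And sometimes an open parenthesis marks the end of a sentence (but that is arguable, as in this case).
- Words are tricky too. Usually words are delimited by whitespace or punctuation marks. Sometimes a dash delimits a word, sometimes not. That is the case with a hyphen, for example.
For words and sentences, you will probably need to state clearly your definition of a sentence and of a word, and program for that.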
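
The rules of thumb above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not a definitive implementation: the function name count_text is hypothetical, a sentence is assumed to end with '.', '!' or '?' followed by whitespace or end of text, a word may contain internal hyphens or apostrophes, and whitespace is not counted as a character.

```python
import re

def count_text(text):
    """Return rough (paragraphs, sentences, words, characters) counts for text."""
    # Paragraphs: non-empty blocks separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences (assumption): '.', '!' or '?' followed by whitespace or end of text.
    sentences = re.findall(r"[.!?]+(?=\s|$)", text)
    # Words (assumption): runs of word characters, allowing internal '-' and "'".
    words = re.findall(r"\w+(?:[-']\w+)*", text)
    # Characters: everything except whitespace.
    characters = sum(1 for ch in text if not ch.isspace())
    return len(paragraphs), len(sentences), len(words), characters

sample = "Hello world! This is a well-known test.\n\nIs it 3 o'clock? Yes."
print(count_text(sample))
```

Abbreviations such as "e.g." and sentences that end with a colon are miscounted by these rules, which is exactly why the definitions need to be fixed before programming for them.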
Answered 2024-05-05T04:18:14+08:00

Not 100% correct, but I just gave it a try. I have not taken all of @wilhelmtell's points into account. I will try them once I have time...

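if __name__ == "__main__":
    c = w = 0    # character and word counts
    s = 1        # "sentence" count: blocks of text separated by blank lines
    prevIsSentence = False
    with open("1.txt") as f:
        for x in f:
            x = x.strip()
            if x != "":
                words = x.split()
                w = w + len(words)
                c = c + sum(len(word) for word in words)
                prevIsSentence = True
            else:
                # A blank line after text ends the current block.
                if prevIsSentence:
                    s = s + 1
                prevIsSentence = False

    # Undo the extra count for an empty file or a trailing blank line.
    if not prevIsSentence:
        s = s - 1
    print("%d:%d:%d" % (c, w, s))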

Here 1.txt is the file name.

Answered 2024-05-05T04:18:14+08:00

With nltk, you can also use FreqDist (see the O'Reilly book, Ch. 3.1).

In your case:

import nltk
with open('samp.txt', encoding='utf-8') as f:
    raw = f.read()
text = nltk.Text(nltk.word_tokenize(raw))
fdist = nltk.FreqDist(text)
print(fdist.N())  # total number of tokens
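
To make clear what that total means, here is a minimal sketch of the same idea using collections.Counter from the standard library, which FreqDist builds on. It assumes plain whitespace tokenization rather than nltk.word_tokenize, so punctuation stays attached to words and the numbers will differ from what nltk reports; the function name token_counts is hypothetical.

```python
from collections import Counter

def token_counts(text):
    """Return (total_tokens, distinct_tokens) using whitespace tokenization."""
    counts = Counter(text.split())
    total = sum(counts.values())   # comparable to FreqDist.N()
    distinct = len(counts)         # comparable to FreqDist.B()
    return total, distinct

print(token_counts("the cat and the hat"))
```

The total is the figure to use as a word count; the distinct count is the vocabulary size.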

Answered 2024-05-05T04:18:14+08:00

1

There is already a program that counts words and characters: wc.
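
For anyone who wants the same counts from inside Python, the following is a rough sketch of what wc reports for lines, words, and characters. It is an approximation, not a reimplementation of wc: it assumes the file is UTF-8 text and treats any run of non-whitespace as a word. Note that wc does not count sentences.

```python
def wc(path):
    """Return (lines, words, characters) for the file at path, similar to `wc -l -w -m`."""
    lines = words = chars = 0
    # newline="" keeps line endings untranslated so characters are counted as stored.
    with open(path, encoding="utf-8", newline="") as f:
        for line in f:
            lines += line.count("\n")   # wc counts newline characters
            words += len(line.split())  # whitespace-delimited words
            chars += len(line)          # characters, including whitespace
    return lines, words, chars
```

Calling wc("samp.txt") would return the three counts as a tuple.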

Answered 2024-05-05T04:18:14+08:00
