Question

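I have looked everywhere for a solution to this problem but have not found one. I have a large text file that is split into sentences, separated only by "." I need to count how many words each sentence has and write that to a file. I am using a separate file for this part of the code, and so far I have this: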

    tekst = open('father_goriot.txt','r').read()
    tekst = tekst.split('.')

With this I get a variable of type 'list', in which each sentence has its own index. I know that if I write

    print len(tekst[0].split())

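I get the number of words in the first sentence. What I need is some kind of loop to get the number of words in every sentence. After that, I need to write this data to a file as a table with: 1. the index number of the sentence in the text, 2. the number of words in that particular sentence, 3. the number of words in the same sentence in a different text (a translation of the first text, produced by code in a separate file), 4. the number of words the two sentences have in common. Any ideas?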

Answer 1

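After searching for a while and looking for a simpler solution, I stumbled upon some code that gives me part of the result I want: the number of words in each sentence. The result is a list of numbers, and the code looks like this: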

    wordcounts = []
    with open('father_goriot.txt') as f:
       text = f.read()
       sentences = text.split('.')
       for sentence in sentences:
           words = sentence.split(' ')
           wordcounts.append(len(words))

But the numbers are not correct, because they include some extra items. For the first sentence I get 40 instead of 38 words. How can I fix this?
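
One likely cause, assuming the text contains runs of spaces or a space after each period: `split(' ')` cuts at every single space character, so two spaces in a row or a leading space produce empty strings that get counted as words. Calling `split()` with no argument splits on any run of whitespace and never returns empty strings. The sketch below uses a made-up sample string and hypothetical helper names to show the difference; it is an illustration, not a diagnosis of the actual file.

```python
def count_words(sentence):
    # split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines) and never returns empty strings.
    return len(sentence.split())


def count_words_per_sentence(text):
    # Skip fragments without any words, such as the empty string
    # left over after the final '.'.
    return [count_words(s) for s in text.split('.') if s.strip()]


# Made-up sample: note the double space after "The".
sample = "The  quick brown fox jumps. It runs away."

naive = [len(s.split(' ')) for s in sample.split('.')]
fixed = count_words_per_sentence(sample)

print(naive)  # [6, 4, 1]: empty strings are counted as words
print(fixed)  # [5, 3]
```

Keep in mind that dropping empty fragments changes the sentence indexes if the text contains consecutive periods such as "...".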

Answer 2

Just enumerate the whole file (this assumes there is one sentence per line):

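import re

with open('data.txt') as data:
    for line_no, line in enumerate(data, start=1):
        # Split on sentence punctuation and whitespace, then drop the
        # empty strings that re.split leaves at the edges of the line.
        words = [w for w in re.split(r'[!?.\s]+', line) if w]
        print('Sentence at line {0} has {1} words.'.format(line_no, len(words)))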

Answer 3

You need to iterate over the file and read it line by line:

with open('file.txt', 'r') as f:
    for line in f:
        # Do something with the line, for example count its words.
        print(len(line.split()))

Answer 4

To get a list in which each item corresponds to one sentence:

def count_words_per_sentence(filename):
    """
    :type filename: str
    :rtype: list[int]
    """
    with open(filename) as f:
        sentences = f.read().split('.')
    return [len(sentence.split()) for sentence in sentences]

To test how many words two sentences have in common, you should use set operations. For example:

words_1 = sentence_1.split()
words_2 = sentence_2.split()
in_common = set(words_1) & set(words_2)  # set intersection
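
As a small worked example of the set intersection, here is a sketch with two made-up sentences. The helper name `common_words` and the normalization step (lower-casing and stripping surrounding punctuation) are assumptions about what should count as "the same word", not part of the original suggestion.

```python
import string


def common_words(sentence_1, sentence_2):
    def normalize(sentence):
        # Lower-case each word and strip surrounding punctuation so
        # that "cat," and "Cat" are treated as the same word.
        words = (w.strip(string.punctuation).lower() for w in sentence.split())
        return set(w for w in words if w)

    # Set intersection keeps only the words present in both sentences.
    return normalize(sentence_1) & normalize(sentence_2)


shared = common_words("The cat sat on the mat", "A cat, on a mat")
print(len(shared))  # 3 (cat, on, mat)
```

Because sets discard duplicates, this counts distinct shared words; a word that appears twice in both sentences is still counted once.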

For file I/O, look at the csv module and its writer function. Build your rows as a list of lists (have a look at zip), then feed that to the csv writer.

import csv
import itertools

word_counts_1 = count_words_per_sentence(filename_one)
word_counts_2 = count_words_per_sentence(filename_two)
# count_words_in_common_per_sentence is not defined in this answer;
# write it yourself using the set intersection shown earlier.
in_common = count_words_in_common_per_sentence(filename_one, filename_two)
rows = zip(itertools.count(1), word_counts_1, word_counts_2, in_common)
header = [["index", "file_one", "file_two", "in_common"]]
# list() is needed on Python 3, where zip returns an iterator.
table = header + list(rows)

# https://docs.python.org/2/library/csv.html
with open("my_output_file.csv", 'w') as f:
    writer = csv.writer(f)
    writer.writerows(table)
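
The snippet above calls `count_words_in_common_per_sentence` without defining it. One possible sketch follows, assuming sentences are split on '.' exactly as in `count_words_per_sentence` and that sentence N of the first file corresponds to sentence N of the second. The helper `read_sentences` is a hypothetical name introduced only for this illustration.

```python
def read_sentences(filename):
    # Same simple rule as count_words_per_sentence:
    # sentences are separated by '.'.
    with open(filename) as f:
        return f.read().split('.')


def count_words_in_common_per_sentence(filename_one, filename_two):
    # Pair sentences by position; zip stops at the shorter file.
    pairs = zip(read_sentences(filename_one), read_sentences(filename_two))
    # For each pair, count the distinct words present in both sentences.
    return [len(set(a.split()) & set(b.split())) for a, b in pairs]
```

If the two files contain different numbers of sentences, the extra sentences of the longer file are silently ignored, so the resulting table is only as long as the shorter text.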

How to count the words in each sentence of a text in Python
