Question

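I have some Persian files. Each of them contains many sentences, and each sentence is followed by a tab, then a Persian word, then another tab, and then an English word. The English word gives the class of the sentence. I have to count how often each word of the Persian sentences occurs in each class. For example, how many times does the word "دانشگاه" occur in the class "passion", and how many times in the class "salty"? (Some files have more than 2 classes.) The code I wrote only counts the words in the file once. How can I change it so that it returns the word counts as described above? (Hint: I only need counts for the words in the sentences, not for the Persian and English words after the tab.)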

from collections import Counter

corpus = []
with open("T.txt", encoding='utf-8') as f:
    for line in f:
        t = line.strip().split("\t")
        corpus.append (t)
        for row in corpus:
            wordcount = Counter(row[0].split())
        print (wordcount)

https://www.dropbox.com/s/r88hglemg7aot0w/F.txt?dl=0

The result is as shown above. What I want, for all the words, is something like this: passion {"دانشگاه": 1, ...} salty {"دانشگاه": 0, ...}

Answer 1

The following is not the most efficient approach, but it is more explicit about how it works.

from collections import defaultdict

# First pass: collect every Persian word and every class label
vocab = set()
classes = set()
with open("T.txt", encoding='utf-8') as fin:
    for line in fin:
        t = line.strip().split('\t')
        sentences = t[0]
        label = t[2]  # 'class' is a reserved word in Python, so use another name
        classes.add(label)
        for word in sentences.split():
            vocab.add(word)

# Initialise every (class, word) pair to 0 so that absent words show up as 0
class_word_count = defaultdict(dict)
for label in classes:
    for word in vocab:
        class_word_count[label][word] = 0

# Second pass: count each word under the class of its sentence
with open("T.txt", encoding='utf-8') as fin:
    for line in fin:
        t = line.strip().split('\t')
        sentences = t[0]
        label = t[2]
        for word in sentences.split():
            class_word_count[label][word] += 1
print(class_word_count)
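
The same class-to-word counts can probably be collected in a single pass, with the zero entries filled in afterwards. The following is only a minimal sketch: it assumes the three-column, tab-separated layout described in the question, and the function names `count_words_by_class` and `fill_zeros`, the sample lines and the labels "passion" and "salty" are made up for illustration.

```python
from collections import Counter, defaultdict

def count_words_by_class(lines):
    """Return {class_label: Counter({word: count})} for tab-separated lines."""
    counts = defaultdict(Counter)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        sentence, label = fields[0], fields[2]
        # Only the sentence column is counted, not the columns after the tab
        counts[label].update(sentence.split())
    return counts

def fill_zeros(counts):
    """Give every class an explicit entry (possibly 0) for every word seen."""
    vocab = set()
    for counter in counts.values():
        vocab.update(counter)
    return {label: {word: counter[word] for word in sorted(vocab)}
            for label, counter in counts.items()}

# Hypothetical sample data in the assumed "sentence<TAB>word<TAB>class" layout
sample = [
    "دانشگاه خوب است\tدانشگاه\tpassion",
    "دانشگاه دانشگاه\tدانشگاه\tpassion",
    "هوا خوب است\tهوا\tsalty",
]
counts = count_words_by_class(sample)
table = fill_zeros(counts)
print(table)
```

To run it on a real file, pass the open file object instead of the list, for example `count_words_by_class(open("T.txt", encoding="utf-8"))`. A `Counter` already returns 0 for a word it has not seen, so `fill_zeros` is only needed when the zeros have to appear in the printed output.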

Answer 2

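Assuming the file structure is fixed, so that the class is always found at t[2], what remains is only a per-line count rather than an aggregated total. Edit: this code aggregates every word it finds and keeps, for each word, a counter of the classes in which it was found. If a class has no entry in a word's counter, the word does not occur in that class.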

from collections import Counter, defaultdict

wordcount = defaultdict(Counter)
with open("T.txt", encoding='utf-8') as f:
    for line in f:
        t = line.strip().split("\t")
        for word in t[0].split():
            wordcount[word] += Counter([t[2]])
print (wordcount)
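
The structure built above is keyed by word first and by class second, while the question asks for the opposite nesting. As a rough sketch of how it could be queried and turned around (the helper name `invert` and the sample counts below are hypothetical, not part of the answer):

```python
from collections import Counter, defaultdict

def invert(wordcount):
    """Turn {word: Counter({class: n})} into {class: Counter({word: n})}."""
    by_class = defaultdict(Counter)
    for word, per_class in wordcount.items():
        for label, n in per_class.items():
            by_class[label][word] += n
    return by_class

# Hypothetical counts in the same shape the code above produces
wordcount = defaultdict(Counter)
wordcount["دانشگاه"] += Counter(["passion"])
wordcount["دانشگاه"] += Counter(["passion"])
wordcount["خوب"] += Counter(["salty"])

# A class with no entry in a word's Counter simply reads as 0
print(wordcount["دانشگاه"]["salty"])

by_class = invert(wordcount)
print(by_class["passion"])
```

Looking up a missing class on a `Counter` returns 0 without storing anything, which matches the remark that a missing counter means the word does not occur in that class.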

Count words in all classes of a file
