I'm new to Python and have run into a problem. I wrote code to find the total word count and the unique word count for several files (in this case each .txt file is a chapter of a book; sample text from file1: "At what period of life the causes of variability, whatever they may be, generally act; whether during the early or late period of development of the embryo, or at the instant of conception."; sample text from file2: "Finally, varieties have the same general characters as species, for they cannot be distinguished from species, except, firstly, by the discovery of intermediate linking forms").
I can't find any examples online of how to compare the words in different files. I need to determine how many words the files share and how many words are unique to each file (relative to the other). My final output should contain seven numbers: the total word counts of file1 and file2, the unique word counts of file1 and file2, the number of words shared between file1 and file2, the number of words in file1 but not in file2, and finally the number of words in file2 but not in file1. I know I have to use set() for this, but I don't know how.
import glob
from collections import Counter

path = "c-darwin-chapter-?.txt"
wordcount = {}
for filename in glob.glob(path):
    with open("c-darwin-chapter-1.txt", 'r') as f1, open("c-darwin-chapter-2.txt", 'r') as f2:
        f1_word_list = Counter(f1.read().replace(',','').replace('.','').replace("'",'').replace('!','').replace('&','').replace(';','').replace('(','').replace(')','').replace(':','').replace('?','').lower().split())
        print("Total word count per file: ", sum(f1_word_list.values()))
        print("Total unique word count: ", len(f1_word_list))
        f2_word_list = Counter(f2.read().replace(',','').replace('.','').replace("'",'').replace('!','').replace('&','').replace(';','').replace('(','').replace(')','').replace(':','').replace('?','').lower().split())
        print("Total word count per file: ", sum(f2_word_list.values()))
        print("Total unique word count: ", len(f2_word_list))

# if/main commented out, but the final code must use if/main and a loop
# if __name__ == '__main__':
#     main()
Desired output:
Total word count
Chapter1 = 11615
Chapter2 = 4837
Unique word count
Chapter1 = 1991
Chapter2 = 1025
Words in Chapter1 and Chapter2: 623
Words in Chapter1 not in Chapter2: 1368
Words in Chapter2 not in Chapter1: 402
Answer (score: 0)
You can read both files and convert the text into lists/sets. With sets you can use the set operators to compute their intersection/difference:
s.intersection(t)    s & t    new set with elements common to s and t
s.difference(t)      s - t    new set with elements in s but not in t
Here is a reference table for the set operations: Doku 2.x / valid for 3.7 as well
Demo:
file1 = "This is some text in some file that you can preprocess as you " +\
"like. This is some text in some file that you can preprocess as you like."
file2 = "this is other text about animals and flowers and flowers and " +\
"animals but not animal-flowers that has to be processed as well"
# split into list - no .lower().replace(...) - you solved that already
list_f1 = file1.split()
list_f2 = file2.split()
# create sets from the lists (case sensitive)
set_f1 = set(list_f1)
set_f2 = set(list_f2)
print(f"Words: {len(list_f1)} vs {len(list_f2)} Unique {len(set_f1)} vs {len(set_f2)}.")
# difference
print(f"Only in 1: {set_f1-set_f2} [{len(set_f1-set_f2)}]")
# intersection
print(f"In both {set_f1&set_f2} [{len(set_f1&set_f2)}]")
# difference the other way round
print(f"Only in 2:{set_f2-set_f1} [{len(set_f2-set_f1)}]")
Output:
Words: 28 vs 22 Unique 12 vs 18.
Only in 1: {'like.', 'in', 'you', 'can', 'file', 'This', 'preprocess', 'some'} [8]
In both {'is', 'that', 'text', 'as'} [4]
Only in 2:{'animals', 'not', 'but', 'animal-flowers', 'to', 'processed',
'has', 'be', 'and', 'well', 'this', 'about', 'other', 'flowers'} [14]
You are already handling the file reading and "unifying" the text to lowercase - I omitted that here. The output uses the f-string interpolation syntax available since Python 3.6: see PEP 498.
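Putting the pieces together with the glob loop and the `if __name__ == '__main__'` guard the question asks for, here is a minimal sketch. The filename pattern is taken from the question; the helper names (`count_words`, `compare`) are my own, and `str.translate` with `string.punctuation` strips punctuation slightly differently from a chain of `replace()` calls (it also removes apostrophes and hyphens inside words):

```python
import glob
import string
from collections import Counter

def count_words(text):
    # Lowercase, strip punctuation, split on whitespace.
    # Note: unlike the replace() chain above, this also removes
    # apostrophes and hyphens inside words.
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return Counter(cleaned.split())

def compare(counts1, counts2):
    # The seven numbers the question asks for, via set algebra.
    words1, words2 = set(counts1), set(counts2)
    return {
        'total1': sum(counts1.values()),   # total word count, file 1
        'total2': sum(counts2.values()),   # total word count, file 2
        'unique1': len(words1),            # unique words, file 1
        'unique2': len(words2),            # unique words, file 2
        'shared': len(words1 & words2),    # words in both files
        'only1': len(words1 - words2),     # words only in file 1
        'only2': len(words2 - words1),     # words only in file 2
    }

def main():
    # Glob pattern taken from the question; sorted() makes the
    # chapter order deterministic.
    counters = []
    for filename in sorted(glob.glob("c-darwin-chapter-?.txt")):
        with open(filename, 'r') as f:
            counters.append(count_words(f.read()))
    if len(counters) < 2:
        print("Need at least two matching chapter files.")
        return
    stats = compare(counters[0], counters[1])
    print("Total word count")
    print(f"Chapter1 = {stats['total1']}")
    print(f"Chapter2 = {stats['total2']}")
    print("Unique word count")
    print(f"Chapter1 = {stats['unique1']}")
    print(f"Chapter2 = {stats['unique2']}")
    print(f"Words in Chapter1 and Chapter2: {stats['shared']}")
    print(f"Words in Chapter1 not in Chapter2: {stats['only1']}")
    print(f"Words in Chapter2 not in Chapter1: {stats['only2']}")

if __name__ == '__main__':
    main()
```

Keeping the counting and the comparison in separate functions means the set logic can be tested on small strings without touching the file system.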