How to loop over multiple files in Python and count the matching words

Time: 2018-08-11 11:25:53

Tags: python-3.x

I have two text files and two lists (FIRST_LIST, SCND_LIST). For each file, I want to count how many of its words match the words in FIRST_LIST and in SCND_LIST, respectively.

FIRST_LIST =

"accessorizes","accessorizing","accessorized","accessorize"

SCND_LIST =

"accessorize","accessorized","accessorizes","accessorizing"

Text File1 contains:

  

This is a very good question, and you have received good answers which describe interesting topics.

Text File2 contains:

  

is more applied,using accessorize accessorized,accessorizes,accessorizing

Expected output

File1 first list count=2
File1 second list count=0

File2 first list count=0
File2 second list count=4

This is the code I tried in order to achieve this, but I cannot get the expected output. Any help would be appreciated.

import os 
import glob
files=[]

for filename in glob.glob("*.txt"):
    files.append(filename)


# remove punctuation
import re

def remove_punctuation(line):
    return re.sub(r'[^\w\s]', '', line)

two_files=[]
for filename in files:
    for line in open(filename):
        #two_files.append(remove_punctuation(line))
        print(remove_punctuation(line),end='')
        two_files.append(remove_punctuation(line))

FIRST_LIST = "accessorizes","accessorizing","accessorized","accessorize"

SCND_LIST="accessorize","accessorized","accessorizes","accessorizing"

c=[]
for match in FIRST_LIST:
    if any(match in value for value in two_files):
        #c=match+1
        print (match)
        c.append(match)
print(c)
len(c)
d=[]
for match in SCND_LIST:
    if any(match in value for value in two_files):
        #c=match+1
        print (match)
        d.append(match)
print(d)
len(d)

1 answer:

Answer 0: (score: 2)

Using Counter and a few list comprehensions is one of many different ways to solve your problem.

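I assume your example output is wrong, because some words appear in both lists and in both files but were not counted. In addition, I added a second line to each example string to show how this works with multi-line strings, which is probably the typical content of a given file.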
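The point about the two lists can be checked directly. The following is only an illustrative sketch, not part of the original answer: the helper name `count_listed` is made up here, and it tokenizes with a regular expression instead of the `replace`/`split` approach used in the answer's code. It counts whole-word matches in the first sample line against each list. Because both lists contain the same four words, the two counts should always agree:

```python
import re

list_a = ["accessorizes", "accessorizing", "accessorized", "accessorize"]
list_b = ["accessorize", "accessorized", "accessorizes", "accessorizing"]

# First sample line, as used in the answer's example string
sample = ("This is a very good question, and you have received good answers "
          "which describe interesting topics accessorized accessorize.")


def count_listed(text, wanted):
    """Count whole-word occurrences in `text` of any word in `wanted`."""
    wanted_set = set(wanted)
    # \w+ yields runs of word characters, so punctuation never sticks to a word
    tokens = re.findall(r"\w+", text)
    return sum(1 for token in tokens if token in wanted_set)


count_a = count_listed(sample, list_a)
count_b = count_listed(sample, list_b)
print(set(list_a) == set(list_b))  # True: same words, different order
print(count_a, count_b)            # 2 2
```

Matching whole tokens rather than testing `match in value` on a full line also avoids counting "accessorize" as a hit inside "accessorized".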

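The io.StringIO objects simulate your files here, but using real files from the file system works exactly the same way, because both provide a file-like object (a file-like interface):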

import io
from collections import Counter

list_a = ["accessorizes", "accessorizing", "accessorized", "accessorize"]
list_b = ["accessorize", "accessorized", "accessorizes", "accessorizing"]

# added a second line to each string to demonstrate multi-line input
file_contents_a = 'This is a very good question, and you have received good answers which describe interesting topics accessorized accessorize.\nThis is the second line in file a'
file_contents_b = 'is more applied,using accessorize accessorized,accessorizes,accessorizing\nThis is the second line in file b'

# using io.StringIO to simulate a file input (--> file-like object)
# you should use `with open(filename) as ...` for real file input
file_like_a = io.StringIO(file_contents_a)
file_like_b = io.StringIO(file_contents_b)

# read file contents and split lines into a list of strings
lines_of_file_a = file_like_a.read().splitlines()
lines_of_file_b = file_like_b.read().splitlines()

# iterate through all lines of each file (for file a here)
for line_number, line in enumerate(lines_of_file_a):
    words = line.replace('.', ' ').replace(',', ' ').split(' ')
    c = Counter(words)
    in_list_a = sum([v for k,v in c.items() if k in list_a])
    in_list_b = sum([v for k,v in c.items() if k in list_b])
    print("Line {}".format(line_number))
    print("- in list a {}".format(in_list_a))
    print("- in list b {}".format(in_list_b))


# iterate through all lines of each file (for file b here)
for line_number, line in enumerate(lines_of_file_b):
    words = line.replace('.', ' ').replace(',', ' ').split(' ')
    c = Counter(words)
    in_list_a = sum([v for k,v in c.items() if k in list_a])
    in_list_b = sum([v for k,v in c.items() if k in list_b])
    print("Line {}".format(line_number))
    print("- in list a {}".format(in_list_a))
    print("- in list b {}".format(in_list_b))    


# actually, your two lists are the same
lists_are_equal = sorted(list_a) == sorted(list_b)
print(lists_are_equal)
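To get one total per file in the format the question asks for, the per-line counting above can be folded into a loop over real files. The following is a minimal sketch under some assumptions: the function names `count_matches` and `report` are made up for illustration, the files are assumed to be UTF-8 `*.txt` files in one directory, and matching is case-sensitive on whole words.

```python
import re
from collections import Counter
from pathlib import Path

FIRST_LIST = ["accessorizes", "accessorizing", "accessorized", "accessorize"]
SCND_LIST = ["accessorize", "accessorized", "accessorizes", "accessorizing"]


def count_matches(path, wanted):
    """Return how many whole words in the file at `path` appear in `wanted`."""
    wanted_set = set(wanted)
    with open(path, encoding="utf-8") as handle:
        # Count every word token in the file, across all lines
        counts = Counter(re.findall(r"\w+", handle.read()))
    return sum(n for word, n in counts.items() if word in wanted_set)


def report(directory="."):
    """Print and return per-file totals for every *.txt file in `directory`."""
    results = {}
    for path in sorted(Path(directory).glob("*.txt")):
        first = count_matches(path, FIRST_LIST)
        second = count_matches(path, SCND_LIST)
        results[path.name] = (first, second)
        print("{} first list count={}".format(path.stem, first))
        print("{} second list count={}".format(path.stem, second))
    return results
```

Calling `report()` in the folder that holds the text files prints two lines per file. Since both lists contain the same words, the two counts for a given file will be equal.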