I'm taking an introductory Python course and am currently on Python 3.7.1. I have six text files: file_a.txt, file_b.txt, file_c.txt, file_d.txt, file_e.txt, and stop_words.txt.
I have to compare files 'a' through 'e' and find the words that appear in all of them, then write the resulting words to a new file ('compare_out.txt'). However, none of the words in stop_words.txt are allowed to appear in compare_out.txt.
I'm pretty lost, since I'm a beginner at coding. Feel free to be as tedious as you like, as long as the problem gets solved.
Here's what I have so far. I tried working with just file_a to see what I could do, but the code only prints the last word of the text file. I know I should use \n to make it prettier, but that seems to mess up the code. The same thing happens if I leave out encoding='utf-8' from each open() call:
import os

os.chdir(#path)

with open('file_a.txt', 'r', encoding='utf-8') as a, open('file_b.txt', 'r', encoding='utf-8') as b, open('file_c.txt', 'r', encoding='utf-8') as c, open('file_d.txt', 'r', encoding='utf-8') as d, open('file_e.txt', 'r', encoding='utf-8') as e:
    lines_a = a.readlines()
    for line in lines_a:
        words_a = line.split()
        for word in words_a:
            ufil = open('compare_out.txt', 'w', encoding='utf-8')
            ufil.write(word)
            ufil.close()
Thanks in advance, and apologies if this has already been answered somewhere. I've spent the last few days searching for anything this involved.
Answer 0 (score: 0)
_all = []
with open('file_a.txt', 'r', encoding='utf-8') as a:
    a_list = a.read().split()
    _all.append(a_list)
with open('file_b.txt', 'r', encoding='utf-8') as b:
    b_list = b.read().split()
    _all.append(b_list)
with open('file_c.txt', 'r', encoding='utf-8') as c:
    c_list = c.read().split()
    _all.append(c_list)
with open('file_d.txt', 'r', encoding='utf-8') as d:
    d_list = d.read().split()
    _all.append(d_list)
with open('file_e.txt', 'r', encoding='utf-8') as e:
    e_list = e.read().split()
    _all.append(e_list)

# Start from the first file's words and keep only those found in every other file
result = set(_all[0])
for s in _all[1:]:
    result.intersection_update(s)

with open('compare_out.txt', 'w', encoding='utf-8') as ufill:
    for each in result:
        ufill.write(each + '\n')
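Note that the snippet above finds the common words but never applies the stop-word requirement from the question. A minimal sketch of that last step, assuming stop_words.txt holds whitespace-separated words (the demo writes a tiny stand-in file so it can run on its own):

```python
# Stand-in for the intersection computed above
result = {'apple', 'the', 'banana', 'and'}

# Write a tiny stop-words file just for this demo
with open('stop_words.txt', 'w', encoding='utf-8') as sw:
    sw.write('the and or')

# Read the stop words and subtract them from the result
with open('stop_words.txt', 'r', encoding='utf-8') as sw:
    stop_words = set(sw.read().split())

result -= stop_words
print(sorted(result))  # ['apple', 'banana']
```

Set difference (`-=`) drops every stop word in one step, so no explicit loop over the words is needed.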
Answer 1 (score: 0)
Welcome! First, I think you need to split your program into separable actions. Don't try to do everything at once. Also keep in mind that you don't have to test every word of every file against every other file. Let me explain.
At each step of the algorithm you compare two entities. The first time, you compare file A with file B and put the common words in a list. The second time, the two entities are that list of common words and file C; every word not found in file C is removed from the list. You repeat this for each file until the end.
I tried to write this out but haven't tested it; still, it should give you a first idea:
from nltk.tokenize import word_tokenize
import os

os.chdir(#path)

file_names = ["file_a.txt", "file_b.txt", "and so on"]
common_list = None  # will hold the list of common words

# Build the stop word list from the question's stop_words.txt
with open('stop_words.txt', 'r') as f:
    stop_words = set(f.read().split())

for i in range(1, len(file_names)):
    # If this is the first loop, read the 0th file (file_a.txt) as the left side
    if not common_list:
        with open(file_names[i - 1], 'r') as f:
            # word_tokenize converts the string into a list of words
            left = word_tokenize(f.read().replace('\n', ' '))
    else:  # otherwise, the left side is the common list built so far
        left = common_list
    # The right side is always the next file
    with open(file_names[i], 'r') as f:
        right = word_tokenize(f.read().replace('\n', ' '))
    # Remove stop words from both lists
    left = [w for w in left if w not in stop_words]
    right = [w for w in right if w not in stop_words]
    # Keep only the words that also appear in the right file
    left = [w for w in left if w in right]
    # Put left in common_list for the next loop
    common_list = left

# Write your result to a file
with open('compare_out.txt', 'w') as out:
    out.write('\n'.join(common_list))
Those are the steps.
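The pairwise loop described above can also be condensed using set intersections. This sketch operates on per-file word lists (the names and sample data are purely illustrative) to show the same idea:

```python
from functools import reduce

def common_words(lists_of_words, stop_words):
    """Keep only the words present in every list, then drop the stop words."""
    return reduce(set.intersection, (set(words) for words in lists_of_words)) - set(stop_words)

# e.g. with three small word lists and one stop word:
print(common_words([['a', 'cat', 'sat'], ['a', 'cat', 'ran'], ['a', 'cat']], ['a']))  # {'cat'}
```

`reduce` applies `set.intersection` pairwise, left to right, exactly like the loop: first list vs. second, then that result vs. the third, and so on.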
Answer 2 (score: 0)
Example below. I suggest studying each concept, and if something doesn't make sense, rewrite that part to your liking. Read up on:
string handling, stripping whitespace
import os
#os.chdir(#path) # assume files are in the same directory as the *.py file

def read_words_from_list_of_files(list_of_file_names):
    """From a list of files, return the set of words that appear in every file"""
    # Build one set of words per file (assume words are separated by white space)
    sets_of_words = []
    for file_name in list_of_file_names:
        words_in_this_file = set()
        with open(file_name, 'r', encoding='utf-8') as f:
            for line_read in f:
                line = line_read.strip()
                words_in_this_file.update(line.split(" "))
        sets_of_words.append(words_in_this_file)
    # Intersect the per-file sets so only words common to all files remain
    return set.intersection(*sets_of_words)

FILES_OF_INCLUDED_WORDS = ['file_a.txt', 'file_b.txt', 'file_c.txt', 'file_d.txt', 'file_e.txt']
EXCLUDED_WORDS_FILES = ['stop_words.txt']
OUTPUT_FILE_NAME = 'compare_out.txt'

set_of_words_to_include = read_words_from_list_of_files(FILES_OF_INCLUDED_WORDS)
set_of_words_to_exclude = read_words_from_list_of_files(EXCLUDED_WORDS_FILES)

# Set difference removes every stop word from the result
set_of_remaining_words = set_of_words_to_include - set_of_words_to_exclude

with open(OUTPUT_FILE_NAME, 'w') as f:
    for word in set_of_remaining_words:
        f.write(word + " ") # There will be a space after the last word but maybe this is OK

print(set_of_remaining_words)
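If the trailing space bothers you, one common alternative is to join the words with newlines instead of writing them one by one. A sketch, using a small stand-in set in place of set_of_remaining_words:

```python
remaining = {'apple', 'banana', 'cherry'}  # stands in for set_of_remaining_words

with open('compare_out.txt', 'w', encoding='utf-8') as f:
    # sorted() gives a stable order; join puts one word per line, no trailing separator
    f.write('\n'.join(sorted(remaining)))
```

`str.join` only inserts the separator *between* elements, which is why no stray space or newline appears after the last word.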