Python: print the words that are common to several files, exclude the words from one file, and write the result to a new file?

Date: 2018-12-03 14:39:08

Tags: python text-files

I'm taking an introductory Python course and am currently using Python 3.7.1. I have six text files: file_a.txt, file_b.txt, file_c.txt, file_d.txt, file_e.txt and stop_words.txt.

I have to compare files 'a' through 'e' and find the words that appear in all of them. The resulting words must be written to a new file ('compare_out.txt'); however, none of the words in stop_words.txt may appear in compare_out.txt.

I'm pretty much at a loss, since I'm a complete beginner at coding. Feel free to be as thorough as you like, as long as the problem gets solved.

This is what I have so far. I tried working with just file_a to see what I could do, but the code only writes the last word of the text file. I know I'm supposed to use \n to make the output nicer, but that seems to break the code. The same thing happens if I leave out encoding = 'utf-8' from each open() call:

import os
os.chdir(#path)
with open('file_a.txt', 'r', encoding = 'utf-8') as a, open('file_b.txt', 'r', encoding = 'utf-8') as b, open('file_c.txt', 'r', encoding = 'utf-8') as c, open('file_d.txt', 'r', encoding = 'utf-8') as d, open('file_e.txt', 'r', encoding = 'utf-8') as e:
    lines_a = a.readlines()
    for line in lines_a:
        words_a = line.split()
        for word in words_a:
            ufil = open('compare_out.txt', 'w', encoding = 'utf-8')
            ufil.write(word)
            ufil.close()

Thanks in advance, and my apologies if this has already been answered somewhere. I've spent the last few days searching for anything this involved, without luck.

3 answers:

Answer 0 (score: 0)

_all = []
for file_name in ['file_a.txt', 'file_b.txt', 'file_c.txt', 'file_d.txt', 'file_e.txt']:
    with open(file_name, 'r', encoding = 'utf-8') as f:
        # split() with no argument splits on any whitespace, including newlines
        _all.append(f.read().split())

# Start from the words of the first file and keep only those found in every other file
result = set(_all[0])
for s in _all[1:]:
    result.intersection_update(s)

with open('compare_out.txt', 'w', encoding = 'utf-8') as ufill:
    for each in result:
        ufill.write(each + '\n')
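Note that this does not yet exclude the words from stop_words.txt, which the question also requires. A minimal extra step, assuming stop_words.txt is whitespace-separated like the other files, placed just before the final with open('compare_out.txt', ...) block:

# Load the stop words into a set (assumed to be whitespace-separated words)
with open('stop_words.txt', 'r', encoding = 'utf-8') as f:
    stop_words = set(f.read().split())

# Drop every stop word from the intersection before writing it out
result -= stop_words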

Answer 1 (score: 0)

Welcome! First of all, I think you need to split your program into separate steps; don't try to do everything at once. You should also realize that you don't have to test every word of every file against every other one. Let me explain.

At each step of the algorithm you compare two things. The first time, you compare file A with file B and put the common words into a list. The second time, the two things are that list of common words and file C: every word of the list that does not appear in file C is removed from it. You do this with each remaining file until the end.

I gave it a try below; it is untested, but it should give you a first idea:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import os
os.chdir(#path)

files_names = ["file_a.txt", "file_b.txt", "and so on"]
common_list = None # will hold the list of common words

stop_words = # assuming you have a list of stop words, e.g. stopwords.words('english') or the contents of stop_words.txt

for i in range(1, len(files_names)):
    # If this is the first loop, read element 0 of the list (file_a.txt)
    if common_list is None:
        with open(files_names[i-1], 'r') as f:
            # convert the string into a list of words
            left = word_tokenize(f.read().replace('\n', ' '))
    else: # If not, start from the common list built so far
        left = common_list

    # Read the right-hand file and convert it into a list of words
    with open(files_names[i], 'r') as f:
        right = word_tokenize(f.read().replace('\n', ' '))

    # remove the stop words from both lists
    left = [w for w in left if w not in stop_words]
    right = [w for w in right if w not in stop_words]

    # keep only the words of the common list that also appear in the right file
    left = [w for w in left if w in right]

    # Put left in common_list for the next loop
    common_list = left

# write your result in the file, one word per line
with open('compare_out.txt', 'w') as out:
    out.write('\n'.join(common_list))

Here are the steps:

  • Take file a and file b, turn them into lists, and remove the stop words with nltk
  • Compare the two and put the result into common_list
  • Take file c, turn it into a list, and remove the stop words
  • Remove from common_list the words that are not in file c
  • Do the same with file d and so on, until the end (a compact set-based version is sketched right after this list)
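The same step-by-step idea can be written more compactly with Python sets; a minimal sketch, assuming the file names from the question and plain whitespace-separated words instead of nltk tokenization (the helper words_of is a hypothetical name, not part of the answer above):

# Read one file and return the set of its words
def words_of(file_name):
    with open(file_name, 'r', encoding='utf-8') as f:
        return set(f.read().split())

common = words_of('file_a.txt')
for name in ['file_b.txt', 'file_c.txt', 'file_d.txt', 'file_e.txt']:
    common &= words_of(name)           # keep only the words also present in this file

common -= words_of('stop_words.txt')   # remove the stop words

with open('compare_out.txt', 'w', encoding='utf-8') as out:
    out.write('\n'.join(sorted(common)))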

Answer 2 (score: 0)

An example is below. I recommend studying each concept and, if something doesn't make sense, rewriting that part however you prefer. Read up on the following:

  • for loops
  • data structures: lists [] and set()
  • string handling, stripping whitespace

        import os
        #os.chdir(#path)  # assumes the files are in the same directory as the *.py file
    
        def read_words_from_list_of_files(list_of_file_names):
            """From a list of files returns a set of words contained in the files"""
            # Make a list of words from the file (assume words separated by white space)
            words_list = []
            for file_name in list_of_file_names:
                with open(file_name, 'r', encoding = 'utf-8') as f:
                    for line_read in f:
                        line = line_read.strip()
                        words_in_this_line = line.split()  # split() also handles repeated spaces and tabs
                        words_list += words_in_this_line
            return set(words_list)
    
        FILES_OF_INCLUDED_WORDS = ['file_a.txt', 'file_b.txt', 'file_c.txt', 'file_d.txt',  'file_e.txt']
        EXCLUDED_WORDS_FILES = ['stop_words.txt']
        OUTPUT_FILE_NAME = 'compare_out.txt'
        set_of_words_to_include = read_words_from_list_of_files(FILES_OF_INCLUDED_WORDS)
        set_of_words_to_exclude = read_words_from_list_of_files(EXCLUDED_WORDS_FILES)
        # Make a set to eliminate duplicates in the list
        set_of_remaining_words = set_of_words_to_include - set_of_words_to_exclude
        with open(OUTPUT_FILE_NAME, 'w') as f:
            for word in set_of_remaining_words:
                f.write(word + " ") #There will be a space after the last word but maybe this is OK
        print(set_of_remaining_words)
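
One thing to be aware of: read_words_from_list_of_files pools the words of all five files together (a union), while the question asks for the words that appear in every one of them. A small adjustment, reusing the same helper on one file at a time (a sketch, untested):

        # Build one set per file, then intersect them so only words present in ALL files remain
        sets_per_file = [read_words_from_list_of_files([name]) for name in FILES_OF_INCLUDED_WORDS]
        words_in_every_file = set.intersection(*sets_per_file)
        set_of_remaining_words = words_in_every_file - set_of_words_to_exclude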