Python: print the words that are common to several files, exclude the words from one file, and write the result to a new file?

Date: 2018-12-03 14:39:08

Tags: python text-files

I'm taking an introductory Python course and am currently using Python 3.7.1. I have six text files: file_a.txt, file_b.txt, file_c.txt, file_d.txt, file_e.txt and stop_words.txt.

I have to compare files 'a' through 'e' and find the words that appear in all of them. The resulting words must be written to a new file ('compare_out.txt'); however, none of the words in stop_words.txt may appear in compare_out.txt.

I'm pretty much at a loss, since I'm a complete beginner at coding. Feel free to be as thorough as you like, as long as the problem gets solved.

This is what I have so far. I tried working with just file_a to see what I could do, but the code only writes the last word of the text file. I know I'm supposed to use \n to make the output nicer, but that seems to break the code. The same thing happens if I leave out encoding = 'utf-8' from each open() call:

import os
os.chdir(#path)
with open('file_a.txt', 'r', encoding = 'utf-8') as a, open('file_b.txt', 'r', encoding = 'utf-8') as b, open('file_c.txt', 'r', encoding = 'utf-8') as c, open('file_d.txt', 'r', encoding = 'utf-8') as d, open('file_e.txt', 'r', encoding = 'utf-8') as e:
    lines_a = a.readlines()
    for line in lines_a:
        words_a = line.split()
        for word in words_a:
            ufil = open('compare_out.txt', 'w', encoding = 'utf-8')
            ufil.write(word)
            ufil.close()

Thanks in advance, and my apologies if this has already been answered somewhere. I've spent the last few days searching for anything this involved, without luck.

3 answers:

Answer 0 (score: 0)

_all = []
for file_name in ['file_a.txt', 'file_b.txt', 'file_c.txt', 'file_d.txt', 'file_e.txt']:
    with open(file_name, 'r', encoding = 'utf-8') as f:
        # split() with no argument splits on any whitespace, including newlines
        _all.append(f.read().split())

# Start from the words of the first file and keep only those found in every other file
result = set(_all[0])
for s in _all[1:]:
    result.intersection_update(s)

with open('compare_out.txt', 'w', encoding = 'utf-8') as ufill:
    for each in result:
        ufill.write(each + '\n')
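Note that this does not yet exclude the words from stop_words.txt, which the question also requires. A minimal extra step, assuming stop_words.txt is whitespace-separated like the other files, placed just before the final with open('compare_out.txt', ...) block:

# Load the stop words into a set (assumed to be whitespace-separated words)
with open('stop_words.txt', 'r', encoding = 'utf-8') as f:
    stop_words = set(f.read().split())

# Drop every stop word from the intersection before writing it out
result -= stop_words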

Answer 1 (score: 0)

Welcome! First of all, I think you need to split your program into separate steps; don't try to do everything at once. You should also realize that you don't have to test every word of every file against every other one. Let me explain.

At each step of the algorithm you compare two things. The first time, you compare file A with file B and put the common words into a list. The second time, the two things are that list of common words and file C: every word of the list that does not appear in file C is removed from it. You do this with each remaining file until the end.

I gave it a try below; it is untested, but it should give you a first idea:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import os
os.chdir(#path)

files_names = ["file_a.txt", "file_b.txt", "and so on"]
common_list = None # will hold the list of common words

stop_words = # assuming you have a list of stop words, e.g. stopwords.words('english') or the contents of stop_words.txt

for i in range(1, len(files_names)):
    # If this is the first loop, read element 0 of the list (file_a.txt)
    if common_list is None:
        with open(files_names[i-1], 'r') as f:
            # convert the string into a list of words
            left = word_tokenize(f.read().replace('\n', ' '))
    else: # If not, start from the common list built so far
        left = common_list

    # Read the right-hand file and convert it into a list of words
    with open(files_names[i], 'r') as f:
        right = word_tokenize(f.read().replace('\n', ' '))

    # remove the stop words from both lists
    left = [w for w in left if w not in stop_words]
    right = [w for w in right if w not in stop_words]

    # keep only the words of the common list that also appear in the right file
    left = [w for w in left if w in right]

    # Put left in common_list for the next loop
    common_list = left

# write your result in the file, one word per line
with open('compare_out.txt', 'w') as out:
    out.write('\n'.join(common_list))

Here are the steps:

  • Take file a and file b, turn them into lists, and remove the stop words with nltk
  • Compare the two and put the result into common_list
  • Take file c, turn it into a list, and remove the stop words
  • Remove from common_list the words that are not in file c
  • Do the same with file d and so on, until the end (a compact set-based version is sketched right after this list)
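The same step-by-step idea can be written more compactly with Python sets; a minimal sketch, assuming the file names from the question and plain whitespace-separated words instead of nltk tokenization (the helper words_of is a hypothetical name, not part of the answer above):

# Read one file and return the set of its words
def words_of(file_name):
    with open(file_name, 'r', encoding='utf-8') as f:
        return set(f.read().split())

common = words_of('file_a.txt')
for name in ['file_b.txt', 'file_c.txt', 'file_d.txt', 'file_e.txt']:
    common &= words_of(name)           # keep only the words also present in this file

common -= words_of('stop_words.txt')   # remove the stop words

with open('compare_out.txt', 'w', encoding='utf-8') as out:
    out.write('\n'.join(sorted(common)))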

Answer 2 (score: 0)

An example is below. I recommend studying each concept and, if something doesn't make sense, rewriting that part however you prefer. Read up on the following:

  • for loops
  • data structures: lists [] and set()
  • string handling, stripping whitespace

        import os
        #os.chdir(#path)  # assumes the files are in the same directory as the *.py file
    
        def read_words_from_list_of_files(list_of_file_names):
            """From a list of files returns a set of words contained in the files"""
            # Make a list of words from the file (assume words separated by white space)
            words_list = []
            for file_name in list_of_file_names:
                with open(file_name, 'r', encoding = 'utf-8') as f:
                    for line_read in f:
                        line = line_read.strip()
                        words_in_this_line = line.split()  # split() also handles repeated spaces and tabs
                        words_list += words_in_this_line
            return set(words_list)
    
        FILES_OF_INCLUDED_WORDS = ['file_a.txt', 'file_b.txt', 'file_c.txt', 'file_d.txt',  'file_e.txt']
        EXCLUDED_WORDS_FILES = ['stop_words.txt']
        OUTPUT_FILE_NAME = 'compare_out.txt'
        set_of_words_to_include = read_words_from_list_of_files(FILES_OF_INCLUDED_WORDS)
        set_of_words_to_exclude = read_words_from_list_of_files(EXCLUDED_WORDS_FILES)
        # Make a set to eliminate duplicates in the list
        set_of_remaining_words = set_of_words_to_include - set_of_words_to_exclude
        with open(OUTPUT_FILE_NAME, 'w') as f:
            for word in set_of_remaining_words:
                f.write(word + " ") #There will be a space after the last word but maybe this is OK
        print(set_of_remaining_words)
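
One thing to be aware of: read_words_from_list_of_files pools the words of all five files together (a union), while the question asks for the words that appear in every one of them. A small adjustment, reusing the same helper on one file at a time (a sketch, untested):

        # Build one set per file, then intersect them so only words present in ALL files remain
        sets_per_file = [read_words_from_list_of_files([name]) for name in FILES_OF_INCLUDED_WORDS]
        words_in_every_file = set.intersection(*sets_per_file)
        set_of_remaining_words = words_in_every_file - set_of_words_to_exclude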