Question

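I can't write anything to my output file (word_count.txt).

I want the script to go through all 500 phrases in my phrases.txt document and output a list of all the words along with the number of times each one occurs.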

    from re import findall,sub
    from os import listdir
    from collections import Counter

    # path to folder containing all the files
    str_dir_folder = '../data'

    # name and location of output file
    str_output_file = '../data/word_count.txt'

    # the list where all the words will be placed
    list_file_data = '../data/phrases.txt'

    # loop through all the files in the directory
    for str_each_file in listdir(str_dir_folder):
        if str_each_file.endswith('data'):

    # open file and read
    with open(str_dir_folder+str_each_file,'r') as file_r_data:
        str_file_data = file_r_data.read()

    # add data to list
    list_file_data.append(str_file_data)

    # clean all the data so that we don't have all the nasty bits in it
    str_full_data = ' '.join(list_file_data)
    str_clean1 = sub('t','',str_full_data)
    str_clean_data = sub('n',' ',str_clean1)

    # find all the words and put them into a list
    list_all_words = findall('w+',str_clean_data)

    # dictionary with all the times a word has been used
    dict_word_count = Counter(list_all_words)

    # put data in a list, ready for output file
    list_output_data = []
    for str_each_item in dict_word_count:
        str_word = str_each_item
        int_freq = dict_word_count[str_each_item]

        str_out_line = '"%s",%d' % (str_word,int_freq)

        # populates output list
        list_output_data.append(str_out_line)

    # create output file, write data, close it
    file_w_output = open(str_output_file,'w')
    file_w_output.write('n'.join(list_output_data))
    file_w_output.close()

Any help would be great (especially if I could output the 'unique' words in the output list).

Many thanks.

Answer 1

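It would help if we had more information, such as what you have tried and which error messages you received. As kaveh commented above, this code has some major indentation problems. Once I fixed those, I ran into a number of other logic errors. I made some assumptions: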

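list_file_data is assigned '../data/phrases.txt', but there is then a loop over all the files in the directory. Since you don't handle multiple files anywhere else, I removed that logic and referenced the file named in list_file_data (and added a little error handling). If you do want to go through a directory, I suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm).
You named the file 'phrases.txt' but then check whether the file name ends with 'data'. I removed this logic.
You put the data set into a list, but findall works fine on a string and ignores the special characters you were removing by hand. Test it here to make sure: https://regex101.com/
Changed 'w+' to '\w+' (check the link above).
Converting to a list outside the output loop is unnecessary: your dict_word_count is a Counter object, which has an 'iteritems' method (Python 2; on Python 3 it is 'items') for going through each key and value. I also changed the variable name to 'counter_word_count' to be slightly more accurate.
Instead of building the CSV by hand, I imported csv and used the writerow method (with the quoting option).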
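To illustrate the regex point in the list above, here is a minimal sketch. The sample string is made up for the demonstration (it is not taken from the asker's phrases.txt), and the variable names are only illustrative.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for the contents of phrases.txt
sample_text = "the quick\tbrown fox\nthe lazy dog\n"

# r'\w+' matches runs of letters, digits and underscores, so tabs and
# newlines are skipped without any manual clean-up step
words = re.findall(r'\w+', sample_text)

# Without the backslash the pattern only matches the literal letter "w"
literal_w = re.findall('w+', sample_text)

# Counter tallies how often each word occurs
word_counts = Counter(words)
```

With this sample, `words` holds the seven words in order, while `literal_w` only picks up the single "w" inside "brown", which is why the original pattern appeared to find almost nothing.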

Code below, hope this helps:

import csv
import os

from collections import Counter
from re import findall


# name and location of output file
str_output_file = '../data/word_count.txt'
# path to the input file containing the phrases
list_file_data = '../data/phrases.txt'

if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))

with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()
    # find all the words and put them into a list
    list_all_words = findall(r'\w+', str_file_data)
    # dictionary with all the times a word has been used
    counter_word_count = Counter(list_all_words)

    with open(str_output_file, 'w') as output_file:
        fieldnames = ['word', 'freq']
        writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
        writer.writerow(fieldnames)

        # items() works on both Python 2 and 3; iteritems() exists only on Python 2
        for key, value in counter_word_count.items():
            output_row = [key, value]
            writer.writerow(output_row)
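
If you do need to go through a whole directory, as the os.walk() suggestion in this answer hints, something along these lines might work. This is only a sketch written for Python 3: the function name count_words_in_tree and the suffix parameter are my own inventions, not part of the original code.

```python
import os
import re
from collections import Counter


def count_words_in_tree(root_dir, suffix='.txt'):
    """Tally words across every file under root_dir whose name ends with suffix."""
    tally = Counter()
    # os.walk yields (directory path, subdirectory names, file names) for each folder
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in file_names:
            if not file_name.endswith(suffix):
                continue
            # os.path.join avoids the missing-separator bug of str_dir_folder+str_each_file
            with open(os.path.join(dir_path, file_name), 'r') as handle:
                tally.update(re.findall(r'\w+', handle.read()))
    return tally
```

One thing to watch: in the question the output file word_count.txt sits in the same ../data folder and also ends in .txt, so with this sketch you would probably want to write the output somewhere else or skip that file explicitly.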

Answer 2

Something like this?

from collections import Counter
from glob import glob

def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()

tally = sum(
    (Counter(extract_words_from_line(line)) 
        for infile in glob('../data/*.data')
            for line in open(infile)), 
     Counter())

for k in sorted(tally, key=tally.get, reverse=True):
    # a formatted string prints the same on Python 2 and Python 3
    print('%s %d' % (k, tally[k]))
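
For what it's worth, the same idea can be written a bit more explicitly on Python 3. This is a sketch under my own assumptions, not a drop-in replacement: tally_words is a hypothetical helper name, and the glob pattern is simply copied from the code above.

```python
from collections import Counter
from glob import glob


def extract_words_from_line(line):
    # same whitespace split as above; make it as elaborate as you need
    return line.strip().split()


def tally_words(paths):
    """Count words across the given files, closing each file when done."""
    tally = Counter()
    for path in paths:
        with open(path) as infile:
            for line in infile:
                # update() adds to the running tally in place
                tally.update(extract_words_from_line(line))
    return tally


if __name__ == '__main__':
    tally = tally_words(glob('../data/*.data'))
    # most_common() returns (word, count) pairs, highest count first
    for word, freq in tally.most_common():
        print(word, freq)
```

Using Counter.update avoids building a new Counter for every line, and the with block makes sure each file gets closed.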

Unable to get a unique word/phrase counter - Python
