Unable to get a unique word/phrase counter - Python

Time: 2016-01-23 21:16:48

Tags: python shell keyword

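I cannot get anything written to my output file (word_count.txt).

I want the script to go through all 500 phrases in my phrases.txt document and output a list of all the words along with the number of times each one appears.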

    from re import findall,sub
    from os import listdir
    from collections import Counter

    # path to folder containing all the files
    str_dir_folder = '../data'

    # name and location of output file
    str_output_file = '../data/word_count.txt'

    # the list where all the words will be placed
    list_file_data = '../data/phrases.txt'

    # loop through all the files in the directory
    for str_each_file in listdir(str_dir_folder):
        if str_each_file.endswith('data'):

    # open file and read
    with open(str_dir_folder+str_each_file,'r') as file_r_data:
        str_file_data = file_r_data.read()

    # add data to list
    list_file_data.append(str_file_data)

    # clean all the data so that we don't have all the nasty bits in it
    str_full_data = ' '.join(list_file_data)
    str_clean1 = sub('t','',str_full_data)
    str_clean_data = sub('n',' ',str_clean1)

    # find all the words and put them into a list
    list_all_words = findall('w+',str_clean_data)

    # dictionary with all the times a word has been used
    dict_word_count = Counter(list_all_words)

    # put data in a list, ready for output file
    list_output_data = []
    for str_each_item in dict_word_count:
        str_word = str_each_item
        int_freq = dict_word_count[str_each_item]

        str_out_line = '"%s",%d' % (str_word,int_freq)

        # populates output list
        list_output_data.append(str_out_line)

    # create output file, write data, close it
    file_w_output = open(str_output_file,'w')
    file_w_output.write('n'.join(list_output_data))
    file_w_output.close()

Any help would be great (especially if I could output the 'single' words in the output list).

Many thanks.

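2 answers:

Answer 0: (score: 3)

It would help if we had more information, such as what you have tried and which error messages you received. As kaveh commented above, this code has some major indentation problems. Once I fixed those, I ran into a number of other logic errors. I made some assumptions: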

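  • list_file_data is assigned '../data/phrases.txt', but there is then a loop through all the files in a directory. Since you have no handling for multiple files anywhere else, I removed that logic and referenced the file named in list_file_data (and added a little error handling). If you do want to walk through a directory, I suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm).
  • You named your file 'phrases.txt' but then check whether the file name ends with 'data'. I removed this logic.
  • You put the data set into a list, but findall works fine on a string and ignores the special characters you were removing by hand. Test it at https://regex101.com/ to make sure.
  • Changed 'w+' to '\w+' (check out the link above).
  • Converting to a list outside the output loop is unnecessary: your dict_word_count is a Counter object, which has an 'iteritems' method for rolling through each key and value. I also changed the variable name to 'counter_word_count' to be slightly more accurate.
  • Instead of generating the CSV by hand, I imported csv and used the writerow method (and the quoting options).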
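To illustrate the two regex points above, here is a minimal sketch. The sample string is made up for illustration and is not taken from the asker's data; it only shows how the two patterns behave on text containing a tab and a newline.

```python
from re import findall

# Hypothetical sample text containing a tab and a newline.
sample = "one\ttwo\nthree www two"

# 'w+' matches runs of the literal letter "w" only.
literal_w = findall('w+', sample)

# r'\w+' matches runs of word characters, so tabs and newlines
# simply act as separators and need no manual cleaning first.
words = findall(r'\w+', sample)

print(literal_w)
print(words)
```

With the raw-string pattern, the separate sub() calls that strip tabs and newlines before matching should no longer be needed.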

The code is below; I hope this helps:

    import csv
    import os

    from collections import Counter
    from re import findall


    # name and location of output file
    str_output_file = '../data/word_count.txt'
    # path to the input file containing the phrases
    list_file_data = '../data/phrases.txt'

    if not os.path.exists(list_file_data):
        raise OSError('File {} does not exist.'.format(list_file_data))

    with open(list_file_data, 'r') as file_r_data:
        str_file_data = file_r_data.read()
        # find all the words and put them into a list
        list_all_words = findall(r'\w+', str_file_data)
        # Counter mapping each word to the number of times it is used
        counter_word_count = Counter(list_all_words)

        with open(str_output_file, 'w') as output_file:
            fieldnames = ['word', 'freq']
            writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
            writer.writerow(fieldnames)

            # iteritems() is Python 2 only; on Python 3 use items()
            for key, value in counter_word_count.iteritems():
                output_row = [key, value]
                writer.writerow(output_row)
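The answer above suggests os.walk() for the case where a whole directory should be processed. A minimal Python 3 sketch of that variant follows. The function names count_words_in_tree and write_counts, and the '.txt' suffix default, are assumptions made for illustration; they do not come from the original post.

```python
import csv
import os
import re
from collections import Counter


def count_words_in_tree(root_dir, suffix='.txt'):
    """Count words in every file under root_dir whose name ends with suffix."""
    counts = Counter()
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in file_names:
            if not file_name.endswith(suffix):
                continue
            with open(os.path.join(dir_path, file_name), 'r') as handle:
                # update() adds the new words to the running totals
                counts.update(re.findall(r'\w+', handle.read()))
    return counts


def write_counts(counts, output_path):
    """Write word/frequency rows as quoted CSV, most frequent first."""
    # newline='' is what the csv module documentation recommends on Python 3
    with open(output_path, 'w', newline='') as output_file:
        writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
        writer.writerow(['word', 'freq'])
        writer.writerows(counts.most_common())
```

A call such as write_counts(count_words_in_tree('../data'), 'word_count.csv') would tie the two together. Note that an output file written inside the scanned folder with a matching suffix would itself be counted on the next run, so it is probably safer to give it a different suffix or location.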

Answer 1: (score: 1)

Something like this?

    from collections import Counter
    from glob import glob

    def extract_words_from_line(s):
        # make this as complicated as you want for extracting words from a line
        return s.strip().split()

    # add up one Counter per line across every matching file
    tally = sum(
        (Counter(extract_words_from_line(line))
            for infile in glob('../data/*.data')
                for line in open(infile)),
        Counter())

    # print words from most to least frequent
    for k in sorted(tally, key=tally.get, reverse=True):
        print('%s %d' % (k, tally[k]))
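The same tally can also be accumulated in place with Counter.update(), which avoids building a new Counter for every addition the way sum() does. The sketch below targets Python 3 and is only one possible variant; the helper tally_lines and the in-memory sample lines are made up for illustration and stand in for the lines of the real input files.

```python
from collections import Counter


def extract_words_from_line(line):
    # Plain whitespace split; swap in something stricter if needed.
    return line.strip().split()


def tally_lines(lines):
    """Accumulate word counts across an iterable of text lines."""
    tally = Counter()
    for line in lines:
        tally.update(extract_words_from_line(line))
    return tally


# Hypothetical in-memory lines standing in for the contents of a file.
sample_lines = ["the cat sat\n", "the dog sat\n", "the end\n"]
tally = tally_lines(sample_lines)

# most_common() yields (word, count) pairs, highest count first.
for word, freq in tally.most_common():
    print(word, freq)
```

Because tally_lines accepts any iterable of lines, an open file object can be passed to it directly.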