Extract items from n-line chunks in a file, count item frequencies per chunk, Python

Time: 2011-08-23 14:01:32

Tags: python text-processing

I have a text file made up of 5-line blocks of tab-separated lines:

 1 \t DESCRIPTION \t SENTENCE \t ITEMS
 1 \t DESCRIPTION \t SENTENCE \t ITEMS
 1 \t DESCRIPTION \t SENTENCE \t ITEMS
 1 \t DESCRIPTION \t SENTENCE \t ITEMS
 1 \t DESCRIPTION \t SENTENCE \t ITEMS
 2 \t DESCRIPTION \t SENTENCE \t ITEMS
 2 \t DESCRIPTION \t SENTENCE \t ITEMS
 2 \t DESCRIPTION \t SENTENCE \t ITEMS
 2 \t DESCRIPTION \t SENTENCE \t ITEMS
 2 \t DESCRIPTION \t SENTENCE \t ITEMS

Within each block, the DESCRIPTION and SENTENCE columns are identical. The data of interest is in the ITEMS column, which differs for each line of the block and has the following format:

word1, word2, word3

...and so on.

For each 5-line block, I need to count the frequency of word1, word2, etc. across the ITEMS column. For example, if the first 5-line block were as follows:

 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
 1 \t DESCRIPTION \t SENTENCE \t word1, word2
 1 \t DESCRIPTION \t SENTENCE \t word4
 1 \t DESCRIPTION \t SENTENCE \t word1, word2, word3
 1 \t DESCRIPTION \t SENTENCE \t word1, word2

then the correct output for this 5-line block would be:

1, SENTENCE, (word1: 4, word2: 4, word3: 2, word4: 1)

That is, the block number, followed by the sentence, followed by the frequency counts of the words.

I have some code that extracts the five-line blocks and computes the word frequencies within a block once it has been extracted, but I'm stuck on isolating each block, getting its word frequencies, moving on to the next block, and so on:

from itertools import groupby

def GetFrequencies(file):
    file_contents = open(file).readlines()  #file as a list of lines
    #use zip to get the entire file as a list of 5-line chunk tuples
    five_line_increments = zip(*[iter(file_contents)]*5)
    for chunk in five_line_increments:  #for each 5-line chunk...
        for sentence in chunk:          #...and for each sentence in that chunk
            words = sentence.split('\t')[3].split() #get the ITEMS column at index 3
            words_no_comma = [x.strip(',') for x in words]  #get rid of the commas
            words_no_ws = [x.strip(' ') for x in words_no_comma] #get rid of the whitespace left over from the removed commas

    #STUCK HERE.  The idea originally was to take the words lists for
    #each chunk and combine them to create a big list, 'collection', and
    #feed this into the for-loop below.

    #collection is a big list containing all of the words in the ITEMS section of
    #the chunk, e.g. ['word1', 'word2', 'word3', 'word1', 'word1', 'word2', ...];
    #note that groupby only groups adjacent equal items, so the list must be sorted first
    for key, group in groupby(sorted(collection)):
        print key, len(list(group)),
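For reference, and not part of the original question, here is one minimal Python 3 sketch that finishes the intended approach: gather each chunk's ITEMS words into a per-chunk `collection` list and feed it, sorted, to `groupby`. The function name and the sample data are invented for illustration:

```python
from itertools import groupby

def get_chunk_frequencies(lines, chunk_size=5):
    """For each chunk_size-line chunk, return (number, sentence, {word: count})."""
    results = []
    for chunk in zip(*[iter(lines)] * chunk_size):
        # All lines in a chunk share the number and sentence, so read them once.
        number, _desc, sentence, _items = chunk[0].rstrip('\n').split('\t')
        collection = []  # all ITEMS words of this chunk, with duplicates
        for line in chunk:
            items = line.rstrip('\n').split('\t')[3]
            collection.extend(w.strip() for w in items.split(','))
        # groupby only merges runs of adjacent equal elements, so sort first.
        counts = {key: len(list(group)) for key, group in groupby(sorted(collection))}
        results.append((number, sentence, counts))
    return results

sample = [
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
    "1\tDESCRIPTION\tSENTENCE\tword4\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
]
print(get_chunk_frequencies(sample))
```

The `sorted` call is what the question's loop was missing; `collections.Counter(collection)` would do the same counting in a single call.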

4 Answers:

Answer 0 (score: 1)

Using Python 2.7:

#!/usr/bin/env python

import collections

chunks = {}

with open('input') as fd:
    for line in fd:
        line = line.split()
        if not line:
            continue
        if line[0] not in chunks:        # has_key() is deprecated; use `in`
            chunks[line[0]] = [line[2]]  # first element of the list is the SENTENCE
        for i in line[3:]:               # count this line's items, including the block's first line
            chunks[line[0]].append(i.replace(',', ''))

for k, v in chunks.iteritems():
    counter = collections.Counter(v[1:])
    print k, v[0], counter

Output:

1 SENTENCE Counter({'word1': 4, 'word2': 4, 'word3': 2, 'word4': 1})

Answer 1 (score: 1)

There is a csv parser in the standard library that can handle the input splitting for you:

import csv
import collections

def GetFrequencies(file_in):
    sentences = dict()
    with open(file_in, 'rb') as f:  # csv.reader itself is not a context manager
        csv_file = csv.reader(f, delimiter='\t')
        for line in csv_file:
            sentence = line[0]
            if sentence not in sentences:
                sentences[sentence] = collections.Counter()
            sentences[sentence].update([x.strip(' ') for x in line[3].split(',')])
    return sentences
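A Python 3 version of the same idea (a sketch of my own, with the demo file and its contents invented to mirror the question's first chunk) opens the file in text mode with `newline=''`, as the csv module expects, and returns the per-block counters:

```python
import csv
import collections
import os
import tempfile

def get_frequencies(file_in):
    """Map block number -> Counter of item frequencies."""
    blocks = {}
    with open(file_in, newline='') as f:
        for row in csv.reader(f, delimiter='\t'):
            key = row[0]
            blocks.setdefault(key, collections.Counter())
            blocks[key].update(x.strip() for x in row[3].split(','))
    return blocks

# Write a small demo file mirroring the question's first chunk.
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False) as tmp:
    tmp.write("1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n"
              "1\tDESCRIPTION\tSENTENCE\tword1, word2\n"
              "1\tDESCRIPTION\tSENTENCE\tword4\n"
              "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n"
              "1\tDESCRIPTION\tSENTENCE\tword1, word2\n")
    path = tmp.name

freqs = get_frequencies(path)
os.unlink(path)  # clean up the demo file
print(freqs['1'])
```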

Answer 2 (score: 0)

To summarize: you want to append all the "words" to a collection as long as they aren't "DESCRIPTION" or "SENTENCE"? Try this:

for word in words_no_ws:
    if word not in ("DESCRIPTION", "SENTENCE"):
        collection.append(word)

Answer 3 (score: 0)

Editing your code a bit, I think it does what you want:

file_contents = open(file).readlines()  #file as a list of lines
#use zip to get the entire file as a list of 5-line chunk tuples
five_line_increments = zip(*[iter(file_contents)]*5)
for chunk in five_line_increments:  #for each 5-line chunk...
    word_freq = {} #word frequencies for this chunk
    for sentence in chunk:          #...and for each sentence in that chunk
        words = sentence.split('\t')[3].strip('\n').split(', ') #get the ITEMS column at index 3 and split it into a list of words
        for word in words:
            if word not in word_freq:
                word_freq[word] = 1
            else:
                word_freq[word] += 1

    print word_freq

Output:

{'word4': 1, 'word1': 4, 'word3': 2, 'word2': 4}
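As a footnote to this answer (my addition, not part of the original thread), the manual dictionary bookkeeping can be collapsed with `collections.Counter`, using the same parsing on an invented sample chunk:

```python
from collections import Counter

chunk = [
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
    "1\tDESCRIPTION\tSENTENCE\tword4\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2, word3\n",
    "1\tDESCRIPTION\tSENTENCE\tword1, word2\n",
]

word_freq = Counter()
for sentence in chunk:
    # Same parsing as the answer: take column 3, strip the newline, split on ', '.
    word_freq.update(sentence.split('\t')[3].strip('\n').split(', '))

print(dict(word_freq))
```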