Counting strings in a file: some single words, some complete phrases

Asked: 2016-02-05 03:33:09

Tags: python-2.7 text counter word-count

I want to count the occurrences of certain words and names in a file. The code below incorrectly counts "fish and chips" as one case of fish and one case of chips, rather than one case of "fish and chips".

ngh.txt = 'test file with words fish, steak fish chips fish and chips'

import re
from collections import Counter
wanted = '''
"fish and chips"
fish
chips
steak
'''
cnt = Counter()
words = re.findall('\w+', open('ngh.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print cnt

Output:

Counter({'fish': 3, 'chips': 2, 'and': 1, 'steak': 1})

What I want is:

Counter({'fish': 2, 'fish and chips': 1, 'chips': 1, 'steak': 1})

(Ideally, I could get output like this:

fish: 2
fish and chips: 1
chips: 1
steak: 1

3 Answers:

Answer 0 (score: 1)

So this solution works on your test data (plus some additional terms added to the test data, just to be thorough), although it could probably be improved.

The crux of it is to find occurrences of 'and' in the word list, then replace 'and' and its neighbours with a compound word (the neighbours joined with ' and ') and append that back to the list, along with a copy of 'and'.

I also converted the 'wanted' string to a list, so that the 'fish and chips' string is treated as a distinct item.

import re
from collections import Counter

# changed 'wanted' string to a list
wanted = ['fish and chips','fish','chips','steak', 'and']

cnt = Counter()

words = re.findall(r'\w+', open('ngh.txt').read().lower())

for word in words:

    # look for 'and', replace it and neighbours with 'comp_word'
    # slice, concatenate, and append to make new words list

    if word == 'and':
        and_pos = words.index('and')
        comp_word = str(words[and_pos-1]) + ' and ' + str(words[and_pos+1])
        words = words[:and_pos-1] + words[and_pos+2:]
        words.append(comp_word)
        words.append('and')

for word in words:
    if word in wanted:
        cnt[word] += 1

print cnt

The output for your text is:

Counter({'fish': 2, 'and': 1, 'steak': 1, 'chips': 1, 'fish and chips': 1})

As noted in the comments above, it's not clear why you want/expect an ideal output of 2 for fish, 2 for chips, and 1 for fish and chips. I'm assuming it's a typo, since the output above it has 'chips': 1.

Answer 1 (score: 1)

I suggest two algorithms that will work for any patterns and any file. The first algorithm has a running time proportional to (number of characters in the file) * number of patterns.

1> For each pattern, search all the patterns and create a superpattern list. This can be done by matching one pattern, e.g. 'cat', against all the patterns to be searched.

patterns = ['cat', 'cat and dogs', 'cat and fish']
superpattern['cat'] = ['cat and dogs', 'cat and fish']

2> Search for 'cat' in the file; let's say the result is cat_count. 3> Now search for each superpattern of 'cat' in the file and get their counts.

for sp in superpattern['cat']:
    sp_count = match sp in file
    cat_count = cat_count - sp_count
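The superpattern bookkeeping above can be sketched as a runnable whole. This is a rough Python sketch, assuming whole-word regex matching; the function names are illustrative, and it processes patterns longest-first so that nested containment is only subtracted once:

```python
import re

def count_patterns(text, patterns):
    # Raw occurrence count of every pattern in the text
    raw = {p: len(re.findall(r'\b%s\b' % re.escape(p), text))
           for p in patterns}
    # superpattern[p] = longer patterns that contain p as a whole-word substring
    superpattern = {p: [q for q in patterns
                        if q != p and re.search(r'\b%s\b' % re.escape(p), q)]
                    for p in patterns}
    final = {}
    # Longest patterns first, so every superpattern's count is final before use
    for p in sorted(patterns, key=len, reverse=True):
        final[p] = raw[p]
        for sp in superpattern[p]:
            # Each occurrence of sp accounts for this many occurrences of p
            inside = len(re.findall(r'\b%s\b' % re.escape(p), sp))
            final[p] -= inside * final[sp]
    return final
```

On the question's text with patterns ['fish and chips', 'fish', 'chips', 'steak'], this yields {'fish and chips': 1, 'fish': 2, 'chips': 1, 'steak': 1}.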

This is a brute-force general solution. If we arrange the patterns in a trie, we should be able to come up with a linear-time solution: Root -> f -> i -> s -> h -> a, and so on. Now, when you are at the 'h' of fish and you don't get an 'a', increment fish_count and go to root. If you do get an 'a', continue. Any time you get something unexpected, increment the count of the most recently found pattern, then go to root or to some other node (the longest matched prefix that is a suffix of that other node). This is the Aho-Corasick algorithm; you can look it up on Wikipedia or at: http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf

This solution is linear in the number of characters in the file.
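A full Aho-Corasick automaton is more than a few lines, but the longest-match-wins scanning behaviour it provides can be approximated in a single pass with a regex alternation ordered longest-first. This is only a sketch of that scanning idea, not Aho-Corasick itself:

```python
import re
from collections import Counter

def count_longest_matches(text, patterns):
    # Longest alternatives first, so at each position the regex engine
    # prefers the longest pattern, mimicking longest-match scanning
    ordered = sorted(patterns, key=len, reverse=True)
    alternation = '|'.join(re.escape(p) for p in ordered)
    return Counter(re.findall(r'\b(%s)\b' % alternation, text))
```

Applied to the question's text with patterns ['fish and chips', 'fish', 'chips', 'steak'], this returns fish: 2, steak: 1, chips: 1, fish and chips: 1, since each word is consumed by at most one (longest possible) match.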

Answer 2 (score: 1)

Definitions

Wanted item: a string being searched for in the text.

To count wanted items without re-counting them within longer wanted items, first count the number of times each item occurs in the string. Next, go through the wanted items from longest to shortest, and whenever you encounter a smaller wanted item that occurs within a longer wanted item, subtract the number of results for the longer item from the count of the shorter item. For example, suppose your wanted items are "a", "a b", and "a b c", and your text is "a / a / a b / a b c". Searching for each individually yields: {"a": 4, "a b": 2, "a b c": 1}. The desired result is: {"a b c": 1, "a b": #("a b") - #("a b c") = 2 - 1 = 1, "a": #("a") - #("a b c") - #("a b") = 4 - 1 - 1 = 2}.

import re

def get_word_counts(text, wanted):
    counts = {}  # The number of times each wanted item was found

    # Dictionary mapping word lengths onto wanted items
    #  (in the form of a dictionary where keys are wanted items)
    lengths = {}

    # Find the number of times each wanted item occurs
    for item in wanted:
        matches = re.findall('\\b' + item + '\\b', text)

        counts[item] = len(matches)

        l = len(item)  # Length of wanted item

        # No wanted item of the same length has been encountered
        if l not in lengths:
            # Create new dictionary of items of the given length
            lengths[l] = {}

        # Add wanted item to dictionary of items with the given length
        lengths[l][item] = 1

    # Get and sort lengths of wanted items from largest to smallest
    keys = lengths.keys()
    keys.sort(reverse=True)

    # Remove overlapping wanted items from the counts, working from
    #  largest strings to smallest strings
    for i in range(1, len(keys)):
        for j in range(0, i):
            for i_item in lengths[keys[i]]:
                for j_item in lengths[keys[j]]:
                    matches = re.findall('\\b' + i_item + '\\b', j_item)

                    counts[i_item] -= len(matches) * counts[j_item]

    return counts

The following code contains the test cases:

tests = [
    {
        'text': 'test file with words fish, steak fish chips fish and '+
            'chips and fries',
        'wanted': ["fish and chips","fish","chips","steak"]
    },
    {
        'text': 'fish, fish and chips, fish and chips and burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'fish, fish and chips and burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'My fish and chips and burgers. My fish and chips and '+
            'burgers',
        'wanted': ["fish and chips","fish","fish and chips and burgers"]
    },
    {
        'text': 'fish fish fish',
        'wanted': ["fish fish","fish"]
    },
    {
        'text': 'fish fish fish',
        'wanted': ["fish fish","fish","fish fish fish"]
    }
]

for i in range(0,len(tests)):
    test = tests[i]['text']
    print test
    print get_word_counts(test, tests[i]['wanted'])
    print ''

The output is as follows:

test file with words fish, steak fish chips fish and chips and fries
{'fish and chips': 1, 'steak': 1, 'chips': 1, 'fish': 2}

fish, fish and chips, fish and chips and burgers
{'fish and chips': 1, 'fish and chips and burgers': 1, 'fish': 1}

fish, fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 1, 'fish': 1}

My fish and chips and burgers. My fish and chips and burgers
{'fish and chips': 0, 'fish and chips and burgers': 2, 'fish': 0}

fish fish fish
{'fish fish': 1, 'fish': 1}

fish fish fish
{'fish fish fish': 1, 'fish fish': 0, 'fish': 0}