Counting every pair of consecutive words in a text

Time: 2017-04-14 12:33:36

Tags: python python-3.x dictionary n-gram


Input:

Once upon a time a time this upon a


Output:

dictionary {
    'Once upon': 1,
       'upon a': 2,
       'a time': 2,
       'time a': 1,
    'time this': 1,
    'this upon': 1
}


CODE:

def countTuples(path):
    dic = dict()
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            for i in range (0, len(s)-1):
                dic[str(s[i]) + ' ' + str(s[i+1])] += 1
    return dic

I get this error:

File "C:/Users/user/Anaconda3/hw2.py", line 100, in countTuples
    dic[str(s[i]) + ' ' + str(s[i+1])] += 1
TypeError: list indices must be integers or slices, not str

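If I remove the += and just put = 1, everything works, so I suspect the problem occurs when I try to read an entry whose value does not exist yet?

What can I do to fix this?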

3 answers:

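Answer 0 (score: 3)

You can use a defaultdict to make your solution work. With a defaultdict, you specify a default type for the values of the key-value pairs. This lets you apply an augmented assignment such as += 1 to a key that has not been explicitly created yet: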

import codecs
from collections import defaultdict

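def countTuples(path):
    # defaultdict(int) starts every unseen key at 0, so += 1 works on first use
    dic = defaultdict(int)
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            # pair each word with the one that follows it on the same line
            for i in range(0, len(s)-1):
                dic[str(s[i]) + ' ' + str(s[i+1])] += 1
    return dic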

>>> {'Once upon': 1,
     'a time': 2,
     'this upon': 1,
     'time a': 1,
     'time this': 1,
     'upon a': 2}
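
As a minimal sketch of the explanation above, the snippet below contrasts a plain dict with defaultdict(int) and also shows dict.get as an import-free alternative. The names plain, counts_get and counts_dd are illustrative only and do not come from the question. Note that with a plain dict the failing line raises KeyError; the TypeError quoted in the question is what Python reports when the indexed object is a list, so it is only an assumption that dic was a list at that point in the original script.

```python
from collections import defaultdict

words = 'Once upon a time a time this upon a'.split()

# A plain dict has no value to increment for a key it has not seen yet.
plain = {}
try:
    plain['Once upon'] += 1
except KeyError as exc:
    missing_key = exc.args[0]  # the key that was looked up

# dict.get supplies a default of 0 for unseen keys, so no import is needed.
counts_get = {}
for first, second in zip(words, words[1:]):
    key = first + ' ' + second
    counts_get[key] = counts_get.get(key, 0) + 1

# defaultdict(int) creates the missing entry as int() == 0 on first access.
counts_dd = defaultdict(int)
for first, second in zip(words, words[1:]):
    counts_dd[first + ' ' + second] += 1
```

Both approaches should give the same counts; defaultdict simply moves the "start at zero" rule into the container itself.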

Answer 1 (score: 2)

With minimal changes to your code, you can use a defaultdict:

from collections import defaultdict

line = 'Once upon a time a time this upon a'

dic = defaultdict(int)

s = line.split()

for i in range(0, len(s)-1):
    dic[str(s[i]) + ' ' + str(s[i+1])] += 1

This produces:

dic

defaultdict(int,
            {'Once upon': 1,
             'a time': 2,
             'this upon': 1,
             'time a': 1,
             'time this': 1,
             'upon a': 2})

Your function then becomes:

import codecs

def countTuples(path):
    dic = defaultdict(int)
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            for i in range(0, len(s)-1):
                dic[str(s[i]) + ' ' + str(s[i+1])] += 1
    return dic
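
To try the function end to end, one possible check is to write the sample sentence to a temporary file and pass its path in. This is only a sketch: the temporary file, sample_path and result are assumptions made for illustration, and the function body is repeated so that the snippet runs on its own.

```python
import codecs
import os
import tempfile
from collections import defaultdict

def countTuples(path):
    dic = defaultdict(int)
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            for i in range(len(s) - 1):
                dic[s[i] + ' ' + s[i + 1]] += 1
    return dic

# Write the sample sentence to a temporary file (hypothetical test input).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('Once upon a time a time this upon a\n')
    sample_path = tmp.name

try:
    # Convert to a plain dict so the result compares and prints like one.
    result = dict(countTuples(sample_path))
finally:
    os.remove(sample_path)
```

Because the loop works one line at a time, a pair made of the last word of one line and the first word of the next is not counted.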

Answer 2 (score: 2)

It doesn't need to be that hard. Just use a Counter and feed the bigrams to it with zip, for example:

import codecs
from collections import Counter

def countTuples(path):
    dic = Counter()
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            s = line.split()
            # zip(s, s[1:]) yields each word together with its successor
            dic.update('%s %s' % t for t in zip(s, s[1:]))
    return dic
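
To see what the zip step does on its own, here is a small sketch that runs it on the sample sentence without a file. The names words, pairs, bigrams and top_two are illustrative and not part of the answer.

```python
from collections import Counter

words = 'Once upon a time a time this upon a'.split()

# zip(words, words[1:]) pairs each word with its successor and stops
# when the shorter sequence (words[1:]) runs out.
pairs = list(zip(words, words[1:]))

# Counter.update accepts an iterable of hashable items and adds one per item.
bigrams = Counter()
bigrams.update('%s %s' % pair for pair in pairs)

# most_common(n) returns the n entries with the highest counts.
top_two = bigrams.most_common(2)
```

A Counter also returns 0 for a bigram it has never seen instead of raising KeyError, which is the same problem the defaultdict answers solve.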