Question

I have a file of words, one word per line. The file looks like this:

aaa
bob
fff
err
ddd
fff
err

I want to count how often each pair of consecutive words occurs.

For example,

aaa,bob: 1
bob,fff:1
fff,err:2

and so on. I tried this:

f=open(file,'r')
content=f.readlines()
f.close()
dic={}
it=iter(content)
for line in content:
    print line, next(line);
    dic.update({[line,next(line)]: 1})

I got this error:

TypeError: str object is not an iterator

Then I tried using an iterator:

it=iter(content)
for x in it:
    print x, next(x);

I got the same error again. Please help!

Answer 1

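You only need to keep track of the previous line. A file object returns its own iterator, so you don't need iter or readlines at all. Call next once to create the variable prev, then keep updating prev inside the loop: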

from collections import defaultdict

d = defaultdict(int)

with open("in.txt") as f:
    prev = next(f).strip()
    for line in map(str.strip,f): # python2 use itertools.imap
        d[prev, line] += 1
        prev = line

Which will give you:

defaultdict(<type 'int'>, {('aaa', 'bob'): 1, ('fff', 'err'): 2, ('err', 'ddd'): 1, ('bob', 'fff'): 1, ('ddd', 'fff'): 1})
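The same previous-line idea can be tried without a file on disk. The sketch below is only an illustration: the helper name count_pairs and the in-memory list words are assumptions made for this example, not part of the answer above. It also guards against empty input, where the first next call would otherwise raise StopIteration.

```python
from collections import defaultdict

def count_pairs(lines):
    """Count consecutive (previous, current) pairs in any iterable of lines."""
    counts = defaultdict(int)
    it = iter(lines)
    try:
        prev = next(it).strip()
    except StopIteration:
        return counts  # empty input: there are no pairs to count
    for line in map(str.strip, it):
        counts[prev, line] += 1
        prev = line
    return counts

# Hypothetical stand-in for the lines of the question's file.
words = ["aaa\n", "bob\n", "fff\n", "err\n", "ddd\n", "fff\n", "err\n"]
pair_counts = count_pairs(words)

# Print in the "first,second: count" style the question asked for.
for (first, second), n in pair_counts.items():
    print("{},{}: {}".format(first, second, n))
```

Passing an open file object instead of the list should work the same way, since both are iterables of lines.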

Answer 2

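line, like every str, is an iterable, which means it has an __iter__ method. But next works on iterators, which have a __next__ method (in Python 2 it is the next method). When the interpreter executes next(line), it tries to call line.__next__. Since line has no __next__ method, it raises TypeError: str object is not an iterator.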

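Since line is an iterable and has an __iter__ method, we could set it = iter(line). Then it is an iterator with a __next__ method, and next(it) returns the next character in line. But you are looking for the next line in the file, so try something like: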

from collections import defaultdict

dic = defaultdict(int)
with open('file.txt') as f:
    content = f.readlines()
    for i in range(len(content) - 1):
        key = content[i].rstrip() + ',' + content[i+1].rstrip()
        dic[key] += 1

for k,v in dic.items():
    print(k,':',v)

Output (file.txt is the same as the OP's):

err,ddd : 1
ddd,fff : 1
aaa,bob : 1
fff,err : 2
bob,fff : 1
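The iterable/iterator distinction described in this answer can be checked directly in Python 3. This is a small sketch for illustration; the variable names are chosen for the example.

```python
line = "aaa"

# A str is an iterable: it has __iter__ but no __next__.
has_iter = hasattr(line, "__iter__")
has_next = hasattr(line, "__next__")
print(has_iter, has_next)

# Calling next() on the str itself fails with the error from the question.
try:
    next(line)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# iter() turns the str into an iterator, which yields characters, not lines.
chars = iter(line)
first_char = next(chars)
print(first_char)
```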

Answer 3

from collections import Counter
with open(file, 'r') as f:
    content = f.readlines()
result = Counter((a, b) for a, b in zip(content[0:-1], content[1:]))

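The result is a dictionary (a Counter) whose keys are pairs of lines, in order, and whose values are the number of times each pair occurred.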
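Because readlines keeps the trailing newline on each line, the keys above contain '\n'. A possible variant, sketched here with an in-memory list standing in for the file contents (an assumption for the example), strips the lines before pairing them:

```python
from collections import Counter

# Hypothetical stand-in for f.readlines() on the question's file.
content = ["aaa\n", "bob\n", "fff\n", "err\n", "ddd\n", "fff\n", "err\n"]

# Strip newlines so the keys are clean words.
words = [line.strip() for line in content]

# zip stops at the shorter sequence, so words[1:] alone is enough.
result = Counter(zip(words, words[1:]))

print(result.most_common(1))  # [(('fff', 'err'), 2)]
```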

Answer 4

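As others have said, line is a string and so cannot be used with next(). Also, you cannot use a list as a dictionary key, because lists are not hashable. You can use a tuple instead. A simple solution: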

f=open(file,'r')
content=f.readlines()
f.close()

dic={}

for i in range(len(content)-1):
    print(content[i], content[i+1])
    try:
        dic[(content[i], content[i+1])] += 1
    except KeyError:
        dic[(content[i], content[i+1])] = 1

Also note that by using readlines() you keep the '\n' at the end of each line. You may want to strip it first:

content = []
with open(file,'r') as f:
    for line in f:
        content.append(line.strip('\n'))
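To illustrate the point about dictionary keys, the following sketch shows a list being rejected as a key while a tuple of the same two words is accepted. The exact wording of the error message can differ between Python versions, so it is printed rather than assumed.

```python
dic = {}

# A list is mutable and therefore unhashable, so it cannot be a dict key.
try:
    dic[["aaa", "bob"]] = 1
except TypeError as exc:
    list_error = str(exc)
    print(list_error)

# A tuple holding the same two words is hashable and works as a key.
dic[("aaa", "bob")] = 1
dic[("aaa", "bob")] += 1
print(dic)
```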

Answer 5

You can use a deque that holds two lines, together with a Counter:

from collections import Counter, deque

lc=Counter()
d=deque(maxlen=2)
with open(fn) as f:
    d.append(next(f))
    for line in f:
        d.append(line)
        lc+=Counter(["{},{}".format(*[e.rstrip() for e in d])])

>>> lc
Counter({'fff,err': 2, 'ddd,fff': 1, 'bob,fff': 1, 'aaa,bob': 1, 'err,ddd': 1})

You can also capture the pairs with a regex that uses a lookahead:

import re
with open(fn) as f:
    lc=Counter(m.group(1)+','+m.group(2) for m in re.finditer(r"(\w+)\n(?=(\w+))", f.read()))
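The lookahead in that pattern matters because regex matches cannot overlap. The sketch below, using an in-memory string in place of f.read() (an assumption for the example), compares a pattern that consumes both words with one that only peeks at the second word:

```python
import re
from collections import Counter

# Hypothetical stand-in for the contents of the question's file.
text = "aaa\nbob\nfff\nerr\nddd\nfff\nerr\n"

# Without a lookahead each match consumes both words, so every second pair is lost.
consuming = re.findall(r"(\w+)\n(\w+)", text)

# With a lookahead the second word is captured but not consumed,
# so it can start the next match as well.
overlapping = re.findall(r"(\w+)\n(?=(\w+))", text)

lc = Counter(overlapping)
print(len(consuming), len(overlapping))  # 3 6
```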

Answer 6

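Your value x holds a string such as 'ddd', 'ccc', etc. It has no next. next() belongs to iterators and is used to get the next element from an iterator. The correct way to call it is it.next() (Python 2), or next(it), which also works in Python 3.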

it=iter(content)
for x in it:
    print x, it.next();

But once all the elements of the iterator have been consumed, you will get an exception. So you need to catch the StopIteration exception.

for x in it:
    try:
        line, next_line = x, it.next()
        # do your count logic here
    except StopIteration:
        break

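dic.update({[line,next_line]: 1}) will not work, because a list cannot be a dictionary key. Note also that this loop consumes two lines per iteration, so it skips some of the possible pairs.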
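The loop above advances the iterator twice per iteration: once in the for statement and once in the explicit next call. The following Python 3 sketch, with a list standing in for the file's lines (an assumption for the example), compares the pairs that loop sees with the full set of consecutive pairs:

```python
content = ["aaa", "bob", "fff", "err", "ddd", "fff", "err"]

# Pairs seen when both the for loop and next() advance the same iterator.
it = iter(content)
stepped_pairs = []
for x in it:
    try:
        stepped_pairs.append((x, next(it)))
    except StopIteration:
        break  # odd number of lines: the last one has no partner

# All consecutive pairs, for comparison.
all_pairs = list(zip(content, content[1:]))

print(stepped_pairs)
print(len(all_pairs))  # 6
```

Here stepped_pairs holds only three of the six consecutive pairs; pairs such as ('bob', 'fff') never appear.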

Answer 7

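As others have mentioned, you cannot use next on line, which is a string. You can use itertools.tee to create two independent iterators from the file object, then use collections.Counter and zip to build a counter object from the pairs of lines: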

from itertools import tee
from collections import Counter
with open('test.txt') as f:
    # f = (line.rstrip() for line in f) # if you don't want the trailing new lines 
    f, ne = tee(f)
    next(ne)
    print(Counter(zip(f, ne)))

Note that a file object yields lines with a trailing newline, so you can strip the lines if you don't want it.
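The tee approach can be wrapped in a small reusable helper. This is a sketch only: the function name pairwise and the in-memory list lines are assumptions for the example, and the newlines are stripped before pairing so the keys are clean words.

```python
from collections import Counter
from itertools import tee

def pairwise(iterable):
    """Yield consecutive overlapping pairs: (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)  # advance the second iterator; the default avoids an error on empty input
    return zip(a, b)

# Hypothetical stand-in for the lines of the question's file.
lines = ["aaa\n", "bob\n", "fff\n", "err\n", "ddd\n", "fff\n", "err\n"]

# Strip the trailing newline lazily, then count the pairs.
stripped = (line.rstrip("\n") for line in lines)
counts = Counter(pairwise(stripped))
print(counts)
```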
