I need to analyze the Brightkite network and its check-ins. Basically, I have to count the number of distinct users who checked in at each location. The thing is, when I run this code on a small file (just the first 300 lines cut from the original file) it works fine, but if I try to do the same with the original file I get this error:
users.append(columns[4])
IndexError: list index out of range
What could it be?
Here is my code:
from collections import Counter
f = open("b.txt")
locations = []
users = []
for line in f:
    columns = line.strip().split("\t")
    locations.append(columns[0])
    users.append(columns[4])
l = Counter(locations)
ml = l.most_common(10)
print ml
Here is the data structure:
58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411
58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411
58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e
58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e
58188 2010-04-06T06:45:19Z 46.521389 14.854444 ddaa40aaa22411
58188 2008-12-30T15:30:08Z 46.522621 14.849618 58e12bc0d67e11
58189 2009-04-08T07:36:46Z 46.554722 15.646667 ddaf9c4ea22411
58190 2009-04-08T07:01:28Z 46.421389 15.869722 dd793f96a22411
Answer 0 (score: 1)
You should use the csv module and update the Counter as you go:
from collections import Counter
import csv

with open("Brightkite_totalCheckins.txt") as f:
    r = csv.reader(f, delimiter="\t")
    cn = Counter()
    users = []
    for row in r:
        # update the Counter as you go, no need to build another list
        # the location id is row[4], not row[0]
        cn[row[4]] += 1
        # collect the user ids from the first column
        users.append(row[0])
    print(cn.most_common(10))
Output for the full file:
[('00000000000000000000000000000000', 254619), ('ee81ef22a22411ddb5e97f082c799f59', 17396), ('ede07eeea22411dda0ef53e233ec57ca', 16896), ('ee8b1d0ea22411ddb074dbd65f1665cf', 16687), ('ee78cc1ca22411dd9b3d576115a846a7', 14487), ('eefadd1aa22411ddb0fd7f1c9c809c0c', 12227), ('ecceeae0a22411dd831d5f56beef969a', 10731), ('ef45799ca22411dd9236df37bed1f662', 9648), ('d12e8e8aa22411dd90196fa5c210e3cc', 9283), ('ed58942aa22411dd96ff97a15c29d430', 8640)]
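If what you actually need is the number of distinct users per location (which is what the question asks for) rather than the number of check-ins, one variation is to keep a set of user ids per location. This is only a minimal sketch, assuming the same five-column, tab-separated layout and the filename used above:

from collections import defaultdict
import csv

# map each location id (column 4) to the set of user ids (column 0) seen there
users_by_location = defaultdict(set)

with open("Brightkite_totalCheckins.txt") as f:
    for row in csv.reader(f, delimiter="\t"):
        # skip malformed rows such as the '7573' line discussed below
        if len(row) == 5 and row[4]:
            users_by_location[row[4]].add(row[0])

# the ten locations with the most distinct users
top = sorted(users_by_location.items(), key=lambda kv: len(kv[1]), reverse=True)[:10]
print([(loc, len(ids)) for loc, ids in top])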
If you print the lines using repr, you will see that the file is tab-delimited:
'7611\t2009-08-30T11:07:52Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-30T00:15:20Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T20:28:13Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:53:59Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:19:36Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:16:45Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T11:52:32Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
..................
The last line is:
'58227\t2009-01-21T00:24:35Z\t33.833333\t35.833333\t9f6b83bca22411dd85460384f67fcdb0\n'
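To inspect the raw lines yourself, something like this prints the repr of the first few lines (a small sketch, reusing the filename from the code above):

from itertools import islice

with open("Brightkite_totalCheckins.txt") as f:
    # show the raw representation of the first few lines, tabs and all
    for line in islice(f, 5):
        print(repr(line))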
So make sure that last line matches, that you haven't modified the file, and there will be no IndexError. Your code fails because there are some lines that look like '7573\t\t\t\t\n' (the first of them is line number 1909858), so stripping and splitting leaves you with ['7573']. Using csv.reader, however, gives you ['7573', '', '', '', ''].
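You can check that single line directly. A small sketch (feeding csv.reader a one-element list so it parses just that line):

import csv

bad = '7573\t\t\t\t\n'
# strip() removes the trailing tabs and newline first, so only one field is left
print(bad.strip().split("\t"))                  # ['7573']
# csv.reader keeps the empty fields between the tabs
print(next(csv.reader([bad], delimiter="\t")))  # ['7573', '', '', '', '']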
If you actually need a list of ten unique locations, you need to find the values equal to 1:
# generator expression of key/value pairs where the value == 1
# (cn.iteritems() is Python 2; use cn.items() on Python 3)
unique = (tup for tup in cn.iteritems() if tup[1] == 1)
from itertools import islice
# take the first 10 elements from unique
sli = list(islice(unique, 10))
print(sli)
[('2d4920e7273c755704c06f2201832d89', 1), ('a4ef963e84f83133484227465e2113e9', 1), ('474f93a6585111dea018003048c10834', 1), ('413754d668b411de9a19003048c0801e', 1), ('d115daaca22411ddb75a33290983eb13', 1), ('4bac110041ad11de8fca003048c0801e', 1), ('fc706c121ec1f54e0a828548ac5e26b8', 1), ('1bcd0cf0f0bd11ddb822003048c0801e', 1), ('e6ed6c09b8994ed125f3c5ef6c210844', 1), ('493ef9b049cfb2c6c24667a931f1592172074545', 1)]
To get a count of all the unique locations, we can consume the rest of our generator expression with sum, adding 1 for each element, and add that total to the length of the slice we took with islice:
print(sum(1 for _ in unique) + len(sli))
which gives you 426831 unique locations.
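Equivalently, since the counts are already in the Counter, the same number can be obtained in a single pass without the generator bookkeeping (a sketch; use cn.values() on Python 3, cn.itervalues() on Python 2):

# count how many location ids were seen exactly once
print(sum(1 for count in cn.values() if count == 1))

This should give the same 426831.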
Using re.split or a bare str.split won't work, for the obvious reason that splitting on runs of whitespace collapses the empty fields:
In [13]: re.split("\s+", '7573\t\t\t\t\n'.rstrip())
Out[13]: ['7573']
In [14]: '7573\t\t\t\t\n'.rstrip().split()
Out[14]: ['7573']
Answer 1 (score: 0)
The problem is in your data. I checked the data on the site you provided; it is actually not separated by tabs, it is just separated by spaces. I added a few lines to replace the spaces with tabs before splitting the line, and it works now.
from collections import Counter
f = open("b.txt")
locations = []
users = []
for line in f:
    line = line.replace(" ", "\t")
    line = line.replace(" ", "\t")
    line = line.replace(" ", "\t")
    line = line.replace(" ", "\t")
    columns = line.strip().split("\t")
    locations.append(columns[0])
    users.append(columns[4])
l = Counter(locations)
ml = l.most_common(10)
print ml
Note: when you get an error like this, check your data; the error message says there is no element at index 4. I hope this is the error you wanted to solve.