I need to analyze the Brightkite network and its check-ins. Basically, I have to count the number of distinct users who checked in at each location. The thing is, when I run this code on a small file (just the first 300 lines cut from the original file) it works fine, but if I try to do the same with the original file I get this error:
users.append(columns[4])
IndexError: list index out of range
What could it be?
Here is my code:
from collections import Counter
f = open("b.txt")
locations = []
users = []
for line in f:
    columns = line.strip().split("\t")
    locations.append(columns[0])
    users.append(columns[4])
l = Counter(locations)
ml = l.most_common(10)
print ml
Here is the data structure:
58186 2008-12-03T21:09:14Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-30T22:30:12Z 39.633321 -105.317215 ee8b88dea22411
58186 2008-11-28T17:55:04Z -13.158333 -72.531389 e6e86be2a22411
58186 2008-11-26T17:08:25Z 39.633321 -105.317215 ee8b88dea22411
58187 2008-08-14T21:23:55Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:09:38Z 41.257924 -95.938081 4c2af967eb5df8
58187 2008-08-14T07:08:59Z 41.295474 -95.999814 f3bb9560a2532e
58187 2008-08-14T06:54:21Z 41.295474 -95.999814 f3bb9560a2532e
58188 2010-04-06T06:45:19Z 46.521389 14.854444 ddaa40aaa22411
58188 2008-12-30T15:30:08Z 46.522621 14.849618 58e12bc0d67e11
58189 2009-04-08T07:36:46Z 46.554722 15.646667 ddaf9c4ea22411
58190 2009-04-08T07:01:28Z 46.421389 15.869722 dd793f96a22411
Answer 0 (score: 1)
You should use the csv module and update the Counter as you go:
from collections import Counter
import csv

with open("Brightkite_totalCheckins.txt") as f:
    r = csv.reader(f, delimiter="\t")
    cn = Counter()
    users = []
    for row in r:
        # update the Counter as you go, no need to build another list
        # the location id is row[4], not row[0]
        cn[row[4]] += 1
        # collect the user ids from the first column
        users.append(row[0])
    print(cn.most_common(10))
Output for the full file:
[('00000000000000000000000000000000', 254619), ('ee81ef22a22411ddb5e97f082c799f59', 17396), ('ede07eeea22411dda0ef53e233ec57ca', 16896), ('ee8b1d0ea22411ddb074dbd65f1665cf', 16687), ('ee78cc1ca22411dd9b3d576115a846a7', 14487), ('eefadd1aa22411ddb0fd7f1c9c809c0c', 12227), ('ecceeae0a22411dd831d5f56beef969a', 10731), ('ef45799ca22411dd9236df37bed1f662', 9648), ('d12e8e8aa22411dd90196fa5c210e3cc', 9283), ('ed58942aa22411dd96ff97a15c29d430', 8640)]
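If what you actually need is the number of distinct users per location (which is what the question asks for) rather than the number of check-ins, one variation is to keep a set of user ids per location. This is only a minimal sketch, assuming the same five-column, tab-separated layout and the filename used above:

from collections import defaultdict
import csv

# map each location id (column 4) to the set of user ids (column 0) seen there
users_by_location = defaultdict(set)

with open("Brightkite_totalCheckins.txt") as f:
    for row in csv.reader(f, delimiter="\t"):
        # skip malformed rows such as the '7573' line discussed below
        if len(row) == 5 and row[4]:
            users_by_location[row[4]].add(row[0])

# the ten locations with the most distinct users
top = sorted(users_by_location.items(), key=lambda kv: len(kv[1]), reverse=True)[:10]
print([(loc, len(ids)) for loc, ids in top])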
If you print the lines using repr, you will see that the file is tab-delimited:
'7611\t2009-08-30T11:07:52Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-30T00:15:20Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T20:28:13Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:53:59Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:19:36Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T15:16:45Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
'7611\t2009-08-29T11:52:32Z\t53.6\t-2.55\td138ebbea22411ddbd3a4b5ab989b9d0\n'
..................
The last line is:
'58227\t2009-01-21T00:24:35Z\t33.833333\t35.833333\t9f6b83bca22411dd85460384f67fcdb0\n'
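To inspect the raw lines yourself, something like this prints the repr of the first few lines (a small sketch, reusing the filename from the code above):

from itertools import islice

with open("Brightkite_totalCheckins.txt") as f:
    # show the raw representation of the first few lines, tabs and all
    for line in islice(f, 5):
        print(repr(line))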
So make sure that last line matches, that you haven't modified the file, and there will be no IndexError. Your code fails because there are some lines that look like '7573\t\t\t\t\n' (the first of them is line number 1909858), so stripping and splitting leaves you with ['7573']. Using csv.reader, however, gives you ['7573', '', '', '', ''].
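You can check that single line directly. A small sketch (feeding csv.reader a one-element list so it parses just that line):

import csv

bad = '7573\t\t\t\t\n'
# strip() removes the trailing tabs and newline first, so only one field is left
print(bad.strip().split("\t"))                  # ['7573']
# csv.reader keeps the empty fields between the tabs
print(next(csv.reader([bad], delimiter="\t")))  # ['7573', '', '', '', '']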
If you actually need a list of ten unique locations, you need to find the values equal to 1:
# generator expression of key/value pairs where the value == 1
# (cn.iteritems() is Python 2; use cn.items() on Python 3)
unique = (tup for tup in cn.iteritems() if tup[1] == 1)
from itertools import islice
# take the first 10 elements from unique
sli = list(islice(unique, 10))
print(sli)
[('2d4920e7273c755704c06f2201832d89', 1), ('a4ef963e84f83133484227465e2113e9', 1), ('474f93a6585111dea018003048c10834', 1), ('413754d668b411de9a19003048c0801e', 1), ('d115daaca22411ddb75a33290983eb13', 1), ('4bac110041ad11de8fca003048c0801e', 1), ('fc706c121ec1f54e0a828548ac5e26b8', 1), ('1bcd0cf0f0bd11ddb822003048c0801e', 1), ('e6ed6c09b8994ed125f3c5ef6c210844', 1), ('493ef9b049cfb2c6c24667a931f1592172074545', 1)]
To get a count of all the unique locations, we can consume the rest of our generator expression with sum, adding 1 for each element, and add that total to the length of the slice we took with islice:
print(sum(1 for _ in unique) + len(sli))
which gives you 426831 unique locations.
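Equivalently, since the counts are already in the Counter, the same number can be obtained in a single pass without the generator bookkeeping (a sketch; use cn.values() on Python 3, cn.itervalues() on Python 2):

# count how many location ids were seen exactly once
print(sum(1 for count in cn.values() if count == 1))

This should give the same 426831.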
Using re.split or a bare str.split won't work, for the obvious reason that splitting on runs of whitespace collapses the empty fields:
In [13]: re.split("\s+", '7573\t\t\t\t\n'.rstrip())
Out[13]: ['7573']
In [14]: '7573\t\t\t\t\n'.rstrip().split()
Out[14]: ['7573']
Answer 1 (score: 0)
The problem is in your data. I checked the data on the site you provided; it is actually not separated by tabs, it is just separated by spaces. I added a few lines to replace the spaces with tabs before splitting the line, and it works now.
from collections import Counter
f = open("b.txt")
locations = []
users = []
for line in f:
    line = line.replace(" ", "\t")
    line = line.replace(" ", "\t")
    line = line.replace(" ", "\t")
    line = line.replace(" ", "\t")
    columns = line.strip().split("\t")
    locations.append(columns[0])
    users.append(columns[4])
l = Counter(locations)
ml = l.most_common(10)
print ml
Note: when you get an error like this, check your data; the error message says there is no element at index 4. I hope this is the error you wanted to solve.