Question

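I have two files:

metadata.csv: contains an ID, followed by the vendor name, a filename, etc.
hashes.csv: contains an ID, followed by a hash. The ID is essentially a foreign key that relates the file metadata to its hash.

I wrote this script to quickly extract all the hashes associated with a particular vendor. It craps out before it finishes processing hashes.csv: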

stored_ids = []

# this file is about 1 MB
entries = csv.reader(open(options.entries, "rb"))

for row in entries:
  # row[2] is the vendor
  if row[2] == options.vendor:
    # row[0] is the ID
    stored_ids.append(row[0])

# this file is 1 GB
hashes = open(options.hashes, "rb")

# I iteratively read the file here,
# just in case the csv module doesn't do this.
for line in hashes:

  # not sure if stored_ids contains strings or ints here...
  # this probably isn't the problem though
  if line.split(",")[0] in stored_ids:

    # if its one of the IDs we're looking for, print the file and hash to STDOUT
    print "%s,%s" % (line.split(",")[2], line.split(",")[4])

hashes.close()

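This script gets about 2000 entries through hashes.csv before it halts. What am I doing wrong? I thought I was processing it line by line.

PS. The csv files are in the popular HashKeeper format, and the files I am parsing are the NSRL hash sets. http://www.nsrl.nist.gov/Downloads.htm#converter

Update: working solution below. Thanks to everyone who commented!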

entries = csv.reader(open(options.entries, "rb"))   
stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

hashes = csv.reader(open(options.hashes, "rb"))
matches = dict((row[2], row[4]) for row in hashes if row[0] in stored_ids)

for k, v in matches.iteritems():
    print "%s,%s" % (k, v)

Answer 1

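"Craps out" is not a particularly good description. What does it do? Does it swap? Fill up all the memory? Or just eat CPU without appearing to do anything?

However, just for a start, use a dictionary rather than a list for stored_ids. Searching in a dictionary is usually done in O(1) time, while searching in a list is O(n).

Edit: here is a trivial micro-benchmark: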

$ python -m timeit -s "l=range(1000000)" "1000001 in l"
10 loops, best of 3: 71.1 msec per loop
$ python -m timeit -s "s=set(range(1000000))" "1000001 in s"
10000000 loops, best of 3: 0.174 usec per loop

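As you can see, a set (which has the same performance characteristics as a dict) searches a million integers far more than 10000 times faster than the equivalent list: about 0.17 microseconds versus about 71 milliseconds per lookup, a factor of roughly 400,000. Consider doing such a lookup for every line of a 1 GB file and you can see the size of the problem.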
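On Python 3, where range() no longer returns a list, the same comparison can be sketched with the timeit module. The helper name time_lookup and the container size are illustrative choices rather than part of the benchmark above, and absolute timings will differ from machine to machine.

```python
import timeit

def time_lookup(container, needle, number=20):
    """Return the average number of seconds one `needle in container` test takes."""
    total = timeit.timeit(lambda: needle in container, number=number)
    return total / number

n = 200000
ids_list = list(range(n))   # membership test scans the elements: O(n)
ids_set = set(ids_list)     # membership test is a hash lookup: O(1) on average

missing = n + 1             # worst case for the list: the value is absent
list_time = time_lookup(ids_list, missing)
set_time = time_lookup(ids_set, missing)

print("list: %.3g s per lookup" % list_time)
print("set:  %.3g s per lookup" % set_time)
```

The absent value forces the list to be scanned to the end on every test, which is also what happens in the question's script for every line whose ID does not belong to the chosen vendor.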

Answer 2

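This code will die on any line that doesn't have at least 4 commas; for example, it will die on an empty line. If you are sure you don't want to use the csv reader, then at least catch the IndexError raised by line.split(',')[4].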
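A minimal sketch of that guard, written for Python 3. pick_fields is a hypothetical helper name; the field positions 2 and 4 are the ones used in the question's script. Note that, unlike csv.reader, a plain split does not understand quoted fields that contain commas.

```python
def pick_fields(line):
    """Return (field 2, field 4) of a comma-separated line, or None if it is too short."""
    tokens = line.rstrip("\r\n").split(",")
    try:
        return tokens[2], tokens[4]
    except IndexError:
        # Blank or truncated line: it has fewer than five fields.
        return None

print(pick_fields("1,a,file.txt,b,abc123"))  # ('file.txt', 'abc123')
print(pick_fields(""))                       # None
```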

Answer 3

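Please explain what you mean by "stop". Does it hang or quit? Is there any error traceback?

a) It will fail on any line that does not contain ",":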

>>> 'hmmm'.split(",")[2]
Traceback (most recent call last):
  File "<string>", line 1, in <string>
IndexError: list index out of range

b) Why split the line multiple times? Do this instead:

tokens = line.split(",")

if len(tokens) >=5 and tokens[0] in stored_ids:
    print "%s,%s" % (tokens[2], tokens[4])

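c) Create a dict of stored_ids, so that tokens[0] in stored_ids will be fast

d) Wrap your inner code in try/except and see if there are any errors

e) Are you running it from the command line or from some IDE?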
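Points b) to d) can be combined into one diagnostic pass. The sketch below is written for Python 3 and is only an illustration: scan_hashes is a hypothetical name, the sample lines are made up, and the field positions follow the question's script. It records which line numbers raised an exception, so a silent stop turns into something you can inspect.

```python
import io

def scan_hashes(lines, stored_ids):
    """Return (matches, errors) for an iterable of comma-separated lines."""
    wanted = set(stored_ids)                  # c) fast membership tests
    matches, errors = [], []
    for line_number, line in enumerate(lines, 1):
        try:
            tokens = line.rstrip("\r\n").split(",")   # b) split only once
            if tokens[0] in wanted:
                matches.append((tokens[2], tokens[4]))
        except Exception as exc:              # d) record the error instead of dying
            errors.append((line_number, repr(exc)))
    return matches, errors

# Made-up sample data: one good match, one blank line, one truncated line.
sample = io.StringIO("1,x,a.txt,y,hash1\n\n2,x,b.txt,y,hash2\n1,short\n")
matches, errors = scan_hashes(sample, ["1"])
print(matches)
print(errors)
```

On a real run, the first entry in errors points at the line where the original script would have stopped.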

Answer 4

Searching in a list takes O(n), so use a dict instead

stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

or use sets

a=set(row[0] for row in entries if row[2] == options.vendor)
b=set(line.split(",")[0] for line in hashes)
c=a.intersection(b)

In c you will find only the strings that are in both hashes and csv
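A small, self-contained sketch of the intersection on in-memory data, written for Python 3. The sample rows and the vendor name are made up for illustration; only the column positions (ID in column 0, vendor in column 2) follow the question's script.

```python
import csv
import io

# Made-up stand-ins for metadata.csv and hashes.csv.
metadata = io.StringIO("1,a.txt,VendorA\n2,b.txt,VendorB\n3,c.txt,VendorA\n")
hashes = io.StringIO("1,x,a.txt,y,hash1\n2,x,b.txt,y,hash2\n4,x,d.txt,y,hash4\n")

vendor = "VendorA"
a = set(row[0] for row in csv.reader(metadata) if row[2] == vendor)
b = set(row[0] for row in csv.reader(hashes) if row)   # `if row` skips blank lines
c = a.intersection(b)

print(sorted(c))  # ['1']
```

Two trade-offs are worth keeping in mind: b holds every ID from the large file in memory, and c contains IDs only, so printing the file name and hash still needs a dict keyed by ID or a second pass over the hashes file.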

Python stops while iterating over my 1 GB csv file
