I am slowly developing a data-processing application in Python 2.6 (*). My test data set is very small, e.g. 5000 cases, but a million cases are expected in the near future, and I wonder whether my current approach will still be viable under those conditions.

Structure of the problem: I have two csv files, one containing calls (5000 rows, 20 columns) and another containing call details (500 rows, 10 columns). I have to build a third csv file containing all the cases from the "calls" file, enriched with the details that were found. Behind the scenes there is some heavy lifting (merging and restructuring the data in the details list, comparing data across the lists). But I am quite worried about building the output list; at the moment the code looks like this:
    def reduceOutputListToPossibleMatches(outputList, detailsList):
        reducedList = list()
        for outputItem in outputList:
            isFound = False
            for detailsItem in detailsList:
                if detailsItem[14] == outputItem[4]:
                    if isFound:
                        detailsItem[30] = "1"  # ambiguous case
                        # - more than one match was found
                        # "1" is the indicator for true - not a Python boolean,
                        # because SPSS has no support for booleans.
                    isFound = True
            if isFound:
                reducedList.append(detailsItem)
        return reducedList
I think this algorithm will take a very long time, because I have to loop over two large lists. So my question boils down to: how fast are lists in Python, and is there a better alternative? Also: working with a pair of lists is a bit inconvenient, because I have to remember the index position of each column - is there a better option?

* = I call this later from SPSS version 19, which refuses to work with newer versions of Python.
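On the side question about remembering numeric column positions: the standard library's `csv.DictReader` maps each row to a dict keyed by the file's header row, so columns can be addressed by name instead of by index. A minimal sketch with made-up headers and data (`io.StringIO` stands in for an open file; in Python 2.6 you would pass the file object itself):

```python
import csv
import io

# Made-up headers and rows -- the real files define their own columns.
calls_csv = io.StringIO("call_id,caller\nC1,alice\nC2,bob\n")

rows = list(csv.DictReader(calls_csv))
first_id = rows[0]["call_id"]  # addressed by name, not by rows[0][4]
```

This trades a little speed for a lot of readability; the dict-based lookups discussed below work the same way whether the rows are lists or dicts.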
Answer 0: (score: 6)
Building on Elazar's answer, use a dict to avoid the inner loop:
    def reduceOutputListToPossibleMatches(outputList, detailsList):
        details = {}
        for detailsItem in detailsList:
            key = detailsItem[14]
            if key in details:
                details[key][30] = "1"
            else:
                details[key] = detailsItem
        for outputItem in outputList:
            key = outputItem[4]
            if key in details:
                yield details[key]

    res = reduceOutputListToPossibleMatches(outputList, detailsList)
    with open('somefile', 'w') as f:
        f.writelines(res)
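To see what the dict-based version does, here is a toy run of the same indexing idea with shortened rows (key at index 0 and ambiguity flag at index 2 instead of 14 and 30 - columns invented purely for illustration):

```python
def index_and_match(outputList, detailsList, key_col=0, flag_col=2):
    # Build the index once: O(len(detailsList))
    details = {}
    for detailsItem in detailsList:
        key = detailsItem[key_col]
        if key in details:
            details[key][flag_col] = "1"  # duplicate key -> mark as ambiguous
        else:
            details[key] = detailsItem
    # Probe the index: O(len(outputList)), each lookup is O(1) on average
    for outputItem in outputList:
        key = outputItem[key_col]
        if key in details:
            yield details[key]

details = [["A", "x", "0"], ["B", "y", "0"], ["A", "z", "0"]]
output = [["A"], ["C"], ["B"]]
result = list(index_and_match(output, details))
# "A" occurs twice in details, so the kept row comes back flagged "1";
# "C" has no match and is dropped.
```

The two passes replace the nested loops, so the cost grows as O(n + m) instead of O(n * m).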
If you need all the ambiguous rows:
    def reduceOutputListToPossibleMatches(outputList, detailsList):
        details = {}
        for detailsItem in detailsList:
            key = detailsItem[14]
            if key in details:
                details[key].append(detailsItem)
            else:
                details[key] = [detailsItem]
        for outputItem in outputList:
            key = outputItem[4]
            if key in details:
                for item in details[key]:
                    if len(details[key]) > 1:
                        item[30] = "1"
                    yield item

    res = reduceOutputListToPossibleMatches(outputList, detailsList)
    with open('somefile', 'w') as f:
        f.writelines(res)
Answer 1: (score: 2)
I don't think you need to return a `list`. You could do this:
    def reduceOutputListToPossibleMatches(outputList, detailsList):
        for outputItem in outputList:
            isFound = False
            for detailsItem in detailsList:
                if detailsItem[14] == outputItem[4]:  # there was a syntax error here
                    if isFound:
                        detailsItem[30] = "1"
                        break
                    isFound = True
            else:
                yield detailsItem

    res = reduceOutputListToPossibleMatches(outputList, detailsList)
    with open('somefile', 'w') as f:
        f.writelines(res)
But it is still O(n**2), which is slow. Maybe an SQL database (via Django?) is better suited to this task.
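Since `sqlite3` ships in the standard library from Python 2.5 onward, the SQL route does not actually require Django or a database server; a sketch with invented table and column names:

```python
import sqlite3

# Table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (call_id TEXT, caller TEXT)")
conn.execute("CREATE TABLE details (call_id TEXT, info TEXT)")
conn.executemany("INSERT INTO calls VALUES (?, ?)",
                 [("C1", "alice"), ("C2", "bob")])
conn.executemany("INSERT INTO details VALUES (?, ?)",
                 [("C1", "outbound"), ("C1", "dropped")])

# The database performs the join itself; a call with several matching
# detail rows simply produces several output rows.
rows = conn.execute(
    "SELECT c.call_id, d.info"
    " FROM calls c JOIN details d ON c.call_id = d.call_id"
).fetchall()
```

At a million rows, an index on the join column (`CREATE INDEX ... ON details(call_id)`) would keep the lookups fast.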
A slight variation on @Duncan's suggestion:
    from collections import defaultdict

    def reduceOutputListToPossibleMatches(outputList, detailsList):
        details = defaultdict(list)
        for detailsItem in detailsList:
            key = detailsItem[14]
            details[key].append(detailsItem)
        for outputItem in outputList:
            val = details[outputItem[4]]
            if len(val) > 1:
                for item in val:
                    item[30] = "1"
            for item in val:
                yield item  # "yield from val" needs Python 3.3+; 2.6 requires an explicit loop
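A behavioral note on this variant: every row of an ambiguous group is flagged and yielded, not just the one row the first answer keeps. A runnable toy check with shortened rows (key at index 0, flag at index 2, columns invented for illustration):

```python
from collections import defaultdict

def reduce_with_defaultdict(outputList, detailsList, key_col=0, flag_col=2):
    # Group the detail rows by key; defaultdict removes the if/else branch.
    details = defaultdict(list)
    for detailsItem in detailsList:
        details[detailsItem[key_col]].append(detailsItem)
    for outputItem in outputList:
        val = details[outputItem[key_col]]
        if len(val) > 1:
            for item in val:
                item[flag_col] = "1"  # flag every row of an ambiguous group
        for item in val:  # explicit loop instead of "yield from" for Python 2
            yield item

details = [["A", "x", "0"], ["B", "y", "0"], ["A", "z", "0"]]
output = [["A"], ["B"]]
result = list(reduce_with_defaultdict(output, details))
# both "A" rows come back flagged, followed by the single unflagged "B" row
```

One caveat of `defaultdict` here: looking up an unknown output key silently inserts an empty list into `details`, which is harmless for this pass but worth knowing about.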