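Python error-checking script is super slow

Time: 2013-02-13 21:23:19

Tags: python

I have the following program, which has been running for about two hours and probably has about a quarter left to go. My questions are below the code: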

import csv

input_csv = "LOCATION_ID.csv"
input2 = "CITIES.csv"
output_csv = "OUTPUT_CITIES.csv"

with open(input_csv, "rb") as infile:
    input_fields = ("ID", "CITY_DECODED", "CITY", "STATE", "COUNTRY", "SPELL1", "SPELL2", "SPELL3")
    reader = csv.DictReader(infile, fieldnames = input_fields)
    with open(input2, "rb") as infile2:
        input_fields2 = ("Latitude", "Longitude", "City")
        reader2 = csv.DictReader(infile2, fieldnames = input_fields2)
        next(reader2)
        words = []
        for next_row in reader2:
            words.append(next_row["City"])

        with open(output_csv, "wb") as outfile:
            output_fields = ("EXISTS","ID", "CITY_DECODED", "CITY", "STATE", "COUNTRY", "SPELL1", "SPELL2", "SPELL3")
            writer = csv.DictWriter(outfile, fieldnames = output_fields)
            writer.writerow(dict((h,h) for h in output_fields))
            next(reader)
            for next_row in reader:
                search_term = next_row["CITY_DECODED"]

                # I think the problem is here, where I run through every city
                # in "words", even though all I want to know is whether the city
                # in "search_term" exists in "words"
                for item in words:
                    if search_term in words:
                        next_row["EXISTS"] = 1

                writer.writerow(next_row)

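I have a few questions here:

1. Given that input_csv has 14k rows and input2 has only 6k rows, why does this take so long? I know the innermost for loop (starting at "for item in words:") is inefficient (see question 3), but I would like a more intuitive understanding of what is going on behind the scenes, so that I (and hopefully other SO users) can avoid making the same mistake in our other programs.

2. If I want this code to keep running, what happens when I walk away from the computer and it goes into sleep/hibernate? Does the program stop at that point and then resume on its own when the computer is used again? I would really like to know how the compiler interacts with the operating system once it runs a program, and what the computer "going to sleep" means for a Python program.

3. What is a more efficient implementation of this code? I would think this shouldn't take more than a few minutes to run. Am I right?

Thanks a lot!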

1 answer:

Answer 0: (score: 2)

Let's start with one inefficiency I can see right away:

for next_row in reader:
    search_term = next_row["CITY_DECODED"]
    for item in words:
        if search_term in words:
            next_row["EXISTS"] = 1

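That is 14k iterations of the outer for loop. Then there are roughly 6k iterations of the nested for loop on every pass. On top of that, each time you execute if search_term in words there are more iterations still, because it scans words until it finds a match (or reaches the end of the list).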
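To make that count concrete, here is a minimal sketch (not part of the original post) that emulates the linear scan behind search_term in words and tallies the comparisons, with and without the redundant loop. The function name count_steps and the tiny sample lists are made up for illustration.

```python
def count_steps(search_terms, words, redundant_loop):
    """Count element comparisons, emulating `term in words` on a list."""
    steps = 0
    for term in search_terms:
        # The question's code repeats the membership test once per item in words.
        repeats = len(words) if redundant_loop else 1
        for _ in range(repeats):
            # A list membership test scans from the front and stops at the first match.
            for w in words:
                steps += 1
                if w == term:
                    break
    return steps


words = ["Austin", "Boston", "Chicago", "Denver"]   # hypothetical city list
search_terms = ["Austin", "Denver", "Zurich"]       # hypothetical search terms

lean = count_steps(search_terms, words, redundant_loop=False)
wasteful = count_steps(search_terms, words, redundant_loop=True)
print(lean, wasteful)
```

With four words, the redundant loop multiplies the work by four; with 6k words it multiplies it by about 6k.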

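I haven't thought much about what this algorithm is actually doing, but you should at least remove the duplicates from words (i.e. words = list(set(words))).

I was just about to comment on that little for item in words loop. What puzzles me is why it is there at all: item is never used, so the for loop is a big waste of time.

It can most likely be simplified to: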

for next_row in reader:
    search_term = next_row["CITY_DECODED"]
    if search_term in words:
        next_row["EXISTS"] = 1
    writer.writerow(next_row)
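Going one step further than the answer does, the lookup itself can be made cheap by keeping the cities in a set instead of a list, since set membership is a hash lookup on average rather than a scan. The following is only a sketch under assumptions: the helper name mark_existing is hypothetical, the rows are plain dicts standing in for what csv.DictReader yields, and writing 0 for a missing city is my choice (the original code leaves EXISTS blank in that case).

```python
def mark_existing(rows, known_cities):
    """Yield a copy of each row with an EXISTS flag (1 if the city is known, else 0)."""
    known = set(known_cities)  # built once; duplicates disappear automatically
    for row in rows:
        row = dict(row)  # copy so the caller's row is not modified
        # Average-case constant-time lookup instead of scanning a list.
        row["EXISTS"] = 1 if row["CITY_DECODED"] in known else 0
        yield row


# Hypothetical sample data standing in for CITIES.csv and LOCATION_ID.csv.
cities = ["Boston", "Austin", "Boston"]
rows = [
    {"ID": "1", "CITY_DECODED": "Boston"},
    {"ID": "2", "CITY_DECODED": "Bostn"},
]
result = list(mark_existing(rows, cities))
print(result)
```

In the real script, each yielded row could be passed straight to writer.writerow, so the 14k input rows cost about 14k lookups in total.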

So, let's tally up all the iterations you have:

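About 6k iterations for for next_row in reader2: words.append(next_row["City"]).

About 14k iterations of for next_row in reader: multiplied by sum(i, 1, 6000), which comes to roughly 252 billion.

Taking out the extraneous loop gets you down to about 84 million iterations, which is far better.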
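For anyone who wants to check those figures, the arithmetic can be reproduced in a few lines. This takes the estimate above at face value (the round row counts of 14k and 6k, and sum(i, 1, 6000) as the cost of the inner work per row); the variable names are mine.

```python
outer = 14000  # approximate number of rows in LOCATION_ID.csv
inner = 6000   # approximate number of rows in CITIES.csv

# sum(i, 1, 6000) = 1 + 2 + ... + 6000
scan_sum = inner * (inner + 1) // 2

with_redundant_loop = outer * scan_sum   # estimate for the original code
without_redundant_loop = outer * inner   # worst case after removing the extra loop

print(scan_sum)                 # 18003000
print(with_redundant_loop)      # 252042000000
print(without_redundant_loop)   # 84000000
print(with_redundant_loop / without_redundant_loop)  # 3000.5
```

So dropping the extra loop cuts the estimated work by a factor of about 3,000.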