Faster approach to a double for loop when iterating over a large list (18,895 elements)

Asked: 2015-02-26 08:53:22

Tags: python list python-2.7 csv for-loop

Here is the code:

import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    reader = csv.reader(csv_f)
    city_lst = cities.readlines()

    for row in reader:
        for city in city_lst:
            city = city.strip()
            match = re.search((r'\b{0}\b').format(city), row[0])
            if match:
                writer.writerow(row)
                break

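"alcohol_rehab_ltp.csv" has 145 rows and "cities2.txt" has 18,895 lines (which become 18,895 elements when converted to a list). The process takes a while to run. I haven't timed it, but it is perhaps around 5 minutes. Is there something simple (or more complex) that I am overlooking that would make this script run faster? I will be running other .csv files against the large "cities.txt" text file, and those csv files may have as many as 1000 rows. Any ideas on how to speed this up would be appreciated!

Here is the csv file:

Keyword (144),Avg. CPC,Local Searches,Advertiser Competition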
[alcohol rehab san diego],$49.54,90,High
[alcohol rehab dallas],$86.48,110,High
[alcohol rehab atlanta],$60.93,50,High
[free alcohol rehab centers],$11.88,110,High
[christian alcohol rehab centers],–,70,High
[alcohol rehab las vegas],$33.40,70,High
[alcohol rehab cost],$57.37,110,High

Some lines from the text file:

san diego
dallas
atlanta
dallas
los angeles
denver

5 answers:

Answer 0 (score: 2)

I think you can use a set and indexing:

import csv

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    reader = csv.reader(csv_f)
    # make set of all city names, lookups are O(1)
    city_set = {line.rstrip() for line in cities}
    output_list = []
    header = next(reader) # skip header
    for row in reader:
        try:
            # names are either first or last with two words preceding or following 
            # so split twice on whitespace from either direction
            if row[0].split(None,2)[-1].rstrip("]") in city_set or row[0].rsplit(None, 2)[0][1:] in city_set:
                output_list.append(row)
        except IndexError as e:
            print(e,row[0])
    writer.writerows(output_list)

The running time is now O(n) rather than quadratic.
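
The split-based test above relies on the city being the first or last part of the keyword with exactly two other words around it. As a hedged variation on the same set idea, the sketch below looks up every run of consecutive words in the set, so the city can sit anywhere in the keyword. The helper name `find_city` and the `max_words` limit are assumptions made for this illustration, and the sample data is taken from the question.

```python
def find_city(keyword, city_set, max_words=3):
    """Return the first city name found in keyword, or None.

    Every run of 1..max_words consecutive words is looked up in the set,
    so the city may appear anywhere in the keyword.
    """
    words = keyword.strip("[]").split()
    for size in range(max_words, 0, -1):          # try longer names first
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in city_set:             # average O(1) set lookup
                return candidate
    return None


# Sample data taken from the question
city_set = {"san diego", "dallas", "atlanta", "los angeles", "denver"}
keywords = [
    "[alcohol rehab san diego]",
    "[free alcohol rehab centers]",
    "[alcohol rehab dallas]",
]
matches = [find_city(k, city_set) for k in keywords]
print(matches)  # ['san diego', None, 'dallas']
```

The work per row depends only on the number of words in the keyword, not on the number of cities, which is what keeps the whole pass linear in the number of rows.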

Answer 1 (score: 2)

First, as @Shawn Zhang suggested, (r'\b{0}\b').format(c.strip()) can be moved outside the loop, and you can build a list of results to avoid writing to the file on every iteration.

Second, you can try re.compile to compile the regular expressions, which may improve regex performance.
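
A minimal sketch of that second tip, assuming the city names are already loaded into a list: each pattern is compiled exactly once before any row is examined. The helper `row_matches` is a name invented for this example. The re module keeps only a limited cache of recently compiled patterns, so with thousands of distinct patterns it is plausible that calling re.search with pattern strings recompiles them repeatedly, and compiling up front avoids that.

```python
import re

cities = ["san diego", "dallas", "atlanta"]   # sample names from the question

# Compile one pattern per city, once, before looping over any rows.
city_patterns = [re.compile(r"\b{0}\b".format(re.escape(c))) for c in cities]


def row_matches(keyword):
    """Return True if any precompiled city pattern occurs in keyword."""
    return any(p.search(keyword) for p in city_patterns)


print(row_matches("[alcohol rehab dallas]"))  # True
print(row_matches("[alcohol rehab cost]"))    # False
```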

Third, try profiling it to find the bottleneck, e.g. with timeit, or with another profiler such as ica if you have SciPy.
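
As a rough illustration of the timeit suggestion, the sketch below times two candidate strategies on synthetic data. The functions `regex_scan` and `set_lookup` and the repeated sample list are assumptions for this example only, and `set_lookup` checks just the last word as a simplification. Real timings should be taken on the actual files.

```python
import re
import timeit

# Synthetic city list: the question's sample names, repeated to add bulk.
cities = ["san diego", "dallas", "atlanta", "los angeles", "denver"] * 200
keyword = "[alcohol rehab cost]"   # contains no city, so every name is tried
city_set = set(cities)


def regex_scan():
    """One re.search per city name, as in the question's inner loop."""
    return any(re.search(r"\b{0}\b".format(c), keyword) for c in cities)


def set_lookup():
    """Single set membership test on the last word of the keyword."""
    return keyword.strip("[]").split()[-1] in city_set


for fn in (regex_scan, set_lookup):
    seconds = timeit.timeit(fn, number=20)
    print("{0}: {1:.4f}s".format(fn.__name__, seconds))
```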

Also, if the city is always in the first column, and I assume that column is named "City", why not use csv.DictReader() to read the csv? I believe it is faster than a regex.
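
For reference, a small sketch of what csv.DictReader provides, written for Python 3 and using an in-memory sample instead of the real file. The header text "Keyword (144)" is an assumption reconstructed from the question's sample. DictReader only gives access to columns by name; locating the city inside the keyword text still needs a separate lookup.

```python
import csv
import io

# In-memory stand-in for the CSV file; the header names are assumed.
sample = io.StringIO(
    "Keyword (144),Avg. CPC,Local Searches,Advertiser Competition\n"
    "[alcohol rehab san diego],$49.54,90,High\n"
    "[alcohol rehab cost],$57.37,110,High\n"
)

reader = csv.DictReader(sample)          # first line is used as field names
keywords = [row["Keyword (144)"] for row in reader]
print(keywords)  # ['[alcohol rehab san diego]', '[alcohol rehab cost]']
```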

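Edit

Since you provided a sample of the files, I got rid of re (it looks like you don't really need it) and got a speedup of more than 10x with the following code: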

import csv

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    output_list = []
    reader = csv.reader(csv_f)
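    # strip each city name once, outside the row loop
    city_lst = [city.strip() for city in cities.readlines()]

    for row in reader:
        for city in city_lst:
            # plain substring test: no word-boundary check as in the regex version
            if city in row[0]:
                output_list.append(row)
                break  # stop at the first hit so a row is written only once
    writer.writerows(output_list)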

Answer 2 (score: 2)

Build a single regular expression from all the city names:

city_re = re.compile(r'\b('+ '|'.join(c.strip() for c in cities.readlines()) + r')\b')

Then do:

for row in reader:
    match = city_re.search(row[0])
    if match:
        writer.writerow(row)

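This reduces the number of Python-level loop iterations from 145 x 18,895 to just 145 (one per CSV row), leaving the regex engine to do the matching against the 18,895 city-name alternatives.

For your convenience and testing, here is the complete listing: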

import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    reader = csv.reader(csv_f)

    city_re = re.compile(r'\b('+ '|'.join(c.strip() for c in cities.readlines()) + r')\b')

    for row in reader:
        match = city_re.search(row[0])
        if match:
            writer.writerow(row)
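
One caveat with joining raw lines into a pattern: a name containing a regex metacharacter (such as the dot in a name like "st. louis") is interpreted as a pattern, and a blank line in the city file adds an empty alternative. The sketch below is a hedged variant that escapes each name with re.escape, drops blank lines and duplicates, and puts longer names first. The `raw_lines` list is hypothetical data standing in for cities.readlines().

```python
import re

# Hypothetical lines standing in for cities.readlines()
raw_lines = ["san diego\n", "dallas\n", "\n", "st. louis\n", "dallas\n"]

# Drop blanks and duplicates; longer names first so they are tried first.
names = sorted({line.strip() for line in raw_lines if line.strip()},
               key=len, reverse=True)

# Escape each name so characters like "." are matched literally.
city_re = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")

for keyword in ["[alcohol rehab st. louis]", "[rehab stx louis]", "[alcohol rehab cost]"]:
    m = city_re.search(keyword)
    print(keyword, "->", m.group(0) if m else None)
```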

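Answer 3 (score: 1)

Even though I don't think the loop/IO is the big bottleneck, you can still try starting with them.

I can offer two tips: (r'\b{0}\b').format(c.strip()) can be moved outside the loop, which will improve performance somewhat, since we no longer have to strip() and format on every iteration.

Also, you don't have to write the output on every iteration. Instead, you can create a results list, output_list, to hold the results during the loop and write them once after the loop.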

import csv
import re
import datetime

start = datetime.datetime.now()

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
    open('cities2.txt', 'rb') as cities, \
    open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter = ",")
    reader = csv.reader(csv_f)
    # build each pattern string once, outside the row loop
    city_lst = [(r'\b{0}\b').format(c.strip()) for c in cities.readlines()]
    output_list = []
    for row in reader:
        for city in city_lst:
            match = re.search(city, row[0])
            if match:
                output_list.append(row)
                break
    writer.writerows(output_list)

end = datetime.datetime.now()

print(end - start)

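Answer 4 (score: 1)

Note that I think you can find the city in a row in a better way than with re.search, since the cities will usually be separated by whitespace. Otherwise, the complexity is greater than O(n * m).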

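One approach is to use a hash table.

MAX = 1 << 20  # table size; an arbitrary value chosen for this sketch
ht = [0] * MAX

Read all the cities (assuming there are thousands of them) and fill in the hash table:

ht[hash(city) % MAX] = 1  # reduce the hash to a valid index

Now, as you iterate over each row in the reader:

for row in reader:
    # split the keyword into words; multi-word city names need extra handling
    for word in row[0].strip("[]").split():
        # a hit can also be a hash collision, so confirm it if that matters
        if ht[hash(word) % MAX] == 1:
            # found, do stuff here
            pass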