Question

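The basis of the program is converting postcodes (UK postcodes) into coordinates. I have one file with a large number of postcodes (plus additional data such as house prices) and another file containing all UK postcodes and their associated coordinates.

I convert both files into lists, then use a for loop inside a for loop to iterate over them and compare the postcodes in the two files. If a postcode in file 1 == a postcode in file 2, the coordinates are taken and appended to the relevant row.

I've got my code up and running the way I want it. All my tests output what I expect.

The only problem is that it only works with small batches of data (I've been testing with .csv files holding ~100 rows, which creates a list of 100 inner lists).

Now I want to apply my program to my entire dataset. I ran it once and nothing happened. I walked away and watched some TV, but still nothing happened. IDLE wouldn't let me quit the program or do anything else. So I restarted and tried again, this time adding a counter to see whether my code was running. I ran the code and the counter started counting, until it reached 78902, the size of my dataset. Then it stopped and did nothing. I couldn't do anything else and couldn't close the window.

Annoyingly, it doesn't even get past reading the CSV file, so I haven't been able to manipulate my data at all.

Here is the code where it gets stuck (the first part of the code):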

    #empty variable to put the list into    
    lst = []
    # List function enables use for all files
    def create_list():

        #find the file
        file2 = input('enter filepath:')
        #read the file and iterate over it to append into the list
        with open(file2, 'r') as f:
            reader = csv.reader(f, delimiter=',')
            for row in reader:
                lst.append(row)
        return lst

So does anyone know a way to make my data more manageable?

EDIT: For those interested, here is my full code:

from tkinter.filedialog import asksaveasfile
import csv

new_file = asksaveasfile()

lst = []
# List function enables use for all files
def create_list():
    #empty variable to put the list into
    #find the file
    file2 = input('enter filepath:')
    #read the file and iterate over it to append into the list
    with open(file2, 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            lst.append(row)
    return lst


def remove_space(lst):
    '''(lst)->lst
    Returns the postcode value without any whitespace

    >>> ac45 6nh
    ac456nh
    The above would occur inside a list inside a list
    '''
    filetype = input('Is this a sale or crime?: ')
    num = 0
    #check the filetype to find the position of the postcodes
    if filetype == 'sale':
        num = 3
        #iterate over the postcode to add all characters but the space
    for line in range(len(lst)):        
        pc = ''
        for char in lst[line][num]:
            if char != ' ':
                pc = pc+char
        lst[line][num] = pc

def write_new_file(lst, new_file):
    '''(lst)->.CSV file
    Takes a list and writes it into a .CSV file.
    '''
    writer = csv.writer(new_file, delimiter=',')
    writer.writerows(lst)
    new_file.close()


#conversion function
def find_coord(postcode):

    lst = create_list()
    #create python list for conversion comparison
    print(lst[0])
    #empty variables
    long = 0
    lat = 0
    #iterate over the list of postcodes, when the right postcode is found,
    # return the co-ordinates.
    for row in lst:
        if row[1] == postcode:
            long = row[2]
            lat = row[3]
    return str(long)+' '+str(lat)

def find_all_coord(postcode, file):

    #empty variables
    long = 0
    lat = 0
    #iterate over the list of postcodes, when the right postcode is found,
    # return the co-ordinates.
    for row in file:
        if row[1] == postcode:
            long = row[2]
            lat = row[3]
    return str(long)+' '+str(lat)

def convert_postcodes():
    '''
    take a list of lst = []
    #find the file
    file2 = input('enter filepath:')
    #read the file and iterate over it to append into the list
    with open(file2, 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            lst.append(row)
    '''
    #save the files into lists so that they can be used
    postcodes = []
    with open(input('enter postcode key filepath:'), 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            postcodes.append(row)
    print('enter filepath to be converted:')
    file = []
    with open(input('enter filepath to be converted:'), 'r') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            file.append(row)
    #here is the conversion code
    long = 0
    lat = 0
    matches = 0
    for row in range(len(file)):
        for line in range(len(postcodes)):
            if file[row][3] == postcodes[line][1]:
                long = postcodes[line][2]
                lat = postcodes[line][3]
                file[row].append(str(long)+','+str(lat))
                matches = matches+1
                print(matches)
    final_file = asksaveasfile()
    write_new_file(file, final_file)

I call these functions individually from IDLE, so I can test each one before running the whole program.

Answer 1

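Your problem is that you look up every code against every entry in the other file, which makes for a huge number of comparisons.

You could try storing the data in a dict, using the postcode as the key.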
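As a rough illustration of the dict idea, the sketch below builds the lookup once and then probes it per row. The sample rows and the names `postcode_rows`, `sales_rows` and `coords_by_postcode` are made up for the example; only the column positions (postcode in column 1 of the key file, column 3 of the sales file) are taken from the question's code.

```python
# Made-up rows in the layout the question's code assumes:
# key file: [id, postcode, longitude, latitude]; sales file: postcode in column 3.
postcode_rows = [
    ["1", "ac456nh", "-1.0", "51.0"],
    ["2", "zz99zz", "-2.5", "53.5"],
]
sales_rows = [
    ["100000", "2014-01-01", "detached", "zz99zz"],
    ["250000", "2014-02-01", "flat", "ac456nh"],
    ["175000", "2014-03-01", "terraced", "nomatch"],
]

# Build the lookup once: postcode -> (longitude, latitude).
coords_by_postcode = {row[1]: (row[2], row[3]) for row in postcode_rows}

# Each lookup is a single hash probe instead of a scan of postcode_rows.
for row in sales_rows:
    coords = coords_by_postcode.get(row[3])
    if coords is not None:
        row.extend(coords)
```

Rows whose postcode is not in the key file are left unchanged here; whether to skip, flag or pad them is a choice the original post does not settle.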

Answer 2

Maybe you should use the sqlite3 module, load the CSV files into it, and do the join in SQL?
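One way this could look, as a minimal sketch only: the table names, column names and in-memory CSV contents below are assumptions for illustration, and real files would be opened with `open(path, newline='')` instead of `io.StringIO`.

```python
import csv
import io
import sqlite3

# Made-up CSV contents standing in for the two files from the question.
postcode_csv = io.StringIO("1,ac456nh,-1.0,51.0\n2,zz99zz,-2.5,53.5\n")
sales_csv = io.StringIO(
    "100000,2014-01-01,detached,zz99zz\n250000,2014-02-01,flat,ac456nh\n"
)

conn = sqlite3.connect(":memory:")
# Hypothetical schemas; every column is kept as text for simplicity.
conn.execute("CREATE TABLE postcodes (id TEXT, postcode TEXT, lng TEXT, lat TEXT)")
conn.execute("CREATE TABLE sales (price TEXT, sold TEXT, kind TEXT, postcode TEXT)")
conn.executemany("INSERT INTO postcodes VALUES (?, ?, ?, ?)", csv.reader(postcode_csv))
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", csv.reader(sales_csv))

# The index lets the join find each postcode without scanning the whole table.
conn.execute("CREATE INDEX idx_postcode ON postcodes (postcode)")

joined = conn.execute(
    "SELECT s.price, s.sold, s.kind, s.postcode, p.lng, p.lat "
    "FROM sales AS s LEFT JOIN postcodes AS p ON p.postcode = s.postcode "
    "ORDER BY s.rowid"
).fetchall()
conn.close()
```

The `LEFT JOIN` keeps sales rows with no matching postcode (their coordinates come back as `None`); an inner `JOIN` would drop them instead.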

Answer 3

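Looping over all of that data is inefficient.

A quick and dirty solution would be to use SQLite or some other relational data store, where you can apply an index (if that alone doesn't solve your problem outright).

For this and other solutions, you can write a quick test of each option using timeit(), increasing the data size to see how each one responds.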
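A quick timing test along those lines might look like the sketch below. The helper names `scan_list` and `probe_dict`, the synthetic rows and the chosen sizes are all assumptions for the example; absolute timings depend on the machine, so none are quoted here.

```python
import timeit

def scan_list(rows, code):
    """Linear search, as in the question's inner loop."""
    for row in rows:
        if row[1] == code:
            return row[2], row[3]
    return None

def probe_dict(lookup, code):
    """Single hash lookup."""
    return lookup.get(code)

timings = []
for size in (1_000, 10_000, 100_000):
    # Synthetic rows: [id, postcode, longitude, latitude].
    rows = [[str(i), "pc%d" % i, str(i), str(-i)] for i in range(size)]
    lookup = {row[1]: (row[2], row[3]) for row in rows}
    worst_case = "pc%d" % (size - 1)  # last row: worst case for the list scan
    t_list = timeit.timeit(lambda: scan_list(rows, worst_case), number=20)
    t_dict = timeit.timeit(lambda: probe_dict(lookup, worst_case), number=20)
    timings.append((size, t_list, t_dict))
    print(f"{size:>7} rows: list {t_list:.5f}s  dict {t_dict:.5f}s")
```

The thing to watch is how each column grows as `size` increases, rather than any single number.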

Answer 4

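Your code would be far more efficient if you used dict() instead of list(). The general algorithm:

1. Load the data into two dictionaries: one for the coordinates and one for your own data. Both use the postcode as the key.
2. Iterate over the shorter of the two dictionaries and, for each postcode, look up the item with the same postcode in the larger dictionary. Save the matching postcode and coordinates somewhere.

The point is that dict() has O(1) average time complexity for lookup by key, while a list() needs an O(n) search (which is practically the same as another loop). For large data this makes a huge difference, and you don't actually need the double loop.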
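The two steps above can be sketched roughly as follows. The sample rows and variable names are invented for illustration. One assumption goes beyond the answer: since several sales can share a postcode, the sketch stores a list of rows per postcode so that a plain dict does not silently keep only the last one.

```python
from collections import defaultdict

# Made-up rows: key file is [id, postcode, lng, lat]; sales postcode is column 3.
coord_rows = [
    ["1", "ac456nh", "-1.0", "51.0"],
    ["2", "zz99zz", "-2.5", "53.5"],
    ["3", "xx11xx", "0.5", "52.0"],
]
info_rows = [
    ["100000", "2014-01-01", "detached", "zz99zz"],
    ["120000", "2014-05-01", "detached", "zz99zz"],
    ["175000", "2014-03-01", "terraced", "nomatch"],
]

# Step 1: two dictionaries, both keyed by postcode.
coords = {row[1]: (row[2], row[3]) for row in coord_rows}
info = defaultdict(list)
for row in info_rows:
    info[row[3]].append(row)

# Step 2: walk the shorter dictionary and probe the larger one.
smaller, larger = (coords, info) if len(coords) <= len(info) else (info, coords)
matched = []
for postcode in smaller:
    if postcode in larger:
        lng, lat = coords[postcode]
        for row in info[postcode]:
            matched.append(row + [lng, lat])
```

Here `matched` ends up holding both `zz99zz` sales with their coordinates appended, and the unmatched postcode is skipped.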

Answer 5

Your main bottleneck is in the convert_postcodes function:

for row in range(len(file)):
    for line in range(len(postcodes)):

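If there are N items in file and M items in postcodes, this double loop takes M*N iterations.

Instead, loop over the items in postcodes once and store the mapping from code to longitude/latitude in a dict. Then loop over file once and use this dict to supply the required data for each item in file. This takes M+N iterations in total: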

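import csv

def convert_postcodes(postcode_path, file_path, output_path):
    # Build a postcode -> [longitude, latitude] lookup in one pass (M iterations).
    postcodes = dict()
    # Python 3: open CSV files in text mode with newline='' (not 'rb'/'wb').
    with open(postcode_path, 'r', newline='') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            code, lng, lat = row[1:4]
            postcodes[code] = [lng, lat]
    with open(file_path, 'r', newline='') as fin, \
         open(output_path, 'w', newline='') as fout:
        reader = csv.reader(fin, delimiter=',')
        writer = csv.writer(fout, delimiter=',')
        # One pass over the data file (N iterations); each lookup is O(1).
        for row in reader:
            code = row[3]
            # Unknown postcodes get empty coordinates instead of a KeyError.
            row.extend(postcodes.get(code, ['', '']))
            writer.writerow(row)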
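A dict lookup only matches when the keys are identical, so the two files need to agree on how a postcode is written. The question's `remove_space` strips the spaces; the sketch below does the same with a hypothetical `normalize_postcode` helper and additionally upper-cases the value, which is an assumption not made in the original post.

```python
def normalize_postcode(raw):
    """Hypothetical helper: drop all whitespace and upper-case the postcode."""
    return "".join(raw.split()).upper()

# Made-up key rows: [id, postcode, longitude, latitude].
key_rows = [["1", "AC45 6NH", "-1.0", "51.0"]]

# Normalize once when building the lookup...
coords_by_postcode = {}
for row in key_rows:
    coords_by_postcode[normalize_postcode(row[1])] = (row[2], row[3])

# ...and again on every key used to probe it.
lookups = [
    coords_by_postcode.get(normalize_postcode(p))
    for p in ("ac45 6nh", "AC456NH", " Ac45 6Nh ")
]
```

All three spellings resolve to the same entry because both sides of the lookup go through the same normalization.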

Python struggling with data?
