Question

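I want to do fuzzy matching, where I match the strings in a column of a large DataFrame (130,000 rows) against a list (400 rows). The code I wrote was tested on a small sample (matching 3,000 rows against 400 rows) and works fine. It is too large to copy here, but it roughly works like this: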

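1) Normalize the data in the columns.
2) Create the Cartesian product of the columns and calculate the Levenshtein distance.
3) Select the highest-scoring matches and store 'large_csv_name' in a separate list.
4) Compare 'large_csv_names' to 'large_csv', pull out all the intersecting data, and write it to a csv.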
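Since the original code is not shown, here is only a rough sketch of steps 1 to 3 that scores one name at a time instead of building the full Cartesian product. The helper names (`levenshtein`, `normalize`, `best_match`), the normalization rule, and the sample names are all assumptions for illustration, not the asker's code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay small
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Step 1 (assumed rule): lower-case and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def best_match(name, candidates):
    """Steps 2-3: score one name against every candidate, keep the closest."""
    target = normalize(name)
    return min(candidates, key=lambda c: levenshtein(target, normalize(c)))


# Made-up sample data standing in for the 400-row list and the large column.
candidates = ["Acme Corporation", "Globex", "Initech"]
large_names = ["ACME  corporation", "globex inc", "Initec"]
matches = {name: best_match(name, candidates) for name in large_names}
print(matches)
```

Only one row of scores exists at any moment, so memory use depends on the length of the candidate list rather than on the size of the product. A pure-Python distance will be slow on 130,000 rows, so a compiled Levenshtein library would probably be worth swapping in.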

Since the Cartesian product contains more than 50 million records, I quickly run into memory errors.
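For a sense of scale, the arithmetic can be checked quickly. The row counts come from the question; the chunk size of 500 is just one of the values mentioned there, picked for illustration.

```python
import math

large_rows = 130_000   # rows in the large DataFrame (from the question)
list_rows = 400        # rows in the list to match against (from the question)
chunk_size = 500       # one of the chunk sizes mentioned in the question

total_pairs = large_rows * list_rows            # full Cartesian product
pairs_per_chunk = chunk_size * list_rows        # product for a single chunk
n_chunks = math.ceil(large_rows / chunk_size)   # chunks needed to cover the file

print(total_pairs, pairs_per_chunk, n_chunks)   # 52000000 200000 260
```

So each chunk only ever needs 200,000 pairs in memory at once, as long as the per-chunk product is discarded before the next chunk is processed.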

That is why I would like to know how to split the large dataset into chunks and then run my script on each one.

So far, I have tried:

df_split = np.array_split(df, x)  # x = e.g. 50 or 500
for chunk in df_split:
    ...  # steps 1-4 as above

As well as:

for chunk in pd.read_csv('large_csv.csv', chunksize=x):  # x = e.g. 50 or 500
    ...  # steps 1-4 as above

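Neither of these approaches seems to work. I would like to know how to run the fuzzy matching in chunks: cut the large csv into pieces, take one piece, run the code, take the next piece, run the code, and so on.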
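A minimal sketch of what the chunked loop might look like, assuming the large csv has a `name` column. The in-memory csv, the `targets` list, and the helpers `closest` and `match_chunk` are hypothetical, and `difflib.get_close_matches` from the standard library is used here as a stand-in for the Levenshtein scoring, so the scores will differ from the original script.

```python
import difflib
import io

import pandas as pd

# Hypothetical stand-ins for 'large_csv.csv' and the 400-row list.
large_csv = io.StringIO(
    "name,value\n"
    "Acme Corp,1\n"
    "Globex Inc,2\n"
    "Initec,3\n"
    "Unrelated,4\n"
    "acme corp.,5\n"
)
targets = ["acme corp", "globex", "initech"]


def closest(name, targets, cutoff=0.6):
    """Return the most similar target, or None if nothing reaches the cutoff."""
    hits = difflib.get_close_matches(name, targets, n=1, cutoff=cutoff)
    return hits[0] if hits else None


def match_chunk(chunk, targets):
    """Run steps 1-3 on one chunk and keep only the rows that found a match."""
    names = chunk["name"].str.lower().str.strip()        # step 1: normalize
    best = names.map(lambda n: closest(n, targets))      # steps 2-3: best match
    return chunk.assign(match=best).dropna(subset=["match"])


# chunksize=2 only so the toy data yields several chunks.
results = [match_chunk(chunk, targets) for chunk in pd.read_csv(large_csv, chunksize=2)]
matched = pd.concat(results, ignore_index=True)
print(matched)
```

The key point is that everything from steps 1 to 4 happens inside the function called per chunk, so nothing larger than one chunk times the list is held at a time. If even the collected results are too big, each chunk's result could instead be appended to the output file with `to_csv(..., mode="a")`.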

Answer 1

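In the meantime, I wrote a script that splits a DataFrame into chunks, each of which can then be processed further. Since I am new to Python, the code is probably a bit messy, but I still want to share it with anyone who might run into the same problem.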

import math

import pandas as pd

# df is assumed to be the large DataFrame, loaded beforehand.
partitions = 3                                # number of ways to split df
length = len(df)
chunk_size = math.ceil(length / partitions)   # rows per chunk (the last one may be shorter)

block_start = 0                               # begin index of the current slice
block_end = min(chunk_size, length)           # end index (exclusive) of the current slice
while block_start < length:                   # stop slicing when df ends
    df1 = df.iloc[block_start:block_end]      # temp df that forms the chunk
    for i in range(len(df1)):
        row = df1.iloc[i]
        # insert operations on row of df1 here

    block_start = block_end                   # when the for loop ends, indices are updated
    block_end = min(block_start + chunk_size, length)
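The same slicing idea can also be wrapped in a small generator, which keeps the index bookkeeping out of the processing loop. This is only a sketch; `iter_chunks` is a made-up helper name and the seven-row DataFrame is toy data.

```python
import math

import pandas as pd


def iter_chunks(df, partitions):
    """Yield at most `partitions` consecutive, roughly equal slices of `df`."""
    size = max(1, math.ceil(len(df) / partitions))   # guard against a zero step
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size]


df = pd.DataFrame({"name": list("abcdefg")})   # toy data, 7 rows
sizes = [len(chunk) for chunk in iter_chunks(df, partitions=3)]
print(sizes)   # [3, 3, 1]
```

Each yielded chunk is an ordinary DataFrame, so the per-chunk matching code can be called on it directly, e.g. `for chunk in iter_chunks(df, 50): ...`.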

Processing large Pandas DataFrames (fuzzy matching)
