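Split a large dataset into chunks and loop the same code over every chunk

Time: 2018-12-10 10:11:20

Tags: python pandas loops

My problem is a bit tricky. I split a huge data file into several chunks and run the same fuzzy-matching code on each chunk. Afterwards I collate the results into one file. I would like to know whether some kind of loop could reuse the code instead of writing it out once per variable. An example is below.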

import difflib
import pandas as pd

df = pd.read_csv('dec 10.csv')
df1 = df.iloc[0:20000]
df2 = df.iloc[20000:40000]
df3 = df.iloc[40000:60000]
match1 = df1['Customer Name'].map(lambda x: difflib.get_close_matches(x, df1['Customer Name'].values, n=2, cutoff=0.8)).apply(pd.Series).dropna(axis=0)
match2 = df2['Customer Name'].map(lambda x: difflib.get_close_matches(x, df2['Customer Name'].values, n=2, cutoff=0.8)).apply(pd.Series).dropna(axis=0)
match3 = df3['Customer Name'].map(lambda x: difflib.get_close_matches(x, df3['Customer Name'].values, n=2, cutoff=0.8)).apply(pd.Series).dropna(axis=0)


# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
a = pd.concat([match1, match2], ignore_index=True)
b = pd.concat([a, match3], ignore_index=True)

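I am looking for a cleaner way to write the matching code, rather than writing it out for each chunk and collating the results afterwards.

2 Answers:

Answer 0: (score: 1)

You can iterate over a list of the dataframes, so that each iteration refers only to df and the code is not repeated: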

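matches = []
for df in [df1, df2, df3]:
    # the outer parentheses let the method chain span several lines
    match_ = (df['Customer Name']
              .map(lambda x: difflib.get_close_matches(
                  x, df['Customer Name'].values, n=2, cutoff=0.8))
              .apply(pd.Series)
              .dropna(axis=0))
    matches.append(match_)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
match = pd.concat(matches, ignore_index=True)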
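
To see what the chained expression inside the loop produces, here is a minimal sketch that runs it step by step on three made-up names. The names and the intermediate variables (`as_lists`, `as_columns`, `pairs`) are only illustrative and do not come from the question's data.

```python
import difflib
import pandas as pd

# Made-up sample data standing in for one chunk's 'Customer Name' column
names = pd.Series(["John Smith", "Jon Smith", "Alice Wong"], name="Customer Name")

# Step 1: each name maps to a list of up to 2 close matches.
# A name always matches itself with a perfect score, so it comes first.
as_lists = names.map(
    lambda x: difflib.get_close_matches(x, names.values, n=2, cutoff=0.8)
)

# Step 2: expand the lists into columns; shorter lists are padded with NaN.
as_columns = as_lists.apply(pd.Series)

# Step 3: drop rows containing NaN, i.e. names that matched only themselves.
pairs = as_columns.dropna(axis=0)
print(pairs)
```

So with n=2, each surviving row holds a name and its closest other name within the same chunk.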

Answer 1: (score: 1)

First, you can split something into groups of length n like this:

dfgroups = [df[x:x+n] for x in range(0, len(df), n)]

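With 20000 in place of n, you get chunks of at most 20,000 rows each. You can then loop your code over each item in dfgroups. You will also want matches to be a list that you can append to. Finally, for readability, with a line this long you may prefer to write a mapper function rather than use a large lambda.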
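
The chunking expression is easy to check on a small frame. The sketch below uses a made-up 10-row frame and n = 4 so the uneven last chunk is visible. It uses `df.iloc[x:x + n]`, which slices rows by position just like `df[x:x + n]` does with integer bounds.

```python
import pandas as pd

# Made-up 10-row frame; the real data would come from pd.read_csv
df = pd.DataFrame({"Customer Name": [f"name {i}" for i in range(10)]})

n = 4  # chunk length (20000 in the question)
dfgroups = [df.iloc[x:x + n] for x in range(0, len(df), n)]

# The last chunk simply holds whatever rows are left over
chunk_sizes = [len(group) for group in dfgroups]
print(chunk_sizes)  # [4, 4, 2]
```

Each chunk keeps its original index labels, so the last chunk here still carries labels 8 and 9.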

Putting it all together, the code can be rewritten like this.

import difflib
import pandas as pd

df = pd.read_csv('dec 10.csv')

# split df into groups of 20,000
dfgroups = [df[x:x+20000] for x in range(0, len(df), 20000)]
matches = [] # empty list to store matches

for dfgroup in dfgroups:

    # a function to replace that long line, more readable
    # this function will get redefined every loop, using the new `dfgroup` each iteration
    # this is optional, and you can instead keep that long line, replacing `df` with `dfgroup`
    def mapper(x):
        values = dfgroup['Customer Name'].values
        # returns a plain list of up to 2 close matches for one name
        return difflib.get_close_matches(x, values, n=2, cutoff=0.8)

    match = dfgroup['Customer Name'].map(mapper) # passing the function as an argument rather than using a lambda
    # expand each list into columns, then drop rows that lack a second match
    match = match.apply(pd.Series).dropna(axis=0)
    matches.append(match) # append it to the matches list

Now matches is equivalent to [match1, match2, match3, ...] and can be used as matches[0], matches[1], and so on.
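
Since the question also collates the per-chunk results into one table, the list can be passed to `pd.concat` in a single call. This is a minimal end-to-end sketch under assumptions: the six names and the chunk size of 3 are made up, and `match_chunk` is a hypothetical helper name, not something from the question.

```python
import difflib
import pandas as pd


def match_chunk(chunk, column="Customer Name"):
    """Return name/closest-match pairs found within a single chunk."""
    values = chunk[column].tolist()
    found = chunk[column].map(
        lambda name: difflib.get_close_matches(name, values, n=2, cutoff=0.8)
    )
    # Expand the lists into columns and keep only rows with a second match
    return found.apply(pd.Series).dropna(axis=0)


# Made-up data standing in for the CSV file
df = pd.DataFrame({"Customer Name": [
    "John Smith", "Jon Smith", "Alice Wong",
    "Acme Corp", "Acme Corp.", "Bob Lee",
]})

chunk_size = 3  # 20000 in the question
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

matches = [match_chunk(chunk) for chunk in chunks]   # one result per chunk
combined = pd.concat(matches, ignore_index=True)     # collate in one call
print(combined)
```

Note that names are only compared with other names in the same chunk, exactly as in the original three-block version.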