Question

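My problem is a bit tricky. I split a huge data file into several chunks and apply the same fuzzy-matching code to each chunk. Afterwards, I collate the results into a single file. I would like to know whether some kind of loop can reuse the code instead of writing it out for every variable. An example is below.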

import difflib
import pandas as pd

df = pd.read_csv('dec 10.csv')
df1 = df.iloc[0:20000]
df2 = df.iloc[20000:40000]
df3 = df.iloc[40000:60000]
match1 = df1['Customer Name'].map(lambda x: difflib.get_close_matches(x, df1['Customer Name'].values, n=2, cutoff=0.8)).apply(pd.Series).dropna(axis=0)
match2 = df2['Customer Name'].map(lambda x: difflib.get_close_matches(x, df2['Customer Name'].values, n=2, cutoff=0.8)).apply(pd.Series).dropna(axis=0)
match3 = df3['Customer Name'].map(lambda x: difflib.get_close_matches(x, df3['Customer Name'].values, n=2, cutoff=0.8)).apply(pd.Series).dropna(axis=0)


a = match1.append(match2, ignore_index =True)
b = a.append(match3, ignore_index =True)

I am looking for a cleaner way to write the matching code once, rather than writing it for every chunk and collating the results afterwards.

Answer 1

You can iterate over a list of the DataFrames, so that each iteration refers only to df and the code is not repeated:

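matches = []
for df in [df1, df2, df3]:
    # parentheses let the method chain span several lines
    match_ = (
        df['Customer Name']
        .map(lambda x: difflib.get_close_matches(x, df['Customer Name'].values, n=2, cutoff=0.8))
        .apply(pd.Series)
        .dropna(axis=0)
    )
    matches.append(match_)
# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
match = pd.concat(matches, ignore_index=True)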
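To check the loop on data small enough to inspect by eye, here is a minimal sketch. The names, the two chunks, and the column labels `name` and `closest` are made up for illustration. It also filters out names with fewer than two matches explicitly rather than through `.apply(pd.Series).dropna()`, which is a substitution and not the code from the question.

```python
import difflib
import pandas as pd

# Hypothetical toy data standing in for the 'Customer Name' column
names = pd.Series(
    ['John Smith', 'Jon Smith', 'Alice Wong', 'Bob Stone', 'Rob Stone'],
    name='Customer Name',
)
# Two hand-made chunks instead of the 20,000-row slices
chunks = [names.iloc[0:3], names.iloc[3:5]]

pairs = []
for chunk in chunks:
    candidates = chunk.tolist()
    # each name is compared only against names in its own chunk
    found = chunk.map(
        lambda x: difflib.get_close_matches(x, candidates, n=2, cutoff=0.8)
    )
    # keep names that matched something besides themselves
    pairs.extend(m for m in found if len(m) == 2)

result = pd.DataFrame(pairs, columns=['name', 'closest'])
print(result)
```

Because matching happens inside each chunk, two similar names that land in different chunks are never compared with each other. That is also true of the original code.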

Answer 2

First, you can split something into groups of length n like this:

dfgroups = [df[x:x+n] for x in range(0, len(df), n)]

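Replace n with 20000 and you get chunks of at most 20,000 rows each. You can then loop over each item in dfgroups and run your code on it. You will also want matches to be a list of its own that you can append to. Finally, for readability, with a line this long you may want to write a mapper function instead of a large lambda.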
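To see what the slicing expression produces, here is a minimal sketch on a made-up 7-row frame with n = 3. The frame `toy` and its contents are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV
toy = pd.DataFrame(
    {'Customer Name': ['Ann', 'Anne', 'Bob', 'Rob', 'Cy', 'Sy', 'Dee']}
)
n = 3  # chunk size; the question uses 20000

# Same slicing expression as above: consecutive row slices of at most n rows
dfgroups = [toy[x:x+n] for x in range(0, len(toy), n)]

chunk_sizes = [len(g) for g in dfgroups]
print(chunk_sizes)  # [3, 3, 1]
```

The last chunk is simply shorter when the row count is not a multiple of n, so no rows are lost.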

Putting it all together, the code can be rewritten like this.

import difflib
import pandas as pd

df = pd.read_csv('dec 10.csv')

# split df into groups of 20,000 rows
dfgroups = [df[x:x+20000] for x in range(0, len(df), 20000)]
matches = []  # empty list to store matches

for dfgroup in dfgroups:
    values = dfgroup['Customer Name'].values

    # a function to replace the long lambda, more readable
    # it is redefined every loop, so it uses the current chunk's `values`
    # this is optional; you can instead keep the long line, replacing `df` with `dfgroup`
    def mapper(x):
        # returns a plain list of up to two close matches for one name
        return difflib.get_close_matches(x, values, n=2, cutoff=0.8)

    # pass the function as an argument rather than using a lambda
    match = dfgroup['Customer Name'].map(mapper)
    # expand each list into columns and drop rows that contain a missing match
    match = match.apply(pd.Series).dropna(axis=0)
    matches.append(match)  # append it to the matches list

matches is now equivalent to [match1, match2, match3, ...] and can be used as matches[0], matches[1], and so on.
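If a single combined table is the goal, as in the question, the list can be handed to `pd.concat`. A minimal sketch, with two small hand-written frames standing in for the real per-chunk results (the data is made up):

```python
import pandas as pd

# Hypothetical stand-ins for the per-chunk results collected in `matches`
matches = [
    pd.DataFrame({0: ['John Smith', 'Jon Smith'], 1: ['Jon Smith', 'John Smith']}),
    pd.DataFrame({0: ['Bob Stone'], 1: ['Rob Stone']}),
]

# Stack the chunks into one frame; ignore_index renumbers the rows from 0
combined = pd.concat(matches, ignore_index=True)
print(combined.shape)  # (3, 2)
```

Calling `pd.concat` once at the end avoids copying the accumulated frame on every iteration, which is what happens when a frame is grown inside the loop.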

Splitting a large dataset and looping the same code over all chunks
