Question

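I'm working with pandas dataframes of shape ~(100000, 50). I can get the data formatted and manipulated the way I need, but my code takes far longer to run than I'd like (3-10 minutes), depending on the task, including:

Combining strings from different columns
Applying a function to each instance within a dataframe series
Checking whether a value is contained in a separate list or numpy array

I will have larger dataframes in the future and want to make sure I'm using appropriate coding methods to avoid very long processing times. I find that my for loops take the longest. I try to avoid for loops by using list comprehensions and series operators (e.g. df.loc[:,'C'] = df.A + df.B), but in some cases I need to perform more complicated/involved manipulations with nested for loops. For example, the code below iterates through the dataframe's series history (a series of lists), then iterates through each item within each list: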

for row in DF.iterrows():

    removelist = []

    for i in xrange(0, len(row[1]['history'])-1):
        if ((row[1]['history'][i]['title'] == row[1]['history'][i+1]['title']) & 
            (row[1]['history'][i]['dept'] == row[1]['history'][i+1]['dept']) & 
            (row[1]['history'][i]['office'] == row[1]['history'][i+1]['office']) & 
            (row[1]['history'][i]['employment'] == row[1]['history'][i+1]['employment'])):
                removelist.append(i)

    newlist = [v for i, v in enumerate(row[1]['history']) if i not in removelist]

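I know that list comprehensions can accommodate nested for loops, but the logic above would be really cumbersome inside a list comprehension.

My questions: what other techniques can I use to achieve the same functionality as a for loop with a shorter processing time? When iterating through a series containing lists, should I use a technique other than a nested for loop?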

Answer 1

So what you have here is a dataframe where the history entry of each row contains a list of dictionaries? Like:

import pandas as pd
john_history = [{'title': 'a', 'dept': 'cs'}, {'title': 'cj', 'dept': 'sales'}]
john_history
jill_history = [{'title': 'boss', 'dept': 'cs'}, {'title': 'boss', 'dept': 'cs'}, {'title': 'junior', 'dept': 'cs'}]
jill_history
df = pd.DataFrame({'history': [john_history, jill_history], 
    'firstname': ['john', 'jill']})

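I would restructure your data so that you use pandas structures at the lower levels of your structure, e.g. a dict of DataFrames where each DataFrame is one person's history (I don't think Panel works here, as the DataFrames may have different lengths). Below, the per-person DataFrames are tagged with a name column and concatenated into one long DataFrame: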

john_history = pd.DataFrame({'title': ['a', 'cj'], 'dept': ['cs', 'sales']})
john_history['name'] = 'john'
jill_history = pd.DataFrame({'title': ['boss', 'boss', 'junior'], 'dept': ['cs', 'cs', 'cs']})
jill_history['name'] = 'jill'
people = pd.concat([john_history, jill_history])
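
If the data already sits in the nested layout from the question, the long layout above does not have to be typed out by hand. The following is a minimal sketch, assuming each history cell is a list of flat dictionaries; the sample values and the names records and people_long are made up for illustration:

```python
import pandas as pd

# Hypothetical sample mirroring the nested layout from the question.
df = pd.DataFrame({
    'firstname': ['john', 'jill'],
    'history': [
        [{'title': 'a', 'dept': 'cs'}, {'title': 'cj', 'dept': 'sales'}],
        [{'title': 'boss', 'dept': 'cs'}, {'title': 'boss', 'dept': 'cs'},
         {'title': 'junior', 'dept': 'cs'}],
    ],
})

# Build one flat record per history entry, tagged with the owner's name.
records = [
    {'name': name, **entry}
    for name, history in zip(df['firstname'], df['history'])
    for entry in history
]

# One row per history entry, with a default unique integer index.
people_long = pd.DataFrame(records)
```

This still loops in Python, but only once, to flatten the data; after that the per-person work can be done with column operations.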

Then you can process them using groupby, like:

people.groupby('name').apply(pd.DataFrame.drop_duplicates)
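
One thing worth checking with this approach: drop_duplicates removes every repeated row, not only adjacent repeats, so its result can differ from the loop in the question when someone returns to an earlier role. A small sketch with a made-up history (the name sam and all values are hypothetical):

```python
import pandas as pd

# Hypothetical history in which a person returns to an earlier role.
sam = pd.DataFrame({
    'title': ['dev', 'lead', 'dev'],
    'dept': ['cs', 'cs', 'cs'],
    'name': ['sam', 'sam', 'sam'],
})

# No two adjacent rows are identical, yet the third row is dropped
# because it repeats the first one.
deduped = sam.drop_duplicates()
```

If only consecutive repeats should go, a comparison against the previous row is probably the better fit.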

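In general, if you can't find the functionality you want in pandas/numpy, you'll find that building it from pandas primitives is faster than iterating over the dataframe. For example, to recreate your logic above, first create a new dataframe that is the original one shifted by one row: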

df2 = df.shift()

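Now you can build a selection by comparing the contents of the two dataframes, keeping only the rows that differ from the previous row, and use it to filter the dataframe: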

selection_array = (df.history == df2.history) & (df.title == df2.title)
unduplicated_consecutive = df[~selection_array]
print(unduplicated_consecutive)
  history  id title
0       a   1     x
1       b   2     y
# or in one line:
df[~((df.history == df2.history) & (df.title == df2.title))]
# or:
df[(df.history != df2.history) | (df.title != df2.title)]
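
The snippet above relies on a frame with history, id and title columns that is not defined in the answer. A self-contained sketch of the same shift-and-compare idea, using made-up values and the hypothetical names frame, previous and kept, might look like this:

```python
import pandas as pd

# Hypothetical frame; column names follow the snippet above, values are made up.
frame = pd.DataFrame({
    'history': ['a', 'a', 'b', 'b', 'a'],
    'id': [1, 2, 3, 4, 5],
    'title': ['x', 'x', 'y', 'z', 'x'],
})

# After shift(), row i holds the values of row i - 1; row 0 becomes NaN.
previous = frame.shift()

# True where both compared columns match the row directly above.
same_as_previous = (
    (frame['history'] == previous['history'])
    & (frame['title'] == previous['title'])
)

# Keep the first row of every run of consecutive repeats.
kept = frame[~same_as_previous]
```

Row 0 is always kept because a comparison with NaN is False. Note that this keeps the first entry of each run, whereas the loop in the question keeps the last one; the retained values are the same when all fields are compared.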

So, putting that into the groupby:

def drop_consecutive_duplicates(df):
    df2 = df.shift()
    return df.drop(df[(df.dept == df2.dept) & (df.title == df2.title)].index)

people.groupby('name').apply(drop_consecutive_duplicates)
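
As a possible variation, the per-group Python function can be avoided by shifting within each group and comparing whole columns at once. This is a sketch rather than the answer's method; it assumes the long frame has a unique index (for example, one built with pd.concat(..., ignore_index=True)), and the names compare_cols, is_repeat and result are illustrative:

```python
import pandas as pd

# Same sample data as above, but with a unique integer index.
people = pd.DataFrame({
    'name': ['john', 'john', 'jill', 'jill', 'jill'],
    'title': ['a', 'cj', 'boss', 'boss', 'junior'],
    'dept': ['cs', 'sales', 'cs', 'cs', 'cs'],
})

compare_cols = ['title', 'dept']

# Shift inside each person's rows, so the first row of a person is
# never compared with the last row of someone else.
previous = people.groupby('name')[compare_cols].shift()

# A row is a repeat when every compared column equals the previous row.
is_repeat = (people[compare_cols] == previous).all(axis=1)

result = people[~is_repeat].reset_index(drop=True)
```

Here only jill's second boss/cs entry is dropped; whether this is actually faster than groupby().apply() on a given dataset is worth timing rather than assuming.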

Faster for loops to manipulate data in Pandas
