Pandas - processing multiple columns seems slow

Date: 2017-06-25 23:18:47

Tags: python performance csv pandas lambda

I'm running into some trouble using Pandas to process a large csv. The csv consists of an index plus about 450 other columns in groups of 3, like the following:

    cola1    colb1   colc1   cola2    colb2   colc2   cola3    colb3   colc3
1  stra_1  ctrlb_1  retc_1  stra_1  ctrlb_1  retc_1  stra_1  ctrlb_1  retc_1
2  stra_2  ctrlb_2  retc_2  stra_2  ctrlb_2  retc_2  stra_2  ctrlb_2  retc_2
3  stra_3  ctrlb_3  retc_3  stra_3  ctrlb_3  retc_3  stra_3  ctrlb_3  retc_3

For each group of three columns, I want to analyze column B (it is a kind of "control field"); depending on its value, I should return a value computed from columns A and C.

In the end, I need to return the concatenation of all the result columns, going from 150 columns down to 1.

I've already tried apply, but it seems too slow (10 minutes to process 50k rows):

df['Path'] = df.apply(lambda x: getFullPath(x), axis=1)

You can find a sample of the function here: https://pastebin.com/S9QWTGGV
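
(Roughly, the kind of per-row function I mean looks like the sketch below; the real one is in the pastebin and differs in the actual rules, so the 'ctrl_a' check and the n_groups parameter are just placeholders.)

# Hypothetical sketch only -- the real getFullPath lives in the pastebin above.
def getFullPath(row, n_groups=150):
    parts = []
    for i in range(1, n_groups + 1):
        a = row['cola' + str(i)]
        b = row['colb' + str(i)]  # the "control field"
        c = row['colc' + str(i)]
        # Placeholder rule: the control field decides whether A or C contributes
        parts.append(a if b.startswith('ctrl_a') else c)
    return '|'.join(parts)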

I tried extracting the list of unique combinations of cola, colb, colc, pre-processing that list, and using map to generate the results, which sped things up a bit:

# Build a "Concat" key column for each group of 3 columns
for i in range(1, 151):
    df['Concat' + str(i)] = df['cola' + str(i)] + '|' + df['colb' + str(i)] + '|' + df['colc' + str(i)]

concats = ['Concat' + str(i) for i in range(1, 151)]

# Collect the unique combinations so each one is processed only once
ret = df[concats].values.ravel()
uniq = set(ret)
lookup = {}  # renamed from "list" to avoid shadowing the built-in

for member in uniq:  # iterate over the unique values, not all of ret
    lookup[member] = getPath2(member)

# Map the pre-computed results back onto each group
MAX_COLS = 150
for i in range(1, MAX_COLS + 1):
    df['Res' + str(i)] = df['Concat' + str(i)].map(lookup)

df['Path'] = df.apply(getFullPath2, axis=1)

The functions getPath2 and getFullPath2 used here are defined as examples at: https://pastebin.com/zpFF2wXD
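
(Again, just a rough sketch of the shape of getPath2 working on the "a|b|c" keys; the real rules are in the pastebin.)

# Hypothetical sketch only -- the real getPath2 is in the pastebin above.
def getPath2(key):
    a, b, c = key.split('|')
    # Placeholder rule: the control field b decides which of a / c is returned
    return a if b.startswith('ctrl_a') else c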

But it still looks a bit slow (processing everything takes 6 minutes). Do you have any suggestions on how to speed up the csv processing? I don't even know whether the way I'm using the "Concat" columns is the best approach :) I tried Series.cat, but I didn't get how to chain only some of the columns rather than the whole df.
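
(Roughly what I was attempting with str.cat was something like the sketch below, passing the other Series through others=; just a sketch of the pattern, I may well have the usage wrong.)

# Sketch: joining only a subset of columns via Series.str.cat
df['Concat1'] = df['cola1'].str.cat([df['colb1'], df['colc1']], sep='|')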

Thanks a lot! Mike

1 answer:

Answer 0 (score: 1):

Corrected answer: I see from your criteria that you actually have multiple controls on each column. What I think would work is to split these into 3 dataframes and apply your mapping along the lines of:

def foo(a, b, c):
    return call_other(a+1,b+1,c+1)

Which gives the following:

import pandas as pd

series = {
  'cola1': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
  'colb1': pd.Series(['ret1','ret1','ret2'],index=[1,2,3]),
  'colc1': pd.Series(['B_1','C_2','B_3'],index=[1,2,3]),
  'cola2': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
  'colb2': pd.Series(['ret3','ret1','ret2'],index=[1,2,3]),
  'colc2': pd.Series(['B_2','A_1','A_3'],index=[1,2,3]),
  'cola3': pd.Series(['D_1','C_1','E_1'],index=[1,2,3]),
  'colb3': pd.Series(['ret2','ret2','ret1'],index=[1,2,3]),
  'colc3': pd.Series(['A_1','B_2','C_3'],index=[1,2,3]),
}

your_df = pd.DataFrame(series, index=[1,2,3], columns=['cola1','colb1','colc1','cola2','colb2','colc2','cola3','colb3','colc3'])

# Split your dataframe into three frames for each column type
bframes = your_df[[col for col in your_df.columns if 'colb' in col]]
aframes = your_df[[col for col in your_df.columns if 'cola' in col]]
cframes = your_df[[col for col in your_df.columns if 'colc' in col]]
for df in [bframes, aframes, cframes]:
    df.columns = ['col1','col2','col3']

# Mapping criteria
def map_colb(c):
    if c == 'ret1':
        return 'A'
    elif c == 'ret2':
        return None
    else:
        return 'F'

def map_cola(a):
    if a.startswith('D_'):
        return 'D'
    else:
        return 'E'

def map_colc(c):
    if c.startswith('B_'):
        return 'B'
    elif c.startswith('C_'):
        return 'C'
    elif c.startswith('A_'):
        return None
    else:
        return 'F'

# Use it on each frame
aframes = aframes.applymap(map_cola)
bframes = bframes.applymap(map_colb)
cframes = cframes.applymap(map_colc)

# The trick here is filling 'None's from the left to right in order of precedence
final = bframes.fillna(cframes.fillna(aframes))
# Then just combine them using whatever delimiter you like
# final.values.tolist() turns the frame into a list of row lists
pathlist = ['|'.join(item) for item in final.values.tolist()]
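
If you want the result back on the original frame, just assign it; on the sample data above this should come out as something like ['A|F|D', 'A|A|B', 'B|E|A'] (worth double-checking against your real rules):

# Attach the combined path back onto the original frame
your_df['Path'] = pathlist
print(your_df['Path'])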