Code to create the sample DataFrame:
import pandas as pd

Sample = [{'account': 'Jones LLC', 'Jan': 150, 'Feb': 200, 'Mar': [[.332, .326], [.058, .138]]},
{'account': 'Alpha Co', 'Jan': 200, 'Feb': 210, 'Mar': [[.234, .246], [.234, .395]]},
{'account': 'Blue Inc', 'Jan': 50, 'Feb': 90, 'Mar': [[.084, .23], [.745, .923]]}]
df = pd.DataFrame(Sample)
The sample DataFrame, visualized:
df:
account | Jan | Feb | Mar
Jones LLC | 150 | 200 | [[.332, .326], [.058, .138]]
Alpha Co | 200 | 210 | [[.234, .246], [.234, .395]]
Blue Inc | 50 | 90 | [[.084, .23], [.745, .923]]
I am looking for a way to combine the Jan and Feb columns into a single array per row, stored in a new column named New.
Expected output:
df:
account | Jan | Feb | Mar | New
Jones LLC | 150 | 200 | [[.332, .326], [.058, .138]] | [150, 200]
Alpha Co | 200 | 210 | [[.234, .246], [.234, .395]] | [200, 210]
Blue Inc | 50 | 90 | [[.084, .23], [.745, .923]] | [50, 90]
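Answer 0 (score: 8)
Use values.tolist:
# The columns are listed as ['Feb', 'Jan'] here, so each pair comes out as [Feb, Jan];
# list them as ['Jan', 'Feb'] to match the expected output in the question.
df.assign(New=df[['Feb', 'Jan']].values.tolist())
# To modify df in place, use this instead:
# df['New'] = df[['Feb', 'Jan']].values.tolist()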
Feb Jan Mar account New
0 200 150 [[0.332, 0.326], [0.058, 0.138]] Jones LLC [200, 150]
1 210 200 [[0.234, 0.246], [0.234, 0.395]] Alpha Co [210, 200]
2 90 50 [[0.084, 0.23], [0.745, 0.923]] Blue Inc [90, 50]
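As a minimal sketch of the same idea with the columns in the order the question asks for, the snippet below selects ['Jan', 'Feb'] and uses to_numpy(), which pandas documents as the preferred spelling of .values. The Mar column is left out for brevity; that omission is a simplification, not part of the original answer.

```python
import pandas as pd

# Reduced version of the sample data (the nested Mar column is omitted here).
df = pd.DataFrame({
    'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'],
    'Jan': [150, 200, 50],
    'Feb': [200, 210, 90],
})

# Select the columns in the desired order, convert the 2-D block to a NumPy
# array, then to a list of per-row lists.
df['New'] = df[['Jan', 'Feb']].to_numpy().tolist()
print(df)
```

One caveat: if the selected columns have different dtypes (say, one integer and one float), the intermediate array is upcast to a common type, so the values in New may change type.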
Timing on larger data
With a 3,000-row DataFrame, avoiding apply is more than 60 times faster.
df = pd.concat([df] * 1000, ignore_index=True)
%timeit df.assign(New=df[['Feb', 'Jan']].values.tolist())
%timeit df.assign(New=df.apply(lambda x: [x['Jan'], x['Feb']], axis=1))
1000 loops, best of 3: 947 µs per loop
10 loops, best of 3: 61.7 ms per loop
For a 30,000-row DataFrame (the original 3-row frame repeated 10,000 times), the speedup is about 160 times.
df = pd.concat([df] * 10000, ignore_index=True)
100 loops, best of 3: 3.58 ms per loop
1 loop, best of 3: 586 ms per loop
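%timeit is an IPython magic, so the comparison above will not run in a plain Python script. A rough equivalent using the standard library's timeit module might look like the sketch below. The helper names with_tolist and with_apply are made up for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# 3-row base frame repeated 1,000 times to get 3,000 rows, as in the answer.
base = pd.DataFrame({'Jan': [150, 200, 50], 'Feb': [200, 210, 90]})
big = pd.concat([base] * 1000, ignore_index=True)


def with_tolist(frame):
    # Vectorised: pull the 2-D block out once and convert it to nested lists.
    return frame[['Jan', 'Feb']].values.tolist()


def with_apply(frame):
    # Row-wise: calls the lambda once per row, which is the slow path.
    return frame.apply(lambda row: [row['Jan'], row['Feb']], axis=1).tolist()


# Average seconds per call; apply gets fewer repetitions because it is slow.
t_tolist = timeit.timeit(lambda: with_tolist(big), number=20) / 20
t_apply = timeit.timeit(lambda: with_apply(big), number=3) / 3
print(f"tolist: {t_tolist * 1e3:.3f} ms per call")
print(f"apply:  {t_apply * 1e3:.3f} ms per call")
```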
Answer 1 (score: 7)
If you are looking for speed, this is the option to use.
df['New'] = [[x, y] for x, y in zip(df.Jan, df.Feb)]
df
Feb Jan Mar account New
0 200 150 [[0.332, 0.326], [0.058, 0.138]] Jones LLC [150, 200]
1 210 200 [[0.234, 0.246], [0.234, 0.395]] Alpha Co [200, 210]
2 90 50 [[0.084, 0.23], [0.745, 0.923]] Blue Inc [50, 90]
If you want to remove the original columns, you can use
df.drop(['Jan', 'Feb'], axis=1, inplace=True)
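The zip-based list comprehension extends naturally to any number of columns. The sketch below wraps it in a helper, combine_columns, which is a hypothetical name introduced here for illustration and does not appear in the original answer.

```python
import pandas as pd


def combine_columns(frame, columns):
    """Return one list per row holding the values of `columns`, in that order."""
    # zip(*...) walks the selected columns in lockstep, yielding one tuple per row.
    return [list(values) for values in zip(*(frame[col] for col in columns))]


df = pd.DataFrame({
    'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'],
    'Jan': [150, 200, 50],
    'Feb': [200, 210, 90],
})

df['New'] = combine_columns(df, ['Jan', 'Feb'])
print(df)
```

Because each column is iterated on its own, this approach does not build an intermediate 2-D array, so columns of different dtypes keep their own value types.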
df.apply with axis=1
This is included for completeness; I no longer condone the use of apply.
df['New'] = df.apply(lambda x: [x['Jan'], x['Feb']], axis=1)
df
Feb Jan Mar account New
0 200 150 [[0.332, 0.326], [0.058, 0.138]] Jones LLC [150, 200]
1 210 200 [[0.234, 0.246], [0.234, 0.395]] Alpha Co [200, 210]
2 90 50 [[0.084, 0.23], [0.745, 0.923]] Blue Inc [50, 90]
Performance
Repeating piR's tests on small data (3,000 rows), with the list comprehension approach included, we have:
%timeit df.assign(New=df[['Feb', 'Jan']].values.tolist())
%timeit df.assign(New=df.apply(lambda x: [x['Jan'], x['Feb']], axis=1))
%timeit df.assign(New=[[x, y] for x, y in zip(df.Jan, df.Feb)])
2.76 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
152 ms ± 9.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.59 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
For larger data (30,000 rows):
5.95 ms ± 527 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.53 s ± 165 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
8.79 ms ± 793 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The list comprehension and .tolist() are both competitive approaches. Which one you use is a matter of taste. Do not use apply!
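Answer 2 (score: 5)
You can also try df['New'] = list(zip(df.Feb, df.Jan)), which stores each pair as a tuple rather than a list,
or use tolist:
# .ix has been removed from pandas; .iloc is the positional equivalent.
# This relies on Feb and Jan being the first two columns of df.
df['New'] = df.iloc[:, 0:2].values.tolist()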
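Two details of this answer are easy to miss: zip produces tuples rather than lists, and positional slicing depends on the column order of the frame. The sketch below illustrates both. The column order is pinned with an explicit columns= argument, which is an assumption made here so the positional result is predictable.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'],
        'Jan': [150, 200, 50],
        'Feb': [200, 210, 90],
    },
    columns=['account', 'Jan', 'Feb'],  # fix the column order explicitly
)

# zip yields one tuple per row.
as_tuples = list(zip(df.Jan, df.Feb))

# Selecting by label gives lists and does not depend on column position.
as_lists = df[['Jan', 'Feb']].values.tolist()

# Selecting by position picks whatever the first two columns happen to be,
# which with this column order is account and Jan, not Jan and Feb.
by_position = df.iloc[:, 0:2].values.tolist()

print(as_tuples[0])
print(as_lists[0])
print(by_position[0])
```

Selecting by label is the safer choice whenever the column order is not under your control.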