像:
A B C D
1 1 2 3 ['a','b']
2 4 6 7 ['b','c']
3 1 0 1 ['a']
4 2 1 1 ['b']
5 1 2 3 []
为:
A B C D
1 1 2 3 ['a']
2 1 2 3 ['b']
3 4 6 7 ['b']
4 4 6 7 ['c']
5 1 0 1 ['a']
6 2 1 1 ['b']
7 1 2 3 []
ps:将行分成“D”并延长行
使用:pandas dataframe处理数据
答案 0 :(得分:3)
一种方法是使用带有双嵌套for循环的列表推导:
>>> [(key + (item,))
for key, val in df.set_index(['A','B','C'])['D'].iteritems()
for item in map(list, val) or [[]]]
# [(1, 2, 3, ['a']),
# (1, 2, 3, ['b']),
# (4, 6, 7, ['b']),
# (4, 6, 7, ['c']),
# (1, 0, 1, ['a']),
# (2, 1, 1, ['b']),
# (1, 2, 3, [])]
将此表单中的数据传递给pd.DataFrame
会产生所需的结果:
import pandas as pd
df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
result = pd.DataFrame(
[(key + (item,))
for key, val in df.set_index(['A','B','C'])['D'].iteritems()
for item in map(list, val) or [[]]])
产量
0 1 2 3
0 1 2 3 [a]
1 1 2 3 [b]
2 4 6 7 [b]
3 4 6 7 [c]
4 1 0 1 [a]
5 2 1 1 [b]
6 1 2 3 []
另一种选择是使用df['D'].apply
将列表中的项目展开到不同的列中,然后使用stack
展开行:
df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
df = df.set_index(['A', 'B', 'C'])
result = df['D'].apply(lambda x: pd.Series(map(list, x) if x else [[]]))
# 0 1
# A B C
# 1 2 3 [a] [b]
# 4 6 7 [b] [c]
# 1 0 1 [a] NaN
# 2 1 1 [b] NaN
# 1 2 3 [] NaN
result = result.stack()
# A B C
# 1 2 3 0 [a]
# 1 [b]
# 4 6 7 0 [b]
# 1 [c]
# 1 0 1 0 [a]
# 2 1 1 0 [b]
# 1 2 3 0 []
# dtype: object
result.index = result.index.droplevel(-1)
result = result.reset_index()
# A B C 0
# 0 1 2 3 [a]
# 1 1 2 3 [b]
# 2 4 6 7 [b]
# 3 4 6 7 [c]
# 4 1 0 1 [a]
# 5 2 1 1 [b]
# 6 1 2 3 []
虽然这不使用显式for-loop
或列表推导,但在apply
的调用中隐藏了隐式for循环。事实上,它比使用列表理解要慢得多:
In [170]: df = pd.concat([df]*10)
In [171]: %%timeit
.....: result = df['D'].apply(lambda x: pd.Series(map(list, x) if x else [[]]))
result = result.stack()
result.index = result.index.droplevel(-1)
result = result.reset_index()
100 loops, best of 3: 11.5 ms per loop
In [172]: %%timeit
.....: result = pd.DataFrame(
[(key + (item,))
for key, val in df['D'].iteritems()
for item in map(list, val) or [[]]])
1000 loops, best of 3: 618 µs per loop
答案 1 :(得分:1)
假设您的column
D
内容属于type
string
:
print(type(df.loc[0, 'D']))
<class 'str'>
df = df.set_index(['A', 'B', 'C']).sortlevel()
df.loc[:, 'D'] = df.loc[:, 'D'].str.strip('[').str.strip(']')
df = df.loc[:, 'D'].str.split(',', expand=True).stack()
df = df.str.strip('').apply(lambda x: '[{}]'.format(x)).reset_index().drop('level_3', axis=1).rename(columns={0: 'D'})
A B C D
0 1 0 1 ['a']
1 1 2 3 ['a']
2 1 2 3 ['b']
3 1 2 3 []
4 2 1 1 ['b']
5 4 6 7 ['b']
6 4 6 7 ['c']