Given the table
| A | B | C | C | C | D | D |
| 1 | 0 | x | y | z | 8 | 9 |
| 2 | 4 | x | b |   |   |   |
what are the best ways to return
| A | B | C | D |
| 1 | 0 | x | 8 |
| 1 | 0 | y | 8 |
| 1 | 0 | z | 8 |
| 1 | 0 | x | 9 |
| 1 | 0 | y | 9 |
| 1 | 0 | z | 9 |
| 2 | 4 | x |   |
| 2 | 4 | b |   |
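I'm using pandas read_csv to pull the data in from a CSV... I'm not sure whether I can handle it there, or whether I should use SQL or Python dicts instead.
I've searched hard and couldn't find an answer.
(I'm a newbie, so I may be missing something basic...)
Edit: it needs to accommodate n rows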
Answer 0: (score: 2)
import pandas as pd

df = pd.DataFrame([[1,0,'x','y','z',8,9]], columns=list('ABCCCDD'))
# Group the transposed frame by column name, then take the Cartesian product of the groups
result = pd.MultiIndex.from_product(
    [grp for key, grp in df.T.groupby(level=0)[0]]).to_frame(index=False)
print(result)
yields
   0  1  2  3
0  1  0  x  8
1  1  0  x  9
2  1  0  y  8
3  1  0  y  9
4  1  0  z  8
5  1  0  z  9
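The result above has columns named 0 to 3 rather than A to D. A minimal sketch of one way to carry the original names through, assuming the same single-row DataFrame; the `keys` and `groups` variables are introduced here for illustration and are not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame([[1, 0, 'x', 'y', 'z', 8, 9]], columns=list('ABCCCDD'))

# Collect each distinct column name together with the values found under it.
pairs = [(key, grp.tolist()) for key, grp in df.T.groupby(level=0)[0]]
keys = [key for key, _ in pairs]
groups = [values for _, values in pairs]

# Passing names= makes to_frame() label the columns A, B, C, D.
result = pd.MultiIndex.from_product(groups, names=keys).to_frame(index=False)
print(result)
```

This is the same Cartesian product as before; only the column labels differ.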
If your DataFrame has multiple rows:
import numpy as np
import pandas as pd

def row_to_arrays(row, idx):
    """
    Split a row into a list of component arrays.
    idx specifies the indices at which we want to split the row
    """
    # Use row[1:] because the first item in each row is the index
    # (which we want to ignore)
    result = np.split(row[1:], idx)
    # Filter out empty strings
    result = [arr[arr != ''] for arr in result]
    # Filter out empty arrays
    result = [arr for arr in result if len(arr)]
    return result

def arrays_to_dataframe(arrays):
    """
    Convert list of arrays to product DataFrame
    """
    return pd.MultiIndex.from_product(arrays).to_frame(index=False)

def df_to_row_product(df):
    # find the indices at which to cut each row
    idx = pd.DataFrame(df.columns).groupby(0)[0].agg(lambda x: x.index[0])[1:]
    data = [arrays_to_dataframe(row_to_arrays(row, idx))
            for row in df.itertuples()]
    result = pd.concat(data, ignore_index=True).fillna('')
    return result

df = pd.DataFrame([[1,0,'x','y','z',8,9],
                   [2,4,'x','b','','','']], columns=list('ABCCCDD'))
print(df_to_row_product(df))
yields
   0  1  2  3
0  1  0  x  8
1  1  0  x  9
2  1  0  y  8
3  1  0  y  9
4  1  0  z  8
5  1  0  z  9
6  2  4  x
7  2  4  b
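One thing to watch when the data really comes from read_csv, as in the question: by default pandas renames repeated header names (C, C.1, C.2, ...) and turns empty cells into NaN, so the DataFrame will not look like the hand-built one above. The following is only a sketch of one way to get back to that shape; `csv_text` is a made-up stand-in for the asker's file, and reading everything as strings is an assumption made to keep empty cells as ''.

```python
import io
import re

import pandas as pd

# Hypothetical CSV content standing in for the asker's file.
csv_text = "A,B,C,C,C,D,D\n1,0,x,y,z,8,9\n2,4,x,b,,,\n"

# dtype=str plus keep_default_na=False keeps empty cells as '' instead of NaN.
df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
print(list(df.columns))  # repeated names arrive as C, C.1, C.2, D, D.1

# Strip the ".1", ".2", ... suffixes that read_csv appended.
# Caveat: this would also change a genuine column name ending in ".<digits>".
df.columns = [re.sub(r'\.\d+$', '', name) for name in df.columns]
print(list(df.columns))
```

After this step the column labels are duplicated again, which is the layout the grouping code above expects. Note that every value is a string here, including the numbers.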
Answer 1: (score: 1)
I can think of one possible solution, using a little preprocessing and itertools.product:
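from itertools import product

# Group the columns by name and flatten each group into a 1-D array of values.
# Note: groupby(..., axis=1) is deprecated in newer pandas releases.
prod = list(product(*df.groupby(df.columns, axis=1)\
    .apply(lambda x: x.values.reshape(-1, )).tolist()))
prod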
[(1, 0, 'x', 8),
(1, 0, 'x', 9),
(1, 0, 'y', 8),
(1, 0, 'y', 9),
(1, 0, 'z', 8),
(1, 0, 'z', 9)]
df = pd.DataFrame(prod, columns=list('ABCD'))\
    .sort_values('D').reset_index(drop=True)
df
   A  B  C  D
0  1  0  x  8
1  1  0  y  8
2  1  0  z  8
3  1  0  x  9
4  1  0  y  9
5  1  0  z  9
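The snippet above covers a single row, while the question's edit asks for n rows. Below is a rough sketch of how the itertools.product idea might be applied row by row. `expand_rows` is a hypothetical helper, not something from pandas, and it assumes that repeated column names sit next to each other and that empty cells are either '' or NaN.

```python
from itertools import groupby, product

import pandas as pd


def expand_rows(df):
    """Build one output row per combination of the values under each repeated column name."""
    # Runs of identical adjacent column names -> (name, [column positions]).
    groups = [
        (name, [pos for pos, _ in run])
        for name, run in groupby(enumerate(df.columns), key=lambda item: item[1])
    ]
    names = [name for name, _ in groups]

    records = []
    for row in df.itertuples(index=False, name=None):
        choices = []
        for _, positions in groups:
            # Drop empty cells ('' or NaN) within the group.
            values = [row[p] for p in positions
                      if not pd.isna(row[p]) and row[p] != '']
            # Keep one blank so an all-empty group does not erase the row.
            choices.append(values or [''])
        records.extend(product(*choices))
    return pd.DataFrame(records, columns=names)


df = pd.DataFrame(
    [[1, 0, 'x', 'y', 'z', 8, 9],
     [2, 4, 'x', 'b', '', '', '']],
    columns=list('ABCCCDD'),
)
result = expand_rows(df)
print(result)
```

Unlike the single-row version, this keeps the original value types and fills an empty group with a blank, so the second row still yields its two combinations.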