Question

Given a pandas DataFrame in long format:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=['entity', 'attribute', 'value'],
    data=[
        ['hat', 'color', 'black'],
        ['hat', 'shape', 'round'],
        ['flower', 'color', 'blue'],
        ['flower', 'cls', 'setosa']])

┌────────┬───────────┬────────┐
│ entity │ attribute │ value  │
├────────┼───────────┼────────┤
│ hat    │ color     │ black  │
│ hat    │ shape     │ round  │
│ flower │ color     │ blue   │
│ flower │ cls       │ setosa │
└────────┴───────────┴────────┘

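Neither the entities nor the attributes (nor their counts) are known before run time. I want to create a wide table that contains all attributes, with NaN wherever an entity does not have a given attribute: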

┌────────┬───────┬───────┬────────┐
│ entity │ color │ shape │ cls    │
├────────┼───────┼───────┼────────┤
│ hat    │ black │ round │ NaN    │
│ flower │ blue  │ NaN   │ setosa │
└────────┴───────┴───────┴────────┘

Create the wide DataFrame:

wide = df[['entity']].drop_duplicates().set_index('entity')
for col in df['attribute'].drop_duplicates():
    wide[col] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions

wide == 
┌────────┬───────┬───────┬───────┐
│ entity │ color │ shape │ cls   │
├────────┼───────┼───────┼───────┤
│ hat    │ NaN   │ NaN   │ NaN   │
│ flower │ NaN   │ NaN   │ NaN   │
└────────┴───────┴───────┴───────┘

Copy the values from the long df into wide:

for row in df.itertuples(index=False):
    wide.loc[row.entity, row.attribute] = row.value

Is there a way to avoid the loops while keeping the code readable? (Admittedly, the DataFrame is quite small.)
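One loop-free possibility, sketched here under the assumption that each (entity, attribute) pair occurs at most once, is `DataFrame.pivot`. The `reindex` step is optional and is only there to restore the first-appearance order of rows and columns that the loop version produces, since `pivot` sorts both axes.

```python
import pandas as pd

df = pd.DataFrame(
    columns=['entity', 'attribute', 'value'],
    data=[
        ['hat', 'color', 'black'],
        ['hat', 'shape', 'round'],
        ['flower', 'color', 'blue'],
        ['flower', 'cls', 'setosa'],
    ])

# pivot raises ValueError if an (entity, attribute) pair occurs more than once,
# so faulty input fails loudly instead of being silently aggregated.
wide = df.pivot(index='entity', columns='attribute', values='value')

# Optional: pivot sorts rows and columns; restore first-appearance order.
wide = wide.reindex(index=df['entity'].unique(),
                    columns=df['attribute'].unique())

print(wide)
```

Missing combinations, such as the `cls` of `hat`, come out as NaN without any explicit initialisation.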

EDIT: After reading the great pivot post, I ended up with:

df.pivot_table(
    values='value', index='entity', columns='attribute',
    aggfunc=lambda x: ' '.join(x.astype(str)))

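In my case, df should not contain duplicates, so the group-by part of pivot_table is irrelevant. I chose concatenation as the aggregation because it makes faulty entries at least visible (unlike max).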
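To illustrate the difference between concatenation and max, here is a small sketch on a made-up faulty input in which the pair ('hat', 'color') appears twice. The frame `dup` and the names `joined` and `maxed` are hypothetical and only serve the comparison.

```python
import pandas as pd

# Hypothetical faulty input: ('hat', 'color') occurs twice.
dup = pd.DataFrame(
    columns=['entity', 'attribute', 'value'],
    data=[
        ['hat', 'color', 'black'],
        ['hat', 'color', 'red'],
        ['hat', 'shape', 'round'],
    ])

# Concatenation keeps both conflicting values in the cell.
joined = dup.pivot_table(
    values='value', index='entity', columns='attribute',
    aggfunc=lambda x: ' '.join(x.astype(str)))

# max keeps only the lexicographically largest string and hides the conflict.
maxed = dup.pivot_table(
    values='value', index='entity', columns='attribute', aggfunc='max')

print(joined.loc['hat', 'color'])
print(maxed.loc['hat', 'color'])
```

With concatenation the duplicated cell holds both values, so the problem is visible in the result, whereas max returns a single plausible-looking value.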

From long to wide, column names unknown a priori

0 answers: