I am new to Python pandas. I have a CSV file like this:
insectName count weather location time date Condition
aaa 15 sunny balabala 0900:1200 1990-02-10 25
bbb 10 sunny balabala 0900:1200 1990-02-10 25
ccc 20 sunny balabala 0900:1200 1990-02-10 25
ddd 50 sunny balabala 0900:1200 1990-02-10 25
... ... ... ... ... ... ...
XXX 40 sunny balabala 1300:1500 1990-02-15 38
yyy 10 sunny balabala 1300:1500 1990-02-15 38
yyy 25 sunny balabala 1300:1500 1990-02-15 38
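The file contains a lot of data, and an insectName can appear more than once on the same day. I want to reshape the data by 'date', so that each day becomes a single row. Like this: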
insectName count insectName count insectName count weather location time date Condition
ccc 20 bbb 10 aaa 15 sunny balabala 0900:1200 1990-02-10 25
yyy 25 yyy 10 XXX 40 sunny balabala 1300:1500 1990-02-15 38
... ... ... ... ... ... ... ... ... ... ...
How can I do this?
Answer 0: (score: 0)
There is a groupby/cumcount/unstack trick that converts a long-format DataFrame into a wide-format DataFrame:
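import pandas as pd

# Read the whitespace-delimited file; the raw string avoids an invalid-escape warning
df = pd.read_table('data', sep=r'\s+')
common = ['weather', 'location', 'time', 'date', 'Condition']
grouped = df.groupby(common)
# Number the rows within each group: 0, 1, 2, ...
df['idx'] = grouped.cumcount()
df2 = df.set_index(common+['idx'])
# Move idx from the index into the columns, giving (column, idx) pairs
df2 = df2.unstack('idx')
df2 = df2.swaplevel(0, 1, axis=1)
# sortlevel was removed in pandas 1.0; sorting on idx alone is meant to keep
# each insectName/count pair in its original order
df2 = df2.sort_index(axis=1, level=0, sort_remaining=False)
df2.columns = df2.columns.droplevel(0)
df2 = df2.reset_index()
print(df2)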
yields
weather location time date Condition insectName count \
0 sunny balabala 0900:1200 1990-02-10 25 aaa 15
1 sunny balabala 1300:1500 1990-02-15 38 XXX 40
insectName count insectName count insectName count
0 bbb 10 ccc 20 ddd 50
1 yyy 10 yyy 25 NaN NaN
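To see what the intermediate idx column does, here is a minimal self-contained sketch. It assumes the sample rows from the question (with the elided "..." rows left out) and reads them from an in-memory string instead of the 'data' file; the variable name `data` is only for illustration.

```python
import io
import pandas as pd

# Sample rows copied from the question; the elided "..." rows are omitted.
data = io.StringIO("""\
insectName count weather location time date Condition
aaa 15 sunny balabala 0900:1200 1990-02-10 25
bbb 10 sunny balabala 0900:1200 1990-02-10 25
ccc 20 sunny balabala 0900:1200 1990-02-10 25
ddd 50 sunny balabala 0900:1200 1990-02-10 25
XXX 40 sunny balabala 1300:1500 1990-02-15 38
yyy 10 sunny balabala 1300:1500 1990-02-15 38
yyy 25 sunny balabala 1300:1500 1990-02-15 38
""")
df = pd.read_csv(data, sep=r'\s+')

common = ['weather', 'location', 'time', 'date', 'Condition']
# cumcount numbers the rows inside each group, restarting at 0 per group
df['idx'] = df.groupby(common).cumcount()
print(df[['insectName', 'count', 'date', 'idx']])
```

The first day gets idx values 0 through 3 and the second day 0 through 2. Because idx is unique within each group, unstack('idx') can spread the rows of a group across columns without collisions, even when an insectName such as yyy is repeated.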
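While the wide format may be useful for presentation, note that the long format is usually the right format for data processing. See Hadley Wickham's article on the virtues of tidy data (PDF).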
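As a rough illustration of why the long format tends to be easier to process, the sketch below aggregates the question's sample rows directly with groupby. The names `long_df`, `per_day`, and `per_insect` are hypothetical, and the elided "..." rows are again left out.

```python
import io
import pandas as pd

# Sample rows copied from the question; the elided "..." rows are omitted.
data = io.StringIO("""\
insectName count weather location time date Condition
aaa 15 sunny balabala 0900:1200 1990-02-10 25
bbb 10 sunny balabala 0900:1200 1990-02-10 25
ccc 20 sunny balabala 0900:1200 1990-02-10 25
ddd 50 sunny balabala 0900:1200 1990-02-10 25
XXX 40 sunny balabala 1300:1500 1990-02-15 38
yyy 10 sunny balabala 1300:1500 1990-02-15 38
yyy 25 sunny balabala 1300:1500 1990-02-15 38
""")
long_df = pd.read_csv(data, sep=r'\s+')

# Total insects counted per day: one groupby on the long frame
per_day = long_df.groupby('date')['count'].sum()
print(per_day)  # 95 for 1990-02-10, 75 for 1990-02-15

# Total per insect per day; the two yyy rows on 1990-02-15 are summed
per_insect = long_df.groupby(['date', 'insectName'])['count'].sum()
print(per_insect)
```

Doing the same on the wide frame would mean collecting every repeated count column first, and the number of those columns changes with the data.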