How can I use Python pandas to parse a CSV into the format I want?

Time: 2015-02-25 10:12:44

Tags: python csv pandas

I am new to Python pandas. I have a CSV file like this:

insectName   count   weather  location   time        date      Condition
  aaa         15      sunny   balabala  0900:1200   1990-02-10     25
  bbb         10      sunny   balabala  0900:1200   1990-02-10     25
  ccc         20      sunny   balabala  0900:1200   1990-02-10     25
  ddd         50      sunny   balabala  0900:1200   1990-02-10     25
  ...        ...      ...      ...        ...            ...       ...
  XXX         40      sunny   balabala  1300:1500   1990-02-15     38
  yyy         10      sunny   balabala  1300:1500   1990-02-15     38
  yyy         25      sunny   balabala  1300:1500   1990-02-15     38

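The file contains a lot of data, and the same insectName can appear more than once on a given day. I want to reshape the data by 'date', so that each day becomes a single row. Like this: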

insectName  count  insectName  count  insectName  count  weather  location  time        date      Condition
  ccc         20      bbb       10       aaa        15    sunny   balabala  0900:1200   1990-02-10     25
  yyy         25      yyy       10       XXX        40    sunny   balabala  1300:1500   1990-02-15     38
  ...        ...      ...      ...       ...        ...    ...      ...        ...            ...        ...     

How can I do this?

1 answer:

Answer 0: (score: 0)

The groupby/cumcount/unstack trick converts a long-format DataFrame into a wide-format DataFrame:

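import pandas as pd

# The sample is whitespace-delimited; 'data' is the path to the input file.
df = pd.read_table('data', sep=r'\s+')

# Columns that are constant within one observation session (one output row).
common = ['weather', 'location', 'time', 'date', 'Condition']
grouped = df.groupby(common)
# Number the rows within each group: 0, 1, 2, ...
df['idx'] = grouped.cumcount()
df2 = df.set_index(common+['idx'])
# Move idx into the columns, giving (name, idx) column pairs.
df2 = df2.unstack('idx')
df2 = df2.swaplevel(0, 1, axis=1)
# DataFrame.sortlevel no longer exists in current pandas; sort on the idx
# level only so that insectName stays ahead of count within each pair.
df2 = df2.sort_index(axis=1, level=0, sort_remaining=False)
df2.columns = df2.columns.droplevel(0)
df2 = df2.reset_index()
print(df2)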

which yields

  weather  location       time        date  Condition insectName  count  \
0   sunny  balabala  0900:1200  1990-02-10         25        aaa     15   
1   sunny  balabala  1300:1500  1990-02-15         38        XXX     40   

  insectName  count insectName  count insectName  count  
0        bbb     10        ccc     20        ddd     50  
1        yyy     10        yyy     25        NaN    NaN  
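
The result above repeats the column names insectName and count, which makes individual columns awkward to select afterwards. As a rough sketch of one way around that, the same groupby/cumcount/unstack steps can be followed by flattening the two-level header into unique names. The inline sample data (the rows visible in the question, without the elided "..." rows) and the `insectName_0`, `count_0` naming scheme are assumptions made for illustration, not part of the original answer.

```python
import io
import pandas as pd

# Sample rows copied from the question; the elided "..." rows are left out.
raw = """insectName count weather location time date Condition
aaa 15 sunny balabala 0900:1200 1990-02-10 25
bbb 10 sunny balabala 0900:1200 1990-02-10 25
ccc 20 sunny balabala 0900:1200 1990-02-10 25
ddd 50 sunny balabala 0900:1200 1990-02-10 25
XXX 40 sunny balabala 1300:1500 1990-02-15 38
yyy 10 sunny balabala 1300:1500 1990-02-15 38
yyy 25 sunny balabala 1300:1500 1990-02-15 38
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

common = ["weather", "location", "time", "date", "Condition"]

# cumcount numbers the rows inside each group: 0, 1, 2, ...
df["idx"] = df.groupby(common).cumcount()

# Pivot so that each idx value gets its own insectName/count columns.
wide = df.set_index(common + ["idx"]).unstack("idx")

# Interleave the columns slot by slot, then flatten the two-level header
# into unique names such as insectName_0, count_0 (assumed naming scheme).
slots = wide.columns.get_level_values("idx").unique()
ordered = [(name, i) for i in slots for name in ("insectName", "count")]
wide = wide[ordered]
wide.columns = [f"{name}_{i}" for name, i in wide.columns]
wide = wide.reset_index()
print(wide)
```

Groups with fewer rows than the largest group are padded with NaN, so the count columns may come back as floats rather than integers.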

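While the wide format may be useful for presentation, note that the long format is usually the right format for data processing. See Hadley Wickham's article on the virtues of tidy data (PDF).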
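
To illustrate why the long format tends to be easier to work with, here is a small sketch of two aggregations done directly on long data. The DataFrame is rebuilt by hand from the rows visible in the question, and the variable names (`long_df`, `per_day`, `per_insect`) are made up for this example.

```python
import pandas as pd

# Long-format sample rebuilt from the rows shown in the question.
long_df = pd.DataFrame({
    "insectName": ["aaa", "bbb", "ccc", "ddd", "XXX", "yyy", "yyy"],
    "count": [15, 10, 20, 50, 40, 10, 25],
    "date": ["1990-02-10"] * 4 + ["1990-02-15"] * 3,
})

# Total insects observed per day: a single groupby on the long table.
per_day = long_df.groupby("date")["count"].sum()

# Total per insect per day; repeated names (yyy) are summed together.
per_insect = long_df.groupby(["date", "insectName"])["count"].sum()

print(per_day)
print(per_insect)
```

The same totals on the wide table would require adding up every count column by hand and tracking which insectName column each one belongs to.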