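Slicing and vstacking an interleaved DataFrame with pandas

Asked: 2014-04-19 18:21:39

Tags: python pandas dataframe

The code below reads CSV files whose rows have the layout [A B C D E F G H D E F G H D E F G H ...] and converts each row into stacked rows of the form [A B C D E F G H], preserving the original order.

This is the data source: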

http://web.mta.info/developers/data/nyct/turnstile/turnstile_110507.txt

Here is a sample input line and the expected result:

input_line = """A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585,05-21-11,04:00:00,REGULAR,003169415,001097588,05-21-11,08:00:00,REGULAR,003169431,001097607,05-21-11,12:00:00,REGULAR,003169506,001097686,05-21-11,16:00:00,REGULAR,003169693,001097734,05-21-11,20:00:00,REGULAR,003169998,001097769,05-22-11,00:00:00,REGULAR,003170119,001097792,05-22-11,04:00:00,REGULAR,003170146,001097801"""

output_lines = """
A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585
A002,R051,02-00-00,05-21-11,04:00:00,REGULAR,003169415,001097588
A002,R051,02-00-00,05-21-11,08:00:00,REGULAR,003169431,001097607
A002,R051,02-00-00,05-21-11,12:00:00,REGULAR,003169506,001097686
A002,R051,02-00-00,05-21-11,16:00:00,REGULAR,003169693,001097734
A002,R051,02-00-00,05-21-11,20:00:00,REGULAR,003169998,001097769
A002,R051,02-00-00,05-22-11,00:00:00,REGULAR,003170119,001097792
A002,R051,02-00-00,05-22-11,04:00:00,REGULAR,003170146,001097801
"""




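import csv

# `filenames` is assumed to be a list of input file paths defined elsewhere.
for name in filenames:
    # Text mode with newline="" is what the csv module expects on Python 3.
    with open(name, newline="") as f, open("updated_" + name, "w", newline="") as fw:
        reader = csv.reader(f)
        writer = csv.writer(fw)
        for row in reader:
            header = row[0:3]  # columns A, B, C
            # split the rest of the row into groups of five: D, E, F, G, H
            readings = [row[x:x+5] for x in range(3, len(row), 5)]
            for elem in readings:
                writer.writerow(header + elem)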


Is there a way to do this with pandas and DataFrame slicing?

1 Answer:

Answer 0 (score: 1)

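I was not able to download the full dataset. Is it for MTA internal use only?

Are the 1st, 2nd and 3rd columns always the same within one file? The solution below assumes they are.

If instead each line carries its own values in columns 1 to 3, a small modification is needed: generate a DataFrame for each line using the method below, then combine them into one.

If a single line contains more than one full ABCDEFGH group, a better approach is needed.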

In [68]:

import numpy as np
import pandas as pd

df=input_line.split(',')
df_1stpt=df[:8]  # the leading row
df_2ndpt=np.array(df[8:]).reshape((-1,5))  # reshape the remaining fields into rows of 5
df_1stpt=pd.DataFrame(df_1stpt).T  # one-row DataFrame holding the leading row
df_2ndpt=pd.DataFrame(df_2ndpt,columns=range(3,8))  # remaining rows, aligned to columns 3..7
df_rst=pd.concat([df_1stpt, df_2ndpt], ignore_index=True)  # stack them together
df_rst.iloc[:,[0,1,2]]=df_rst.iloc[0,[0,1,2]].values  # fill the NaNs in columns 0..2
In [69]:

print(df_rst)
      0     1         2         3         4        5          6          7
0  A002  R051  02-00-00  05-21-11  00:00:00  REGULAR  003169391  001097585
1  A002  R051  02-00-00  05-21-11  04:00:00  REGULAR  003169415  001097588
2  A002  R051  02-00-00  05-21-11  08:00:00  REGULAR  003169431  001097607
3  A002  R051  02-00-00  05-21-11  12:00:00  REGULAR  003169506  001097686
4  A002  R051  02-00-00  05-21-11  16:00:00  REGULAR  003169693  001097734
5  A002  R051  02-00-00  05-21-11  20:00:00  REGULAR  003169998  001097769
6  A002  R051  02-00-00  05-22-11  00:00:00  REGULAR  003170119  001097792
7  A002  R051  02-00-00  05-22-11  04:00:00  REGULAR  003170146  001097801

[8 rows x 8 columns]
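
The per-line variant mentioned above (build one DataFrame per line, then combine them) could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `row_to_frame` and `lines_to_frame` and the constants `N_KEY` and `N_READING` are made up for this example, and every line is assumed to hold exactly 3 key fields followed by a whole number of 5-field readings. A line that does not fit that layout will make `reshape` raise a `ValueError`.

```python
import csv

import numpy as np
import pandas as pd

N_KEY = 3      # leading columns shared by every reading in a line (A, B, C)
N_READING = 5  # columns per reading (D, E, F, G, H)


def row_to_frame(fields):
    """Turn one wide row [A B C D..H D..H ...] into a long DataFrame."""
    key = fields[:N_KEY]
    # Reshape the remaining fields into one row per reading.
    readings = np.array(fields[N_KEY:], dtype=object).reshape(-1, N_READING)
    frame = pd.DataFrame(readings, columns=range(N_KEY, N_KEY + N_READING))
    # Put the key columns in front; a scalar value is repeated for every row.
    for pos, value in enumerate(key):
        frame.insert(pos, pos, value)
    return frame


def lines_to_frame(lines):
    """Apply row_to_frame to each CSV line and stack the results in order."""
    frames = [row_to_frame(row) for row in csv.reader(lines) if row]
    return pd.concat(frames, ignore_index=True)
```

Since `csv.reader` accepts any iterable of strings, an open file object can be passed straight to `lines_to_frame`, and the result can be written back out with `DataFrame.to_csv(..., header=False, index=False)`. All values stay as strings, so the leading zeros in the counter columns are preserved.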