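The following code reads CSV files whose rows have the form [A B C D E F G H D E F G H D E F G H ...] and converts them into [A B C D E F G H] rows stacked in the same order.
Here is the data source: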
http://web.mta.info/developers/data/nyct/turnstile/turnstile_110507.txt
Here is a single input line and the result expected for it:
input_line = """A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585,05-21-11,04:00:00,REGULAR,003169415,001097588,05-21-11,08:00:00,REGULAR,003169431,001097607,05-21-11,12:00:00,REGULAR,003169506,001097686,05-21-11,16:00:00,REGULAR,003169693,001097734,05-21-11,20:00:00,REGULAR,003169998,001097769,05-22-11,00:00:00,REGULAR,003170119,001097792,05-22-11,04:00:00,REGULAR,003170146,001097801"""
output_lines = """
A002,R051,02-00-00,05-21-11,00:00:00,REGULAR,003169391,001097585
A002,R051,02-00-00,05-21-11,04:00:00,REGULAR,003169415,001097588
A002,R051,02-00-00,05-21-11,08:00:00,REGULAR,003169431,001097607
A002,R051,02-00-00,05-21-11,12:00:00,REGULAR,003169506,001097686
A002,R051,02-00-00,05-21-11,16:00:00,REGULAR,003169693,001097734
A002,R051,02-00-00,05-21-11,20:00:00,REGULAR,003169998,001097769
A002,R051,02-00-00,05-22-11,00:00:00,REGULAR,003170119,001097792
A002,R051,02-00-00,05-22-11,04:00:00,REGULAR,003170146,001097801
"""
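import csv

# filenames: list of input CSV paths, defined elsewhere
for name in filenames:
    # Python 3: the csv module expects text mode with newline=""
    with open(name, newline="") as f, open("updated_" + name, "w", newline="") as fw:
        reader = csv.reader(f)
        writer = csv.writer(fw)
        for row in reader:
            header = row[0:3]  # A, B, C: shared by every reading in the row
            # split the rest of the row into consecutive 5-field readings (D..H)
            readings = [row[x:x+5] for x in range(3, len(row), 5)]
            for elem in readings:
                writer.writerow(header + elem)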
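Is there a way to do this with pandas and DataFrame slicing?
Answer 0 (score: 1)
I was not able to download the full dataset in any case. Is it for MTA internal use only?
Are the first, second, and third columns always the same within one file? That is the assumption behind the following solution:
If the first three columns are only shared within each row, a small modification is needed: basically, build a DataFrame for each row with the method below and then combine them into one.
If a single row contains more than one full ABCDEFGH record, a better approach is needed.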
In [68]:
import numpy as np
import pandas as pd

df=input_line.split(',')
df_1stpt=df[:8] #the leading row
df_2ndpt=np.array(df[8:]).reshape((-1,5)) #get the rest rows into the right shape
df_1stpt=pd.DataFrame(df_1stpt).T #create a dataframe containing the leading row
df_2ndpt=pd.DataFrame(df_2ndpt,columns=range(3,8)) #create a DF of the rest rows, with the right col idx
df_rst=pd.concat([df_1stpt, df_2ndpt], ignore_index=True) #put them together
df_rst.iloc[:,[0,1,2]]=df_rst.iloc[0,[0,1,2]].values #fill the nan's
In [69]:
print(df_rst)
      0     1         2         3         4        5          6          7
0  A002  R051  02-00-00  05-21-11  00:00:00  REGULAR  003169391  001097585
1  A002  R051  02-00-00  05-21-11  04:00:00  REGULAR  003169415  001097588
2  A002  R051  02-00-00  05-21-11  08:00:00  REGULAR  003169431  001097607
3  A002  R051  02-00-00  05-21-11  12:00:00  REGULAR  003169506  001097686
4  A002  R051  02-00-00  05-21-11  16:00:00  REGULAR  003169693  001097734
5  A002  R051  02-00-00  05-21-11  20:00:00  REGULAR  003169998  001097769
6  A002  R051  02-00-00  05-22-11  00:00:00  REGULAR  003170119  001097792
7  A002  R051  02-00-00  05-22-11  04:00:00  REGULAR  003170146  001097801
[8 rows x 8 columns]
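The per-row variant mentioned above (one DataFrame per line, then combined) could look roughly like the sketch below. It is only an illustration under stated assumptions: every line starts with 3 key fields followed by complete 5-field readings, and the names `line_to_frame`, `lines_to_frame`, `N_KEY`, and `N_READING` are made up here, not part of pandas or of the original code.

```python
import pandas as pd

N_KEY = 3      # assumed: leading key fields (A, B, C)
N_READING = 5  # assumed: fields per repeated reading (D..H)


def line_to_frame(line):
    """Turn one wide line into a frame with one row per reading."""
    fields = line.strip().split(",")
    key, rest = fields[:N_KEY], fields[N_KEY:]
    # Cut the tail into consecutive 5-field readings; an incomplete
    # trailing group is skipped rather than padded.
    rows = [key + rest[i:i + N_READING]
            for i in range(0, len(rest) - N_READING + 1, N_READING)]
    return pd.DataFrame(rows, columns=range(N_KEY + N_READING))


def lines_to_frame(lines):
    """Stack the per-line frames in input order, ignoring blank lines."""
    frames = [line_to_frame(ln) for ln in lines if ln.strip()]
    # pd.concat raises ValueError if there are no frames at all.
    return pd.concat(frames, ignore_index=True)
```

Since any iterable of strings works, an open file object can be passed straight to `lines_to_frame`. All values stay strings, so leading zeros such as `003169391` are preserved; convert the counter columns explicitly if numbers are needed.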