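I'm trying to write a simple script that converts a CSV output file from a Fortran code into a Pandas DataFrame object so that I can do more analysis. The CSV has two columns, but it is made up of multiple appended data blocks, each of shape [n, 2] (each sample name has the format RN_x). I came up with the following code, but the resulting DataFrame object does not allow analysis. I've attached a sample file below (the original is much larger). By the way, the first column in the data file is a date, but the number in the output corresponds to the day in the simulation. Any suggestions would be much appreciated.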
import numpy as np
import pandas as pd
import csv as csv
readdata = csv.reader(open('C:/data/Test.csv', 'r'))
data = []
for row in readdata:
    data.append(row)
a = np.array(data).reshape(11,-1, order = 'F')
col = a[0,:4].reshape(4)
row = pd.Index(a[4:,0:1].reshape(7))
b = a[4:,5:]
df = pd.DataFrame(b, index = row, columns = col)
Sample:
RN_48865,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
182,0.000014
274,0.000014
366,0.000014
457,0.000014
548,0.000014
RN_7445,
1,Observed
1,0
259,Computed
1,0.000013
91,0.000013
182,0.000013
274,0.000013
366,0.000013
457,0.000013
548,0.000013
RN_9288,
1,Observed
1,0
259,Computed
1,0.000011
91,0.000011
182,0.000011
274,0.000011
366,0.000011
457,0.000011
548,0.000011
RN_10955,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
182,0.000014
274,0.000014
366,0.000014
457,0.000014
548,0.000014
Example output:
Index,RN_48865,RN_7445,RN_9288,RN_10955
1,0.000014,0.000013,0.000011,0.000014
91,0.000014,0.000013,0.000011,0.000014
182,0.000014,0.000013,0.000011,0.000014
274,0.000014,0.000013,0.000011,0.000014
366,0.000014,0.000013,0.000011,0.000014
457,0.000014,0.000013,0.000011,0.000014
548,0.000014,0.000013,0.000011,0.000014
Answer 0 (score: 1)
You're actually asking several questions. Here is what I could work out from the desired output:
source="""RN_48865,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
182,0.000014
274,0.000014
366,0.000014
457,0.000014
548,0.000014
RN_7445,
1,Observed
1,0
259,Computed
1,0.000013
91,0.000013
182,0.000013
274,0.000013
366,0.000013
457,0.000013
548,0.000013
RN_9288,
1,Observed
1,0
259,Computed
1,0.000011
91,0.000011
182,0.000011
274,0.000011
366,0.000011
457,0.000011
548,0.000011
RN_10955,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
182,0.000014
274,0.000014
366,0.000014
457,0.000014
548,0.000014
"""
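import pandas as pd
import numpy as np
from io import StringIO  # Python 3 location; Python 2 used the StringIO module

# Read everything as two unlabeled columns; "RN_x," lines leave column 1 empty.
df = pd.read_csv(StringIO(source), header=None)
# Row positions where each block starts (the "RN_x" lines).
rns = np.where(df[0].apply(lambda x: x.lstrip().startswith('RN_')))[0]
# Assumes every block has the same number of rows.
length = rns[1] - rns[0]
# The first 4 rows of a block are header rows; the rest are (day, value) pairs.
index = df[0].iloc[4:length]
cols = df[0][::length].apply(lambda x: x.lstrip()).values
result_df = pd.DataFrame(index=index)
for col_num, col_start in enumerate(range(0, len(df), length)):
    # Convert the values to float so the table can be used for analysis.
    result_df[cols[col_num]] = df[1].iloc[col_start + 4 : col_start + length].astype(float).values
print(result_df)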
Output:
RN_48865 RN_7445 RN_9288 RN_10955
1 0.000014 0.000013 0.000011 0.000014
91 0.000014 0.000013 0.000011 0.000014
182 0.000014 0.000013 0.000011 0.000014
274 0.000014 0.000013 0.000011 0.000014
366 0.000014 0.000013 0.000011 0.000014
457 0.000014 0.000013 0.000011 0.000014
548 0.000014 0.000013 0.000011 0.000014
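The code above assumes that every RN_ block has the same number of rows. If the blocks can differ in length, a line-by-line parse is one possible alternative. This is only a sketch: the helper name `parse_blocks` is hypothetical, and it assumes that each block starts with an `RN_` line and that the values to keep are the rows following the line whose second field is `Computed`. The sample is trimmed from the data in the question, with the second block deliberately one row longer.

```python
import csv
from io import StringIO

import pandas as pd


def parse_blocks(lines):
    """Collect the 'Computed' rows of each RN_ block into one DataFrame.

    Hypothetical helper: assumes each block starts with an 'RN_...' line and
    that the rows after the '<n>,Computed' line are (day, value) pairs.
    """
    series = {}
    name = None
    in_computed = False
    for row in csv.reader(lines):
        if not row:
            continue
        first = row[0].strip()
        second = row[1].strip() if len(row) > 1 else ""
        if first.startswith("RN_"):
            # A new block begins; start a fresh day -> value mapping.
            name = first
            series[name] = {}
            in_computed = False
        elif second == "Computed":
            in_computed = True
        elif second == "Observed":
            in_computed = False
        elif in_computed and name is not None:
            series[name][int(first)] = float(second)
    # Columns follow the order of the blocks; rows are aligned on the day number.
    return pd.DataFrame(series)


sample = """RN_48865,
1,Observed
1,0
259,Computed
1,0.000014
91,0.000014
RN_7445,
1,Observed
1,0
259,Computed
1,0.000013
91,0.000013
182,0.000013
"""

result = parse_blocks(StringIO(sample))
print(result)
```

Days that are missing from a shorter block show up as NaN in that column, so unequal blocks do not break the alignment.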
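For the dates, use:
pandas.read_csv('file',
parse_dates=[0], # 0th column
# date_parser is deprecated since pandas 2.0; each value is treated as a day count
date_parser=lambda x: pandas.Timestamp('1995-1-1') + pandas.Timedelta(days=int(x)))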
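Another option is to convert the day numbers after the table has been built, which sidesteps `date_parser` entirely. A minimal sketch, assuming the same 1995-01-01 start date as the snippet above and an index of integer day counts; the small `result_df` here is just a stand-in built from the first rows of the output shown earlier.

```python
import pandas as pd

# Stand-in for the table built earlier: the index is the simulation day number.
result_df = pd.DataFrame(
    {
        "RN_48865": [0.000014, 0.000014, 0.000014],
        "RN_7445": [0.000013, 0.000013, 0.000013],
    },
    index=[1, 91, 182],
)

# Assumed start date (taken from the snippet above); adjust to the real simulation start.
start = pd.Timestamp("1995-01-01")

# Interpret each index entry as a number of days elapsed since the start date.
result_df.index = start + pd.to_timedelta(result_df.index.astype(int), unit="D")
print(result_df)
```

Whether day 1 should map to the start date itself (which would mean subtracting one day) depends on the simulation's convention, which the question does not state.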