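I'm new to Python. I have a file containing five climate-data replicates, and I want to split it into individual replicates. Each replicate is 42734 rows long, and the total length of the dataframe (df) is 213,674.
Each replicate is separated by a row whose first entry is "Replicate". Below, I show the column headings above one of these separator rows.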
Index  year       Month  Day  Rain  Evap  Max_Temp
42734  Replicate  #      2    nan   nan   nan
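I tried the following code, which is very clunky. Since I eventually have to generate 100 climate replicates, it is not practical. I know there must be an easier way to do this, but I don't have enough experience with Python to work it out. This is the code I wrote: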
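# Import the replicate .txt file into a dataframe
import pandas as pd
# sep=r"\s+" splits on runs of whitespace (r"\s*" can also match the empty string)
df = pd.read_table('5_replicates.txt', sep=r"\s+",
                   skiprows=12, engine='python', header=None,
                   names=['year', 'Month', 'Day', 'Rain', 'Evap', 'Max_T'])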
len(df)
i = 42734
num_replicates = 5
# Replicate 1
replicate_1 = df[0:i]
print("length of replicate_1:", len(replicate_1))
# Replicate 2 (each later slice skips one more separator row)
replicate_2 = df[i+1 : 2*i+1]
print("length of replicate_2:", len(replicate_2))
# Replicate 3
replicate_3 = df[2*i+2 : 3*i+2]
print("length of replicate_3:", len(replicate_3))
# Replicate 4
replicate_4 = df[3*i+3 : 4*i+3]
print("length of replicate_4:", len(replicate_4))
# Replicate 5
replicate_5 = df[4*i+4 : 5*i+4]
print("length of replicate_5:", len(replicate_5))
Any help would be much appreciated!
Answer 0 (score: 0)
## create the example data frame
import numpy as np
import pandas as pd
df = pd.DataFrame({'year': pd.date_range(start='2016-01-01', end='2017-01-01', freq='H'),
                   'rain': np.random.randn(8785), 'max_temp': np.random.randn(8785)})
df.year = df.year.astype(str)  # make the year column of str type
## mark five evenly spaced rows as "Replicate" (.loc with integer labels; .ix was removed in pandas 1.0)
df.loc[np.floor(np.linspace(0, df.shape[0]-1, 5)).astype(int), 'year'] = "Replicate"
In [7]: df.head()
Out[7]:
   max_temp      rain                 year
0 -1.068354  0.959108            Replicate
1 -0.219425  0.777235  2016-01-01 01:00:00
2 -0.262994  0.472665  2016-01-01 02:00:00
3 -1.761527 -0.515135  2016-01-01 03:00:00
4 -2.038738 -1.452385  2016-01-01 04:00:00
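Here is what I do in the code that follows:
1) Find the rows where the word "Replicate" appears and record their indexes.
2) Create a Python range for each block, recording which rows belong to which replicate, and store these ranges in the dictionary idx_dict.
3) Finally, assign a replicate number to each block, although once you have the range objects you don't strictly need to do this.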
#1) find where the word "Replicate" is featured
indexes = df[df.year == 'Replicate'].index
#2) create the range objects
idx_dict = {}
for i in range(0, indexes.shape[0]-1):
    idx_dict[i] = range(indexes[i], indexes[i+1]-1)
#3) set the replicate number in some column
df.loc[:, 'rep_num'] = np.nan  # preset a value for the 'rep_num' column
for i in idx_dict:
    print(i)
    df.loc[list(idx_dict[i]), 'rep_num'] = i
# forward-fill the NaNs because my indexing algorithm isn't splendid:
# each range stops one row short, and the final "Replicate" row is never covered
df['rep_num'] = df['rep_num'].ffill()
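The three steps above can probably be collapsed into a single vectorized step. The following is only a sketch, not the approach used in this answer: it assumes every replicate begins with a marker row whose year entry is exactly "Replicate", and it uses a small made-up frame (demo, rows_per_rep and n_reps are hypothetical names) in place of the real data. The running count of marker rows then serves as the replicate label.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 3 replicates, each a marker row plus 4 data rows.
rows_per_rep = 4
n_reps = 3
years = []
for _ in range(n_reps):
    years.append("Replicate")
    years.extend(str(1900 + k) for k in range(rows_per_rep))
demo = pd.DataFrame({"year": years, "rain": np.arange(len(years), dtype=float)})

# Boolean mask that is True on every marker row.
is_marker = demo["year"] == "Replicate"

# The cumulative count of markers seen so far labels each row;
# subtracting 1 makes the first replicate number 0.
demo["rep_num"] = is_marker.cumsum() - 1

# Rows per replicate (marker row included).
counts = demo.groupby("rep_num").size()
print(counts)
```

If the file has no marker row before the first replicate, as the question's sample suggests, dropping the "- 1" should give the same zero-based labels, because the rows before the first marker already count as 0.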
Now you can group df by replicate number, or store each portion elsewhere, as needed.
#get the number of rows in each replicate:
In [26]: df.groupby("rep_num").count()
Out[26]:
         max_temp  rain  year
rep_num
0.0          2196  2196  2196
1.0          2196  2196  2196
2.0          2196  2196  2196
3.0          2197  2197  2197
#get the portion with the first replicate
In [27]: df.loc[df.rep_num==0,:].head()
Out[27]:
   max_temp      rain                 year  rep_num
0  0.976052  0.896358            Replicate      0.0
1 -0.875221 -1.110111  2016-01-01 01:00:00      0.0
2 -0.305727  0.495230  2016-01-01 02:00:00      0.0
3  0.694737 -0.356541  2016-01-01 03:00:00      0.0
4  0.325071  0.669536  2016-01-01 04:00:00      0.0
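To store each replicate separately, for example when there are 100 of them, one option is to drop the separator rows and collect the groups in a dictionary. This is a minimal sketch on a made-up frame that already carries a rep_num column; the names demo, data_rows and replicates are hypothetical and do not come from the original code.

```python
import pandas as pd

# Hypothetical miniature frame with two labelled replicates.
demo = pd.DataFrame({
    "year": ["Replicate", "2016", "2017", "Replicate", "2016", "2017"],
    "rain": [float("nan"), 1.0, 2.0, float("nan"), 3.0, 4.0],
    "rep_num": [0, 0, 0, 1, 1, 1],
})

# Remove the separator rows so only climate data remains.
data_rows = demo[demo["year"] != "Replicate"]

# One DataFrame per replicate, keyed by replicate number,
# each with its own index restarting at 0.
replicates = {
    int(rep): block.reset_index(drop=True)
    for rep, block in data_rows.groupby("rep_num")
}

print(replicates[1])
```

The dictionary grows with the number of replicates, so no per-replicate variable such as replicate_1, replicate_2, and so on needs to be written by hand.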