Sampling from a pandas DataFrame inside a loop without repeating the same selection

Date: 2016-07-05 16:13:13

Tags: python pandas dataframe

I have simplified the code as much as I can. It is still long, but it should illustrate the problem.

I am sampling weather data from a dataframe:

import numpy as np
import pandas as pd

#dataframe 
dates = pd.date_range('19510101',periods=16000)
data = pd.DataFrame(data=np.random.randint(0,100,(16000,1)), columns =list('A'))
data['date'] = dates
data = data[['date','A']]

#create year and season column
def get_season(row):
    if row['date'].month >= 3 and row['date'].month <= 5:
        return '2'
    elif row['date'].month >= 6 and row['date'].month <= 8:
        return '3'
    elif row['date'].month >= 9 and row['date'].month <= 11:
        return '4'
    else:
        return '1'

data['Season'] = data.apply(get_season, axis=1)
data['Year'] = data['date'].dt.year
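The row-wise `apply` above calls `get_season` once per row. As a rough alternative, the same labels can be computed arithmetically from the month number. This is only a sketch: the frame `demo` and the variable `months` are hypothetical names, not part of the question.

```python
import pandas as pd

# Hypothetical small frame standing in for `data` from the question.
demo = pd.DataFrame({'date': pd.date_range('19510101', periods=730)})

# Map months to the question's season labels:
# Dec-Feb -> '1', Mar-May -> '2', Jun-Aug -> '3', Sep-Nov -> '4'.
months = demo['date'].dt.month
demo['Season'] = ((months % 12) // 3 + 1).astype(str)
demo['Year'] = demo['date'].dt.year
```

The labels stay strings, so queries of the form `Season == '1'` used later in the question keep working.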

I want to pick random years using a predetermined set of (year, season) tuples:

#generate an index of year and season tuples
index =  [(1951, '1'),
 (1951, '2'),
 (1952, '4'),
 (1954, '3'),
 (1955, '1'),
 (1955, '2'),
 (1956, '3'),
 (1960, '4'),
 (1961, '3'),
 (1962, '2'),
 (1962, '3'),
 (1979, '2'),
 (1979, '3'),
 (1980, '4'),
 (1983, '2'),
 (1984, '2'),
 (1984, '4'),
 (1985, '3'),
 (1986, '1'),
 (1986, '2'),
 (1986, '3'),
 (1987, '4'),
 (1991, '1'),
 (1992, '4')]

I sample from this in the following way:

Generate 4 lists, each containing the years for one season (one for winter, one for spring, and so on):

coldsample = [[],[],[],[]] #empty list of lists
for (yr,se) in index: 
    coldsample[int(se)-1] += [yr] #function which gives the years which have extreme seasons [[1],[2],[3],[4]]
coldsample
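The same grouping can also be keyed by the season label instead of by list position, which avoids the `int(se)-1` arithmetic. The sketch below uses a shortened, hypothetical `index` and a hypothetical name `years_by_season`; it is an illustration, not the asker's code.

```python
from collections import defaultdict

# A shortened, hypothetical version of the question's `index` list.
index = [(1951, '1'), (1951, '2'), (1952, '4'), (1954, '3'), (1955, '1')]

# Collect the years that belong to each season label.
years_by_season = defaultdict(list)
for yr, se in index:
    years_by_season[se].append(yr)
```

With this layout, `years_by_season['1']` holds the winter years and so on for the other labels.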

Pick one random year from each of these lists:

cold_ctr = 0 #variable to count from (1 is winter, 2 spring, 3 summer, 4 autumn)
coldseq = [] #blank list
for yrlist in coldsample: 
        ran_yr = np.random.choice(yrlist, 1) #choose a randomly sampled year from previous cell
        cold_ctr += 1 # increment cold_ctr variable by 1
        coldseq += [(ran_yr[0], cold_ctr)] #populate coldseq with a random year and a random season (in order)

Then build a new dataframe by selecting several random years:

df = []
for i in range (5): #change the number here to change the number of output years
    for item in coldseq: #item is a tuple with year and season, coldseq is  cold year and season pairs 
        df.append(data.query("Year == %d and Season == '%d'" % item))

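The problem is that every pass of the loop selects from the same coldseq (with the same year/season combinations) and never generates a new coldseq. I need to reset coldseq to empty and build a new one on each iteration of the final for loop, but I cannot see how to do it. I have tried embedding the code in the loop in several ways, but it does not seem to work.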

2 Answers:

Answer 0: (score: 0)

You can create a second dataframe from index and then sample from it.

df_index = pd.DataFrame(index)
coldseq = df_index.sample(5)

# df is the list from the question; append the matching rows of data for each sampled pair (or something similar)
coldseq.apply(lambda x: df.append(data.query("Year == {0} and Season == '{1}'".format(x[0], x[1]))), axis = 1)
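Note that `df_index.sample(5)` draws five pairs from the whole list, so it does not guarantee one pair per season. If one pair per season is wanted, a grouped sample followed by a merge is one possible variation. This is a sketch under assumptions: it relies on `DataFrameGroupBy.sample` (pandas 1.1 or later), and the toy `data` and `index`, the column names given to `df_index`, and the names `picks` and `selected` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `data` and `index` from the question.
dates = pd.date_range('19510101', periods=1500)
data = pd.DataFrame({'date': dates, 'A': np.random.randint(0, 100, len(dates))})
data['Year'] = data['date'].dt.year
data['Season'] = ((data['date'].dt.month % 12) // 3 + 1).astype(str)
index = [(1951, '1'), (1951, '2'), (1952, '4'),
         (1954, '3'), (1953, '1'), (1953, '3')]

df_index = pd.DataFrame(index, columns=['Year', 'Season'])

# One random (Year, Season) pair per season; requires pandas >= 1.1.
picks = df_index.groupby('Season').sample(n=1)

# The inner join keeps only the rows of `data` whose (Year, Season) was picked.
selected = data.merge(picks, on=['Year', 'Season'])
```

Running the last two statements inside a loop yields a fresh set of pairs on each pass, which is the behavior the question asks for.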

Answer 1: (score: 0)

Figured it out: move the sampling loop inside the outer loop, and reset the list and the counter on every pass:

df = []
#number of cold years
for i in range(5): #change number here for number of cold years
    coldseq = [] #start each pass with an empty list so a new random year is drawn per season
    cold_ctr = 0 #reset counter so seasons stay as 1,2,3,4 (1 is winter, 2 spring, 3 summer, 4 autumn)
    for yrlist in coldsample:
        ran_yr = np.random.choice(yrlist, 1) #choose a randomly sampled year for this season
        cold_ctr += 1 # increment cold_ctr variable by 1
        coldseq += [(ran_yr[0], cold_ctr)]
    for item in coldseq: #item is a tuple with year and season, coldseq is all extreme cold year and season pairs
        df.append(data.query("Year == %d and Season == '%d'" % item))
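The same idea can be packaged so that the per-pass state lives inside a function and cannot leak between iterations. The following is a minimal sketch, not the answerer's code: `draw_sequence`, `frames`, `result`, and the toy `coldsample` and `data` are hypothetical, and it uses NumPy's `default_rng` generator in place of `np.random.choice`.

```python
import numpy as np
import pandas as pd

def draw_sequence(coldsample, rng):
    """Return one (year, season) pair per season, drawn fresh on each call."""
    return [(int(rng.choice(years)), season)
            for season, years in enumerate(coldsample, start=1)]

# Toy stand-ins for `coldsample` and `data` from the question.
coldsample = [[1951, 1955], [1951, 1955], [1954, 1956], [1952]]
dates = pd.date_range('19510101', periods=2500)
data = pd.DataFrame({'date': dates, 'A': np.random.randint(0, 100, len(dates))})
data['Year'] = data['date'].dt.year
data['Season'] = ((data['date'].dt.month % 12) // 3 + 1).astype(str)

rng = np.random.default_rng()
frames = []
for _ in range(5):  # number of output years
    for year, season in draw_sequence(coldsample, rng):
        mask = (data['Year'] == year) & (data['Season'] == str(season))
        frames.append(data[mask])

# Stack the selected seasons into a single dataframe.
result = pd.concat(frames, ignore_index=True)
```

Because `draw_sequence` builds its list from scratch on every call, there is no counter or list to reset by hand.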