Sampling groups in pandas

Time: 2014-08-08 12:46:25

Tags: python pandas

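Suppose I want to take a stratified sample from a DataFrame in pandas, so that I get 5% of the rows for each value of a given column. How can I do that?

For example, in the DataFrame below, I would like to sample 5% of the rows associated with each value of column Z. Is there a way to sample groups from a DataFrame that is loaded in memory?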

> df 

   X   Y  Z
   1 123  a
   2  89  b
   1 234  a
   4 893  a
   6 234  b
   2 893  b
   3 200  c
   5 583  c
   2 583  c
   6 100  c
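
To make the example reproducible, here is a minimal sketch that rebuilds the frame shown above. The column dtypes are an assumption read off the printed values (integers for X and Y, strings for Z).

```python
import pandas as pd

# Rebuild the example frame; values copied from the printout above.
df = pd.DataFrame({
    "X": [1, 2, 1, 4, 6, 2, 3, 5, 2, 6],
    "Y": [123, 89, 234, 893, 234, 893, 200, 583, 583, 100],
    "Z": ["a", "b", "a", "a", "b", "b", "c", "c", "c", "c"],
})

# Number of rows per value of Z, the column to stratify on.
print(df["Z"].value_counts().sort_index())
```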

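More generally, what if this DataFrame sits on disk in a huge file (e.g. an 8 GB csv file)? Is there a way to do this sampling without having to load the entire DataFrame into memory?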

1 answer:

Answer 0 (score: 3)

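How about loading only the 'Z' column into memory, using the 'usecols' option? Say the file is sample.csv. If you have a bunch of columns, that should use much less memory. Then, assuming that column fits in memory, I think this will work for you.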

import numpy as np
import pandas as pd

stratfraction = 0.05
# Load only the Z column
df = pd.read_csv('sample.csv', usecols=['Z'])
# Generate the counts per value of Z
gp = df.groupby('Z')
# Get the number of samples per group (rounded up, so every group yields at least one row)
df2 = np.ceil(gp.size() * stratfraction).astype(int)
# Generate the indices of the requested sample (the first entries of each group)
stratsample = []
for key, idx in gp.groups.items():
    FirstFracEntries = idx[:df2.loc[key]]
    stratsample.extend(FirstFracEntries)
stratsample.sort()
# Generate a set of rows to skip, since read_csv doesn't have a "rows to keep" option.
# skiprows counts lines of the file and line 0 is the header, so data row i is line i + 1.
RowsToSkip = set(range(1, len(df) + 1)).difference(i + 1 for i in stratsample)
# Load only the requested rows (no idea how well this works for a really giant list though)
df3 = pd.read_csv('sample.csv', skiprows=RowsToSkip)
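
Note that the code above keeps the first entries of each group, not a random draw. If a random stratified sample is wanted, a variant along these lines might work. It is only a sketch: the helper names `pick_rows_per_group` and `sample_csv_by_group` are made up for illustration, and it assumes the file has a single header line and one record per line. It passes a callable to `skiprows` instead of building a large set of rows to skip.

```python
import numpy as np
import pandas as pd


def pick_rows_per_group(keys, fraction, seed=None):
    """Return sorted row positions: a random ceil(fraction * size) rows of each group."""
    # Hypothetical helper; positions are 0-based and ignore the original index.
    keys = pd.Series(keys).reset_index(drop=True)
    rng = np.random.default_rng(seed)
    picked = []
    for _, positions in keys.groupby(keys).groups.items():
        n = int(np.ceil(len(positions) * fraction))
        chosen = rng.choice(positions.to_numpy(), size=n, replace=False)
        picked.extend(int(p) for p in chosen)
    return sorted(picked)


def sample_csv_by_group(path, column, fraction, seed=None):
    """Two passes over the file: read one column, then read only the chosen rows."""
    keys = pd.read_csv(path, usecols=[column])[column]
    keep = set(pick_rows_per_group(keys, fraction, seed))
    # skiprows is called with file line numbers; line 0 is the header,
    # so data row i lives on line i + 1. Return True to skip a line.
    return pd.read_csv(path, skiprows=lambda line: line > 0 and (line - 1) not in keep)
```

For a frame that is already in memory, the first helper should be enough on its own, e.g. `df.iloc[pick_rows_per_group(df['Z'], 0.05)]`. For the file on disk, `sample_csv_by_group('sample.csv', 'Z', 0.05)` still needs the whole Z column to fit in memory, same as the answer above.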