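How can I sample a pandas DataFrame or a GraphLab SFrame according to a given class/label distribution? For example, I want to sample a data frame that has a label/class column so that each class label is drawn equally often, giving every class label a similar frequency, i.e. a uniform distribution over the class labels. Better still would be to draw the sample according to whatever class distribution we ask for.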
+------+-------+-------+
| col1 | clol2 | class |
+------+-------+-------+
| 4    | 45    | A     |
+------+-------+-------+
| 5    | 66    | B     |
+------+-------+-------+
| 5    | 6     | C     |
+------+-------+-------+
| 4    | 6     | C     |
+------+-------+-------+
| 321  | 1     | A     |
+------+-------+-------+
| 32   | 432   | B     |
+------+-------+-------+
| 5    | 3     | B     |
+------+-------+-------+

Given a huge data frame like the one above and a required frequency distribution like the one below:

+-------+--------------+
| class | nostoextract |
+-------+--------------+
| A     | 2            |
+-------+--------------+
| B     | 2            |
+-------+--------------+
| C     | 2            |
+-------+--------------+
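Rows should be extracted from the first data frame according to the frequency distribution given in the second frame, where the count for each class is given in the nostoextract column. The result is a sampled frame in which each class appears at most 2 times. If a class does not have enough rows to meet the requested count, it should be ignored and sampling should continue. The resulting data frame will be used for a decision-tree-based classifier.
As a commenter put it: the sampled data frame must contain nostoextract distinct instances of the corresponding class, unless there are not enough examples of a given class, in which case all available examples are used.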
Answer 0 (score: 4)
Could you split your first data frame into class-specific sub-frames and then sample from each one as you like?
That is:
dfa = df[df['class']=='A']
dfb = df[df['class']=='B']
dfc = df[df['class']=='C']
....
Then, once you have split/created/filtered dfa, dfb and dfc, take as many rows as you need from the top (if the data frame has no particular sort order):
dfasamplefive = dfa[:5]
Or use the sample method described by the earlier commenter to draw a random sample directly:
dfasamplefive = dfa.sample(n=5)
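If this fits your needs, all that remains is to automate the whole process, reading the number of rows to sample for each class from the control data frame you already have, i.e. the second data frame that holds the required sample counts.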
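The automation step could look roughly like the sketch below. It assumes the control frame has the class and nostoextract columns from the question; the function name sample_by_control and the random_state argument are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Toy data mirroring the question (column names kept as in the question).
df = pd.DataFrame({
    'col1': [4, 5, 5, 4, 321, 32, 5],
    'clol2': [45, 66, 6, 6, 1, 432, 3],
    'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B'],
})
control = pd.DataFrame({'class': ['A', 'B', 'C'],
                        'nostoextract': [2, 2, 2]})


def sample_by_control(df, control, random_state=None):
    """Draw up to `nostoextract` rows for each class listed in `control`.

    Hypothetical helper: a class with fewer rows than requested
    contributes all of its rows; a class with no rows is skipped.
    """
    parts = []
    for cls, wanted in zip(control['class'], control['nostoextract']):
        subset = df[df['class'] == cls]
        n = min(len(subset), int(wanted))
        if n > 0:
            # Sampling is without replacement, so rows are distinct.
            parts.append(subset.sample(n=n, random_state=random_state))
    # Return an empty frame with the same columns if nothing was drawn.
    return pd.concat(parts) if parts else df.iloc[0:0]


sampled = sample_by_control(df, control, random_state=0)
print(sampled)
```

With the toy data, every class has at least two rows, so each class contributes exactly two rows; which two rows of class B are picked depends on the seed.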
Answer 1 (score: 3)
I think this will solve your problem:
import pandas as pd

data = pd.DataFrame({'cols1': [4, 5, 5, 4, 321, 32, 5],
                     'clol2': [45, 66, 6, 6, 1, 432, 3],
                     'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})
freq = pd.DataFrame({'class': ['A', 'B', 'C'],
                     'nostoextract': [2, 2, 2]})

def bootstrap(data, freq):
    freq = freq.set_index('class')

    # This function will be applied on each group of instances of the same
    # class in `data`.
    def sampleClass(classgroup):
        cls = classgroup['class'].iloc[0]
        nDesired = freq.nostoextract[cls]
        nRows = len(classgroup)
        # Never ask for more rows than the class actually has.
        nSamples = min(nRows, nDesired)
        return classgroup.sample(nSamples)

    samples = data.groupby('class').apply(sampleClass)

    # If you want a new index with ascending values
    # samples.index = range(len(samples))

    # If you want an index which is equal to the row in `data` where the sample
    # came from
    samples.index = samples.index.get_level_values(1)

    # If you don't change it then you'll have a multiindex with level 0
    # being the class and level 1 being the row in `data` where
    # the sample came from.

    return samples

print(bootstrap(data, freq))
This prints, for example:
  class  clol2  cols1
0     A     45      4
4     A      1    321
1     B     66      5
5     B    432     32
3     C      6      4
2     C      6      5
If you don't want the result ordered by class, you can permute it at the end.
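One way to do that permutation, as a minimal sketch: DataFrame.sample with frac=1 returns every row once in random order. The samples frame below is a hand-built stand-in for the output of bootstrap, and the seed is an arbitrary choice for reproducibility.

```python
import pandas as pd

# Stand-in for the class-ordered frame returned by `bootstrap`.
samples = pd.DataFrame(
    {'class': ['A', 'A', 'B', 'B', 'C', 'C'],
     'clol2': [45, 1, 66, 432, 6, 6],
     'cols1': [4, 321, 5, 32, 4, 5]},
    index=[0, 4, 1, 5, 3, 2],
)

# frac=1 draws all rows without replacement, i.e. a random permutation.
shuffled = samples.sample(frac=1, random_state=42)

# Optionally discard the original row labels after shuffling.
shuffled_reset = shuffled.reset_index(drop=True)
print(shuffled)
```

The first form keeps the original row labels, so each sampled row can still be traced back to its row in data.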
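Answer 2 (score: 1)
Here is a solution for SFrame. It is not exactly what you want, because it samples points at random, so the result will not necessarily contain exactly the number of rows you specified. An exact method would probably shuffle the data randomly and then take the first k rows for a given class, but this gets you pretty close.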
import random
import graphlab as gl

# Construct data.
sf = gl.SFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                'col2': [45, 66, 6, 6, 1, 432, 3],
                'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})
freq = gl.SFrame({'class': ['A', 'B', 'C'],
                  'number': [3, 1, 0]})

# Count how many instances of each class and compute a sampling
# probability.
grp = sf.groupby('class', gl.aggregate.COUNT)
freq = freq.join(grp, on='class', how='left')
freq['prob'] = freq.apply(lambda x: float(x['number']) / x['Count'])

# Join the sampling probability back to the original data.
sf = sf.join(freq[['class', 'prob']], on='class', how='left')

# Sample the original data, then subset.
sf['sample_mask'] = sf.apply(lambda x: 1 if random.random() <= x['prob']
                             else 0)
sf2 = sf[sf['sample_mask'] == 1]
In my sample run I happened to get exactly the number of samples I specified, but again, this solution does not guarantee that.
>>> sf2
+-------+------+------+
| class | col1 | col2 |
+-------+------+------+
| A | 4 | 45 |
| A | 321 | 1 |
| B | 32 | 432 |
+-------+------+------+
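The exact alternative mentioned at the start of this answer (shuffle the data, then take the first k rows of each class) can be sketched as follows. This sketch uses pandas rather than SFrame, which is a substitution and not the answer's own code; the wanted dictionary and the variable names are illustrative assumptions.

```python
import pandas as pd

data = pd.DataFrame({'col1': [4, 5, 5, 4, 321, 32, 5],
                     'col2': [45, 66, 6, 6, 1, 432, 3],
                     'class': ['A', 'B', 'C', 'C', 'A', 'B', 'B']})

# Requested rows per class (same numbers as the SFrame example above).
wanted = {'A': 3, 'B': 1, 'C': 0}

# 1. Shuffle all rows once.
shuffled = data.sample(frac=1, random_state=0)

# 2. Position of each row within its class after shuffling (0, 1, 2, ...).
rank = shuffled.groupby('class').cumcount()

# 3. Quota per row; classes missing from `wanted` get a quota of 0.
quota = shuffled['class'].map(wanted).fillna(0)

# 4. Keep the first k rows of each class. A class with fewer than k rows
#    simply contributes everything it has.
exact = shuffled[rank < quota]
print(exact)
```

Here class A has only two rows although three were requested, so both are kept; class B contributes exactly one row and class C none. Unlike the probabilistic mask, the per-class counts do not vary between runs; only the choice of rows does.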