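I am working on a script that draws a sample from each category in an Excel file. Depending on the length, a different percentage can be taken, but I would like to know whether there is a way to make each sample contain at least 5 items, even when 1% would only return 2 items. Any help would be greatly appreciated.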
import random

# Assumption: the original snippet never defines Number; a random target in 0..100 is used here.
Number = random.randint(0, 100)

for Guesses in range(9):
    print('Take a guess.')
    Guess = int(input())
    if Guess < 0:
        print('Please enter a positive number')
    elif Guess > 100:
        print('The number is only between 0 and 100')
    elif Guess < Number:
        print('Higher...')
    elif Guess > Number:
        print('Lower...')
    else:
        print('Spot on!')
        break  # Guess was correct
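Answer 0: (score: 1)
You can use x.size * 0.01 to check how many values the 1% sample would return. Note that x.size counts every cell (rows times columns), so it equals the row count only for a single-column frame like the one in the example; len(x) gives the row count in general. When the result is below 5, use sample(n=5) instead of sample(frac=0.01):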
.apply(lambda x: x.sample(n=5) if x.size*0.01 < 5 else x.sample(frac=0.01))
import pandas as pd
import random
random.seed(1) # to generate always the same random data
data = {'Category': [random.choice([1,2,2,2,3]) for x in range(1000)]} # columns
df = pd.DataFrame(data)
print(df)
# --- before ---
df1 = df.groupby('Category').apply(lambda x: x.sample(frac=0.01))
print('--- before ---')
print(df1['Category'].value_counts())
# --- after ---
df2 = df.groupby('Category').apply(lambda x: x.sample(n=5) if x.size*.01 < 5 else x.sample(frac=0.01))
print('--- after ---')
print(df2['Category'].value_counts())
Result
--- before ---
2 6
3 2
1 2
Name: Category, dtype: int64
--- after ---
2 6
3 5
1 5
Name: Category, dtype: int64
Edit: a more readable version
def myfunction(x):
    if x.size*0.01 < 5:
        return x.sample(n=5)
    else:
        return x.sample(frac=0.01)

df1 = df.groupby('Category').apply(myfunction)
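
One case the snippets above do not cover: if a category has fewer than 5 rows, x.sample(n=5) raises a ValueError, because sampling without replacement cannot return more rows than the group holds. Below is a minimal sketch of a variant that caps the sample at the group size. The helper name sample_with_floor, its parameters, and the toy data are assumptions made for illustration and are not part of the original answer. It counts rows with len(group) and rounds the fractional size with round(), which is intended to approximate what sample(frac=...) does.

```python
import pandas as pd


def sample_with_floor(group, frac=0.01, min_n=5, random_state=None):
    """Sample about `frac` of the rows, but at least `min_n`, never more than the group has.

    Hypothetical helper, not part of pandas.
    """
    n_rows = len(group)                   # row count, independent of the number of columns
    n_frac = round(n_rows * frac)         # size the fractional sample would have
    n = min(n_rows, max(min_n, n_frac))   # apply the floor, then cap at the group size
    return group.sample(n=n, random_state=random_state)


# Toy data (assumed): a large, a medium, and a very small category.
df = pd.DataFrame({'Category': ['a'] * 1000 + ['b'] * 300 + ['c'] * 3})

# Iterate over the groups directly instead of using groupby().apply(),
# so each group keeps its 'Category' column.
parts = [sample_with_floor(g, random_state=1) for _, g in df.groupby('Category')]
result = pd.concat(parts)

counts = result['Category'].value_counts()
print(counts)
```

With this data, category a gets 10 rows (1% of 1000), b is raised to the floor of 5, and c returns all 3 of its rows instead of failing.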