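For a dataset group_1, I need to iterate over all rows k times for robustness and, for each row, find a matching random sample from another data frame, group_2, according to criteria expressed as data frame columns.
Unfortunately, this is quite slow.
How can I improve the performance?
The bottleneck is the function passed to apply, i.e. randomMatchingCondition.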
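import numpy as np
import pandas as pd
from IPython.display import display  # display() is only predefined inside notebooks
from tqdm import tqdm
tqdm.pandas()  # registers DataFrame.progress_apply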
seed = 47
np.random.seed(seed)
###################################################################
# generate dummy data
size = 10000
df = pd.DataFrame({i: np.random.randint(1,100,size=size) for i in ['metric']})
df['label'] = np.random.randint(0,2, size=size)
df['group_1'] = pd.Series(np.random.randint(1,12, size=size)).astype(object)
df['group_2'] = pd.Series(np.random.randint(1,10, size=size)).astype(object)
group_0 = df[df['label'] == 0]
group_0 = group_0.reset_index(drop=True)
group_0 = group_0.rename(index=str, columns={"metric": "metric_group_0"})
join_columns_enrich = ['group_1', 'group_2']
join_real = ['metric_group_0']
join_real.extend(join_columns_enrich)
group_0 = group_0[join_real]
display(group_0.head())
group_1 = df[df['label'] == 1]
group_1 = group_1.reset_index(drop=True)
display(group_1.head())
###################################################################
# naive find random element matching condition
def randomMatchingCondition(original_element, group_0, join_columns, random_state):
    # build a query such as "group_1 == 3 & group_2 == 7" from the current row
    limits_dict = original_element[join_columns].to_dict()
    query = ' & '.join([f"{col} == {val}" for col, val in limits_dict.items()])
    candidates = group_0.query(query)
    if len(candidates) > 0:
        return candidates.sample(n=1, random_state=random_state)['metric_group_0'].values[0]
    else:
        return np.nan
###################################################################
# iterate over pandas dataframe k times for more robust sampling
k = 3
resulting_df = None
for i in range(1, k+1):
    group_1['metric_group_0'] = group_1.progress_apply(randomMatchingCondition,
                                                       args=[group_0, join_columns_enrich, None],
                                                       axis=1)
    group_1['run'] = i
    if resulting_df is None:
        resulting_df = group_1.copy()
    else:
        resulting_df = pd.concat([resulting_df, group_1])
resulting_df.head()
An experiment with pre-sorting the data:
group_0 = group_0.sort_values(join_columns_enrich)
group_1 = group_1.sort_values(join_columns_enrich)
made no difference.
Answer 0 (score: 0)
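IIUC, you want k random samples for each row (combination of metrics) in your input data frame. So why not use candidates.sample(n=k, ...) and get rid of the for loop? Alternatively, you could concatenate the data frame k times with pd.concat([group1] * k).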
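It depends on your real data, but I would try grouping the input data frame by the metric columns with group1.groupby(join_columns_enrich) (if their cardinality is low enough), and then sampling randomly within these groups, drawing k * len(group.index) random samples for each one. groupby is expensive; OTOH, once it is done, you may save a lot on the iteration and sampling.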
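The groupby idea above could be sketched roughly as follows. This is only an illustration, not code from the answer: the helper name sample_matches_by_group and its parameters are hypothetical, it samples with replacement, and it drops rows that have no match instead of filling them with NaN.

```python
import numpy as np
import pandas as pd


def sample_matches_by_group(targets, pool, keys, value_col, k, seed=None):
    """Hypothetical helper: for every row of `targets`, draw k values of
    `value_col` from the rows of `pool` that share the same `keys`."""
    rng = np.random.default_rng(seed)
    # Group the pool once and keep the candidate values per key combination.
    pool_values = {key: grp[value_col].to_numpy() for key, grp in pool.groupby(keys)}
    pieces = []
    for key, grp in targets.groupby(keys):
        candidates = pool_values.get(key)
        if candidates is None:
            continue  # no matching rows in the pool: these target rows are dropped
        n = len(grp.index)
        # One vectorised draw of k * n samples (with replacement) for the whole group.
        draws = rng.choice(candidates, size=k * n, replace=True)
        # Repeat every target row k times: row 0 k times, then row 1 k times, ...
        block = grp.iloc[np.repeat(np.arange(n), k)].copy()
        block[value_col] = draws
        block["run"] = np.tile(np.arange(1, k + 1), n)
        pieces.append(block)
    if not pieces:
        return targets.iloc[0:0]
    return pd.concat(pieces)
```

With the question's data this would be called as sample_matches_by_group(group_1, group_0, join_columns_enrich, 'metric_group_0', k=3). The random draw then happens once per key combination rather than once per row, which is where the saving is expected to come from.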
Answer 1 (score: 0)
@smiandras, you are right. Getting rid of the for loop is what matters.
Variant 1: multiple samples. Draw k samples with replacement for each row inside the applied function, then unpack the resulting arrays into one row per sample.
Variant 2: all possible samples, optimized through a native JOIN operation. Warning: this is somewhat unsafe, because it can produce a huge number of rows.
The code below implements variant 1:
def randomMatchingCondition(original_element, group_0, join_columns, k, random_state):
    # build a query such as "group_1 == 3 & group_2 == 7" from the current row
    limits_dict = original_element[join_columns].to_dict()
    query = ' & '.join([f"{col} == {val}" for col, val in limits_dict.items()])
    candidates = group_0.query(query)
    if len(candidates) > 0:
        # k draws with replacement, returned as one array per row
        return candidates.sample(n=k, random_state=random_state, replace=True)['metric_group_0'].values
    else:
        return np.nan
###################################################################
# iterate over pandas dataframe k times for more robust sampling
k = 3
resulting_df = None
#######################
# trying to improve performance: sort both dataframes
group_0 = group_0.sort_values(join_columns_enrich)
group_1 = group_1.sort_values(join_columns_enrich)
#######################
group_1['metric_group_0'] = group_1.progress_apply(randomMatchingCondition,
                                                   args=[group_0, join_columns_enrich, k, None],
                                                   axis=1)
print(group_1.isnull().sum())
group_1 = group_1[~group_1.metric_group_0.isnull()]
display(group_1.head())
# one row per sampled value; the index repeats each group_1 row once per sample
s = pd.DataFrame({'metric_group_0': np.concatenate(group_1.metric_group_0.values)},
                 index=group_1.index.repeat(group_1.metric_group_0.str.len()))
s = s.join(group_1.drop(columns='metric_group_0'), how='left')
s['pos_in_array'] = s.groupby(s.index).cumcount()
s.head()
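The JOIN-based variant mentioned above might look roughly like this. It is a sketch under stated assumptions, not the answer author's code: the helper name sample_matches_by_join and the target_row column are hypothetical, the pool is assumed to share no column names with the targets apart from the join keys, and sampling here is without replacement (at most k distinct candidates per row). The merge materialises every (row, candidate) pair, which is exactly why the row count can explode.

```python
import numpy as np
import pandas as pd


def sample_matches_by_join(targets, pool, keys, k, seed=None):
    """Hypothetical helper: join targets to all candidates, then keep up to
    k randomly chosen candidates per target row."""
    left = targets.copy()
    left["target_row"] = np.arange(len(left))  # remember the originating row
    # Inner join: one output row per (target row, matching candidate) pair.
    joined = left.merge(pool, on=keys, how="inner")
    # Shuffle all pairs once, then take the first k pairs of every target row.
    shuffled = joined.sample(frac=1, random_state=seed)
    picked = shuffled.groupby("target_row").head(k)
    picked = picked.sort_values("target_row", kind="stable")
    picked["run"] = picked.groupby("target_row").cumcount() + 1
    return picked.reset_index(drop=True)
```

With the question's data this would be called as sample_matches_by_join(group_1, group_0, join_columns_enrich, k=3). Target rows with fewer than k candidates yield fewer samples, and rows with none disappear from the result.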