Suppose I have the following pandas DataFrame:
userID  dayID  feature0  feature1  feature2  feature3
xy1     0      24        15.3      41        43
xy1     1      5         24        34        40
xy1     2      30        7         8         10
gh3     0      50        4         11        12
gh3     1      49        3         59        11
gh3     2      4         9         12        15
...
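There are many userIDs, and each user has 3 days with 4 features per day. What I want to do is, for each feature, randomly pick one of the days and reduce the matrix accordingly. For example, if feature0 uses day 1, feature1 uses day 0, feature2 uses day 0 and feature3 uses day 2: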
userID  feature0  feature1  feature2  feature3
xy1     5         15.3      41        10
gh3     49        4         11        15
...
And so on.
I came up with the line below. I thought it worked, but it does not:
reduced_features = features.reset_index().groupby('userID').agg(lambda x: np.random.choice(x,1))
It also seems slow. Is there a faster way?
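Answer 0 (score: 1)
Since you have not received any other suggestions, I will give it a try.
See the following code example (the explanations are in the code comments):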
import pandas as pd
import numpy as np
from io import StringIO
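# named `text` rather than `str` so the built-in str type is not shadowed
text = """userID dayID feature0 feature1 feature2 feature3
xy1 0 24 15.3 41 43
xy1 1 5 24.0 34 40
xy1 2 30 7.0 8 10
gh3 0 50 4.0 11 12
gh3 1 49 3.0 59 11
gh3 2 4 9.0 12 15
"""
# raw string so that \s is passed to the regex engine unchanged
df = pd.read_table(StringIO(text), sep=r'\s+')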
def randx(dfg):
    # create a list of row-indices and make sure 0,1,2 are all in so that
    # all dayIDs are covered and the last one is randomly selected from [0,1,2]
    x = [ 0, 1, 2, np.random.randint(3) ]
    # shuffle the list of row-indices
    np.random.shuffle(x)
    # enumerate list-x, with the row-index and the counter aligned with the column-index,
    # to retrieve the actual element in the dataframe. the 2 in enumerate
    # is to skip the first two columns which are 'userID' and 'dayID'
    return pd.Series([ dfg.iat[j,i] for i,j in enumerate(x,2) ])
    ## you can also return the list of result into one column
    # return [ dfg.iat[j,i] for i,j in enumerate(x,2) ]
def feature_name(x):
    return 'feature{}'.format(x)
# if the dataframe has many irrelevant columns,
# keep only the columns required for the calculation.
# if you have 1000+ columns (features) and all of them are required,
# skip the following line; you might instead split the dataframe by slicing,
# e.g. 200 features per calculation, and then merge the results
new_df = df[[ "userID", "dayID", *map(feature_name, [0,1,2,3]) ]]
# do the calculations
d1 = (new_df.groupby('userID')
            .apply(randx)
            # comment out the following .rename() call if randx()
            # returns a list instead of a Series
            .rename(feature_name, axis=1)
)
print(d1)
##
        feature0  feature1  feature2  feature3
userID
gh3          4.0       9.0      59.0      12.0
xy1         24.0       7.0      34.0      10.0
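More ideas:
Before running apply(randx), you can pre-draw lists of random row indices that satisfy the requirement. For example, if every userID has the same number of dayIDs, you can use a list of lists holding these preset row indices. A dictionary of lists would also work.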
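Reminder: if you use a list of lists and L.pop() to hand out the row indices, make sure the number of lists is at least the number of unique userIDs + 1, because GroupBy.apply() calls the function twice on the first group (at least in the pandas version used here).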
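Instead of returning pd.Series() from randx(), return a list directly (see the commented-out return in randx()). In that case all retrieved features are stored in a single column (see below), which you can normalize later.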
userID
gh3    [50, 3.0, 59, 15]
xy1    [30, 7.0, 34, 43]
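If it still runs slowly, you can split the 1000+ columns (features) into groups, e.g. process 200 features per run, slice the predefined row indices accordingly, and then merge the results.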
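As a rough sketch of the "normalize it later" step for the list-valued result: the Series below is built by hand from the two rows printed above rather than produced by apply(), and the names picked, feature_cols and wide are only illustrative.

```python
import pandas as pd

# Hand-built stand-in for the list-valued result shown above
picked = pd.Series(
    [[50, 3.0, 59, 15], [30, 7.0, 34, 43]],
    index=pd.Index(["gh3", "xy1"], name="userID"),
)

feature_cols = ["feature{}".format(i) for i in range(4)]

# Expand each list into one column per feature, keeping userID as the index
wide = pd.DataFrame(picked.tolist(), index=picked.index, columns=feature_cols)
print(wide)
```

Each list must have the same length as feature_cols for this to work.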
Update:
N_users = 100
N_days = 7
N_features = 1000
users = [ 'user{}'.format(i) for i in range(N_users) ]
days = [ 'day{}'.format(i) for i in range(N_days) ]
data = []
for u in users:
    for d in days:
        data.append([ u, d, *np.random.rand(N_features)])
def feature_name(x):
    return 'feature{}'.format(x)
df = pd.DataFrame(data, columns=['userID', 'dayID', *map(feature_name, range(N_features))])
def randx_to_series(dfg):
    x = [ *range(N_days), *np.random.randint(N_days, size=N_features-N_days) ]
    np.random.shuffle(x)
    return pd.Series([ dfg.iat[j,i] for i,j in enumerate(x,2) ])
def randx_to_list(dfg):
    x = [ *range(N_days), *np.random.randint(N_days, size=N_features-N_days) ]
    np.random.shuffle(x)
    return [ dfg.iat[j,i] for i,j in enumerate(x,2) ]
In [133]: %timeit d1 = df.groupby('userID').apply(randx_to_series)
7.82 s +/- 202 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)
In [134]: %timeit d1 = df.groupby('userID').apply(randx_to_list)
7.7 s +/- 47.2 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)
In [135]: %timeit d1 = df.groupby('userID').agg(lambda x: np.random.choice(x,1))
8.18 s +/- 31.1 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)
# new test: calling np.random.choice() w/o using the lambda is much faster
In [xxx]: timeit d1 = df.groupby('userID').agg(np.random.choice)
4.63 s +/- 24.7 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)
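So the speed is similar to your original approach using agg(np.random.choice()), although that approach is theoretically not correct. You may have to define what you consider slow.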
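For comparison, here is a vectorized sketch that avoids groupby().apply() entirely. It is not one of the methods timed above and I have not benchmarked it. It assumes that every user has the same number of days, that all feature columns are numeric, and that NumPy and pandas are recent enough to provide np.random.default_rng and DataFrame.to_numpy (newer than the environment listed at the end of this answer). Unlike randx_to_series(), it draws the day for each (user, feature) pair independently and does not force every day to appear. The names n_users, values, day_idx and reduced are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed only to make the sketch reproducible

n_users, n_days, n_features = 100, 7, 1000
feature_cols = ["feature{}".format(i) for i in range(n_features)]

# Synthetic data with the same layout as in the update: one row per (user, day)
index = pd.MultiIndex.from_product(
    [["user{}".format(i) for i in range(n_users)], range(n_days)],
    names=["userID", "dayID"],
)
df = pd.DataFrame(
    rng.random((n_users * n_days, n_features)), index=index, columns=feature_cols
).reset_index()

# Sort so that each user's days are contiguous, then view the features
# as a 3-D array of shape (users, days, features)
df_sorted = df.sort_values(["userID", "dayID"])
values = df_sorted[feature_cols].to_numpy().reshape(n_users, n_days, n_features)

# Draw one day index per (user, feature) and pick that entry along the day axis
day_idx = rng.integers(n_days, size=(n_users, 1, n_features))
picked = np.take_along_axis(values, day_idx, axis=1)[:, 0, :]

# One userID per block of n_days sorted rows
users = df_sorted["userID"].to_numpy()[::n_days]
reduced = pd.DataFrame(
    picked, index=pd.Index(users, name="userID"), columns=feature_cols
)
```

The reshape is only valid because every user contributes exactly n_days rows; with ragged groups a different approach would be needed.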
More tests with randx_to_series():
with 2000 features, thus total 2002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:
15.8 s +/- 225 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)
with 5000 features, thus total 5002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:
39.3 s +/- 628 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)
with 10000 features, thus 10002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:
1min 21s +/- 1.73 s per loop (mean +/- std. dev. of 7 runs, 1 loop each)
Hope this helps.
Environment: Python 3.6.4, Pandas 0.22.0
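Answer 1 (score: 0)
I admit I got a bit creative with this solution.
I don't think the code you posted does what you explained in the question. However, here is some code that randomizes each feature per userID.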
df.groupby('userID').apply(lambda x: x.apply(lambda x: x.sample(n=1)).ffill().bfill().head(1))
Output:
         userID  dayID  feature0  feature1  feature2  feature3
userID
gh3    3    gh3    1.0      50.0       4.0      59.0      11.0
xy1    0    xy1    2.0       5.0       7.0      41.0      40.0
Note that this may be really slow; a less elegant solution may well be faster.
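If the intent is what the example output in the question suggests, namely that one day is drawn per feature and that same day is used for every user, a minimal sketch could look like this. That reading of the question is my assumption, and the helper pick_shared_days and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "userID": ["xy1", "xy1", "xy1", "gh3", "gh3", "gh3"],
    "dayID": [0, 1, 2, 0, 1, 2],
    "feature0": [24, 5, 30, 50, 49, 4],
    "feature1": [15.3, 24.0, 7.0, 4.0, 3.0, 9.0],
    "feature2": [41, 34, 8, 11, 59, 12],
    "feature3": [43, 40, 10, 12, 11, 15],
})
feature_cols = ["feature0", "feature1", "feature2", "feature3"]

def pick_shared_days(frame, day_for_feature):
    """For each feature, keep the rows of its chosen day (same day for all users)."""
    picked = {
        col: frame.loc[frame["dayID"] == day].set_index("userID")[col]
        for col, day in day_for_feature.items()
    }
    # The per-feature Series are aligned on userID when combined
    return pd.DataFrame(picked)

# Random choice: one day per feature, shared by all users
rng = np.random.default_rng()
random_days = dict(
    zip(feature_cols, rng.choice(df["dayID"].unique(), size=len(feature_cols)))
)
reduced = pick_shared_days(df, random_days)

# Fixed choice of days, matching the example in the question
example = pick_shared_days(
    df, {"feature0": 1, "feature1": 0, "feature2": 0, "feature3": 2}
)
print(example)
```

This only loops over the features, not over the users, and it assumes each user has exactly one row per dayID.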