I am working with a DataFrame of clickstream data, extracting features for each user in the clickstream to be used in a machine learning project.
The DataFrame looks like this:
data = pd.DataFrame({'id': ['A01', 'B01', 'A01', 'C01', 'A01', 'B01', 'A01'],
                     'event': ['search', 'search', 'buy', 'home', 'cancel', 'home', 'search'],
                     'date': ['2018-01-01', '2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-04', '2018-01-06'],
                     'product': ['tablet', 'dvd', 'tablet', 'tablet', 'tablet', 'book', 'book'],
                     'price': [103, 2, 203, 103, 203, 21, 21]})
data['date'] = pd.to_datetime(data['date'])
Since I have to create the features per user, I use groupby/apply with a custom function:
featurized = data.groupby('id').apply(featurize)
The featurizing function takes a chunk of the DataFrame per user and creates many (hundreds of) features. The whole process is too slow, so I am looking for recommendations to do this more efficiently.
An example of the function used to create the features:
def featurize(group):
    features = dict()
    # User id
    features['id'] = group['id'].max()
    # Feature 1: Number of search events
    features['number_of_search_events'] = (group['event'] == 'search').sum()
    # Feature 2: Number of tablets
    features['number_of_tablets'] = (group['product'] == 'tablet').sum()
    # Feature 3: Total time
    features['total_time'] = (group['date'].max() - group['date'].min()) / np.timedelta64(1, 'D')
    # Feature 4: Total number of events
    features['events'] = len(group)
    # Histogram of products examined
    product_counts = group['product'].value_counts()
    # Feature 5: max events for a product
    features['max_product_events'] = product_counts.max()
    # Feature 6: min events for a product
    features['min_product_events'] = product_counts.min()
    # Feature 7: avg events for a product
    features['mean_product_events'] = product_counts.mean()
    # Feature 8: std events for a product
    features['std_product_events'] = product_counts.std()
    # Feature 9: total price for tablet products
    features['tablet_price_sum'] = group.loc[group['product'] == 'tablet', 'price'].sum()
    # Feature 10: max price for tablet products
    features['tablet_price_max'] = group.loc[group['product'] == 'tablet', 'price'].max()
    # Feature 11: min price for tablet products
    features['tablet_price_min'] = group.loc[group['product'] == 'tablet', 'price'].min()
    # Feature 12: mean price for tablet products
    features['tablet_price_mean'] = group.loc[group['product'] == 'tablet', 'price'].mean()
    # Feature 13: std price for tablet products
    features['tablet_price_std'] = group.loc[group['product'] == 'tablet', 'price'].std()
    return pd.Series(features)
One potential problem is that each feature potentially scans the whole chunk, so if I have 100 features, the chunk gets scanned 100 times instead of once.
For example, one feature can be the number of 'tablet' events the user has, another the number of 'home' events, another the average time difference between 'search' events, then the number of 'search' events for 'tablet' products, and so on. Each feature can be coded as a function that takes a chunk (the per-user DataFrame) and creates the feature, but when there are hundreds of features, each one scans the whole chunk even though a single linear scan would suffice. The problem is that the code gets ugly if I loop manually over every record in the chunk and code all the features inside the loop, along the lines of the sketch below.
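For illustration, a minimal sketch of such a hand-rolled single-pass loop (featurize_single_pass is a hypothetical name, and it covers only three of the features):

def featurize_single_pass(group):
    # accumulate every counter during one linear scan of the chunk
    n_search = 0
    n_tablet = 0
    tablet_prices = []
    for row in group.itertuples(index=False):
        if row.event == 'search':
            n_search += 1
        if row.product == 'tablet':
            n_tablet += 1
            tablet_prices.append(row.price)
    return pd.Series({'number_of_search_events': n_search,
                      'number_of_tablets': n_tablet,
                      'tablet_price_sum': sum(tablet_prices)})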
Questions:
If I have to process the DataFrame hundreds of times, is there a way to abstract all of this into a single scan that creates all the needed features?
And would that be a speed improvement over the groupby/apply approach I am currently using?
Answer 0 (score: 3)
Disclaimer: the following answer does not properly answer the question above. I am just leaving it here for the sake of the work that went into it; maybe it will be useful at some point.

A few options:

(1) parallelize the per-group computation across CPU cores (joblib / multiprocessing);
(2) compute expensive selections such as group.loc[group['product'] == 'tablet', 'price'] once per group and re-use them;
(3) persist intermediate data instead of recomputing it (e.g. in an HDFStore);
(4) use caching.

For (1), given the code above, I can produce speedups of up to 43% (i7-7700HQ CPU, 16GB RAM).
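For (3) and (4), a minimal sketch of what this could look like (the file name, the key 'clicks' and the cache directory are illustrative, not from the original code; to_hdf requires the tables package):

import joblib
import pandas as pd

# (3) persist the prepared DataFrame once, reload it cheaply afterwards
data.to_hdf('clickstream.h5', key='clicks')
data = pd.read_hdf('clickstream.h5', key='clicks')

# (4) memoize the expensive per-group featurization on disk
memory = joblib.Memory('cache_dir', verbose=0)
featurize_cached = memory.cache(featurize)
featurized = data.groupby('id').apply(featurize_cached)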
Timings
using joblib: 68.86841534099949s
using multiprocessing: 71.53540843299925s
single-threaded: 119.05010353899888s
Code
import pandas as pd
import numpy as np
import time
import timeit
import os
import joblib
import multiprocessing
def make_data():
    # just some test data ...
    n_users = 100
    events = ['search', 'buy', 'home', 'cancel']
    products = ['tablet', 'dvd', 'book']
    max_price = 1000
    n_duplicates = 1000
    n_rows = 40000
    df = pd.DataFrame({
        'id': list(map(str, np.random.randint(0, n_users, n_rows))),
        'event': list(map(events.__getitem__, np.random.randint(0, len(events), n_rows))),
        'date': list(map(pd.to_datetime, np.random.randint(0, 100000, n_rows))),
        'product': list(map(products.__getitem__, np.random.randint(0, len(products), n_rows))),
        'price': np.random.random(n_rows) * max_price
    })
    df = pd.concat([df for _ in range(n_duplicates)])
    df.to_pickle('big_df.pkl')
    return df
def data():
    return pd.read_pickle('big_df.pkl')
def featurize(group):
    features = dict()
    # Feature 1: Number of search events
    features['number_of_search_events'] = (group['event'] == 'search').sum()
    # Feature 2: Number of tablets
    features['number_of_tablets'] = (group['product'] == 'tablet').sum()
    # Feature 3: Total time
    features['total_time'] = (group['date'].max() - group['date'].min()) / np.timedelta64(1, 'D')
    # Feature 4: Total number of events
    features['events'] = len(group)
    # Histogram of products examined
    product_counts = group['product'].value_counts()
    # Feature 5: max events for a product
    features['max_product_events'] = product_counts.max()
    # Feature 6: min events for a product
    features['min_product_events'] = product_counts.min()
    # Feature 7: avg events for a product
    features['mean_product_events'] = product_counts.mean()
    # Feature 8: std events for a product
    features['std_product_events'] = product_counts.std()
    # Feature 9: total price for tablet products
    features['tablet_price_sum'] = group.loc[group['product'] == 'tablet', 'price'].sum()
    # Feature 10: max price for tablet products
    features['tablet_price_max'] = group.loc[group['product'] == 'tablet', 'price'].max()
    # Feature 11: min price for tablet products
    features['tablet_price_min'] = group.loc[group['product'] == 'tablet', 'price'].min()
    # Feature 12: mean price for tablet products
    features['tablet_price_mean'] = group.loc[group['product'] == 'tablet', 'price'].mean()
    # Feature 13: std price for tablet products
    features['tablet_price_std'] = group.loc[group['product'] == 'tablet', 'price'].std()
    return pd.DataFrame.from_records(features, index=[group['id'].max()])
# https://stackoverflow.com/questions/26187759/parallelize-apply-after-pandas-groupby
def apply_parallel_job(dfGrouped, func):
    retLst = joblib.Parallel(n_jobs=multiprocessing.cpu_count())(
        joblib.delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

def apply_parallel_pool(dfGrouped, func):
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        ret_list = list(p.map(func, [group for name, group in dfGrouped]))
    return pd.concat(ret_list)

featurized_job = lambda df: apply_parallel_job(df.groupby('id'), featurize)
featurized_pol = lambda df: apply_parallel_pool(df.groupby('id'), featurize)
featurized_sng = lambda df: df.groupby('id').apply(featurize)
make_data()
print(timeit.timeit("featurized_job(data())", "from __main__ import featurized_job, data", number=3))
print(timeit.timeit("featurized_sng(data())", "from __main__ import featurized_sng, data", number=3))
print(timeit.timeit("featurized_pol(data())", "from __main__ import featurized_pol, data", number=3))
For (2), consider the following refactoring:
Timings
original: 112.0091859719978s
re-used prices: 83.85681765000118s
Code
    # [...]
    prices_ = group.loc[group['product'] == 'tablet', 'price']
    features['tablet_price_sum'] = prices_.sum()
    # Feature 10: max price for tablet products
    features['tablet_price_max'] = prices_.max()
    # Feature 11: min price for tablet products
    features['tablet_price_min'] = prices_.min()
    # Feature 12: mean price for tablet products
    features['tablet_price_mean'] = prices_.mean()
    # Feature 13: std price for tablet products
    features['tablet_price_std'] = prices_.std()
    # [...]
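The same cached selection can also feed a single Series.agg call that computes all five reductions in one pass, a variation not shown in the original answer (same variable names as above):

    prices_ = group.loc[group['product'] == 'tablet', 'price']
    # one traversal of the selection instead of five separate reductions
    for name, value in prices_.agg(['sum', 'max', 'min', 'mean', 'std']).items():
        features['tablet_price_' + name] = value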
Answer 1 (score: 1)
DataFrame.agg() is your friend here. You are right that the initial approach as implemented traverses the entire dataset once for EACH feature. So what we can do instead is define all the heavy lifting up front and let pandas handle the internal optimization. Generally with these libraries, you can very rarely hand-write something that beats the library internals.
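For a sense of what agg buys you, a minimal illustration on the sample data from the question (not part of this answer's own code):

# one pass computes several statistics per group
data.groupby('id')['price'].agg(['count', 'sum', 'mean', 'max'])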
The beauty of this method is that you do the heavy computation only once; all the fine-tuned feature creation then runs on the pre-aggregated result set, which is much faster.
This reduces the runtime by 65%, which is quite large. Moreover, the next time you want a new statistic, you can simply read it off the result of featurize2 rather than running the computation again.
df = make_data()
# include this to be able to calculate standard deviations correctly:
# the pooled std is recovered below via Var(X) = E[X^2] - (E[X])^2
df['price_sq'] = df['price'] ** 2.

def featurize2(df):
    grouped = df.groupby(['id', 'product', 'event'])
    initial = grouped.agg({'price': ['count', 'max', 'min', 'mean', 'std', 'sum', 'size'],
                           'date': ['max', 'min'],
                           'price_sq': ['sum']}).reset_index()
    return initial

def featurize3(initial):
    # Features 5-8
    features = initial.groupby('product').sum()['price']['count'].agg(['max', 'min', 'mean', 'std']).rename({
        'max': 'max_product_events',
        'min': 'min_product_events',
        'mean': 'mean_product_events',
        'std': 'std_product_events'
    })
    searches = initial[initial['event'] == 'search']['price']
    # Feature 1: Number of search events
    features['number_of_search_events'] = searches['count'].sum()
    tablets = initial[initial['product'] == 'tablet']['price']
    tablets_sq = initial[initial['product'] == 'tablet']['price_sq']
    # Feature 2: Number of tablets
    features['number_of_tablets'] = tablets['count'].sum()
    # Feature 9: total price for tablet products
    features['tablet_price_sum'] = tablets['sum'].sum()
    # Feature 10: max price for tablet products
    features['tablet_price_max'] = tablets['max'].max()
    # Feature 11: min price for tablet products
    features['tablet_price_min'] = tablets['min'].min()
    # Feature 12: mean price for tablet products (count-weighted mean of group means)
    features['tablet_price_mean'] = (tablets['mean'] * tablets['count']).sum() / tablets['count'].sum()
    # Feature 13: std price for tablet products, via Var(X) = E[X^2] - (E[X])^2
    features['tablet_price_std'] = np.sqrt(
        tablets_sq['sum'].sum() / tablets['count'].sum() - features['tablet_price_mean'] ** 2.)
    # Feature 3: Total time
    features['total_time'] = (initial['date']['max'].max()
                              - initial['date']['min'].min()) / np.timedelta64(1, 'D')
    # Feature 4: Total number of events
    features['events'] = initial['price']['count'].sum()
    return features

def new_featurize(df):
    initial = featurize2(df)
    final = featurize3(initial)
    return final

original = featurize(df)
final = new_featurize(df)
print("featurize(df): {}".format(timeit.timeit("featurize(df)",
"from __main__ import featurize, df", number=3)))
print("featurize2(df): {}".format(timeit.timeit("featurize2(df)",
"from __main__ import featurize2, df", number=3)))
print("new_featurize(df): {}".format(timeit.timeit("new_featurize(df)",
"from __main__ import new_featurize, df", number=3)))
for x in final.index:
print("outputs for index {} are equal: {}".format(
x, np.isclose(final[x], original[x])))
Results
featurize(df): 76.0546050072
featurize2(df): 26.5458261967
new_featurize(df): 26.4640090466
outputs for index max_product_events are equal: [ True]
outputs for index min_product_events are equal: [ True]
outputs for index mean_product_events are equal: [ True]
outputs for index std_product_events are equal: [ True]
outputs for index number_of_search_events are equal: [ True]
outputs for index number_of_tablets are equal: [ True]
outputs for index tablet_price_sum are equal: [ True]
outputs for index tablet_price_max are equal: [ True]
outputs for index tablet_price_min are equal: [ True]
outputs for index tablet_price_mean are equal: [ True]
outputs for index tablet_price_std are equal: [ True]
outputs for index total_time are equal: [ True]
outputs for index events are equal: [ True]