Question

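I have a very big CSV file (tens of gigabytes) containing web logs with the following columns: user_id, time_stamp, category_clicked. I have to build a scorer to identify which categories users like and dislike. Note that I have more than 10 million users.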

I first cut it into chunks and store them in an HDFStore named input.h5, then I group by user_id.

Here is my data: about 200 million rows, 10 million unique user ids.

user id | timestamp | category_clicked
20140512081646222000004-927168801|20140722|7
20140512081714121000004-383009763|20140727|4
201405011348508050000041009490586|20140728|1
20140512081646222000004-927168801|20140724|1
20140501135024818000004-1623130763|20140728|3

Here is my pandas.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: AMD64 Family 21 Model 2 Stepping 0, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: fr_FR

pandas: 0.13.1
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.0.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2013.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: None
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: None
sqlalchemy: 0.9.4
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None

Here is what I want as an output:

For each user_id, a list [0.1,0.45,0.89,1.45,5.12,0.,0.,0.45,0.12,2.36,7.8] representing the user's score for each category plus a global score. I can't tell you more about the score, but it needs ALL the timestamps and category_clicked values of a user to be calculated. You can't sum up partial results later or anything like that.

Here is my code:

from pandas import read_csv, get_store, Series

clean_input_reader = read_csv(work_path + '/input/input.csv', chunksize=500000)
with get_store(work_path + '/input/input.h5') as store:
    # first pass: copy the CSV into one big table, chunk by chunk
    for chunk in clean_input_reader:
        store.append('clean_input', chunk,
                     data_columns=['user_id', 'timestamp', 'category_clicked'],
                     min_itemsize=15)

    # second pass: one query per user id
    groups = store.select_column('clean_input', 'user_id').unique()
    for user in groups:
        group_user = store.select('clean_input', where=['user_id==%s' % user])
        # <<<<TREATMENT returns a list user_cat_score>>>>
        store.append(user, Series(user_cat_score))

My problem is the following: it looks to me as if the line group_user=store.select('clean_input',where=['user_id==%s' %user]) is too heavy in time complexity, since I have a very large number of groups, and I am sure there is a lot of redundant sorting inside store.select if I call it 10 million times.

To give you an estimate, I need 250 seconds to process 1000 keys with this technique, instead of only 1 second with a usual groupby on a CSV file read fully into memory with read_csv, without chunking.

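********** UPDATE ***********

After applying Jeff's hashing method, I can process 1000 keys in about 1 second (the same timing as the fully in-memory method), and RAM usage is definitely reduced. The only time penalty I did not have before is of course the time spent chunking, saving the 100 hash groups, and extracting the real groups from the hash groups in the store. But this does not take more than a few minutes.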

Answer 1

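Here's a solution for scaling this problem arbitrarily. It is in effect a high-density version of an earlier, related question.

Define a function that hashes a particular group value to a smaller number of groups. I would design it so that it divides your dataset into pieces that are manageable in memory.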

def sub_group_hash(x):
    # x is a dataframe with the 'user id' field given above
    # return the last 2 characters of the input
    # if these are number like, then you will be sub-grouping into 100 sub-groups
    return x['user id'].str[-2:]

Using the data provided above, this creates a grouped frame on the input data like so:

In [199]: [ (grp, grouped) for grp, grouped in df.groupby(sub_group_hash(df)) ][0][1]
Out[199]: 
                             user id  timestamp  category
0  20140512081646222000004-927168801   20140722         7
3  20140512081646222000004-927168801   20140724         1

with grp as the name of the group, and grouped as the resulting frame.
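As a quick check of the grouping step, here is a minimal sketch that rebuilds the five sample rows from the question in a DataFrame and buckets them by the last two characters of the id. The `buckets` dict is only an illustration; the column names follow the output shown above.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "user id": [
        "20140512081646222000004-927168801",
        "20140512081714121000004-383009763",
        "201405011348508050000041009490586",
        "20140512081646222000004-927168801",
        "20140501135024818000004-1623130763",
    ],
    "timestamp": [20140722, 20140727, 20140728, 20140724, 20140728],
    "category": [7, 4, 1, 1, 3],
})


def sub_group_hash(x):
    # Last two characters of each id, used as the bucket label.
    return x["user id"].str[-2:]


# Group on the computed labels; every row of a given user lands in one bucket.
buckets = {grp: grouped for grp, grouped in df.groupby(sub_group_hash(df))}

for grp, grouped in sorted(buckets.items()):
    print(grp, list(grouped.index))
```

Because the label is derived from the id alone, a user can never be split across two buckets, which is what makes the later per-bucket processing safe.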

from pandas import read_csv, get_store

# read in the input in a chunked way
clean_input_reader = read_csv('input.csv', chunksize=500000)
with get_store('output.h5') as store:
    for chunk in clean_input_reader:

        # create a grouper for each chunk using the sub_group_hash;
        # pass the computed labels, since a bare function given to groupby
        # is applied to the index values rather than to the frame
        g = chunk.groupby(sub_group_hash(chunk))

        # append each of the subgroups to a separate group in the resulting hdf file
        # this will be a loop around the sub_groups (100 max in this case)
        for grp, grouped in g:

            # data_columns must match the column names in the CSV header
            store.append('group_%s' % grp, grouped,
                         data_columns=['user_id','timestamp','category_clicked'],
                         min_itemsize=15)

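Now you have an HDF file with 100 sub-groups (potentially fewer if not all groups were represented), each of which contains all of the data necessary for performing your operation.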

with get_store('output.h5') as store:

    # all of the groups are now the keys of the store
    for grp in store.keys():

        # this is a complete group that will fit in memory
        grouped = store.select(grp)

        # perform the operation on grouped and write the new output
        grouped.groupby(......).apply(your_cool_function)
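To make the second pass concrete, here is a minimal sketch of what could happen to one sub-group once it is in memory. The scoring function is a hypothetical placeholder (the question does not disclose the real score), the tiny `bucket` frame and its ids are made up, and the column names are assumed to match the question's.

```python
import pandas as pd


def placeholder_score(user_rows, n_categories=10):
    """Hypothetical stand-in for the real score: share of clicks per category."""
    counts = user_rows["category_clicked"].value_counts()
    total = float(len(user_rows))
    return [counts.get(cat, 0) / total for cat in range(1, n_categories + 1)]


# Made-up contents of one sub-group; both ids end in "01".
bucket = pd.DataFrame({
    "user_id": ["a01", "b01", "a01"],
    "timestamp": [20140722, 20140727, 20140724],
    "category_clicked": [7, 4, 1],
})

# Each user's rows are complete inside the bucket, so the score sees all of them.
scores = {user: placeholder_score(rows) for user, rows in bucket.groupby("user_id")}
```

The point is only that the per-user function receives every timestamp and category for that user at once, which is the constraint stated in the question.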

So this reduces the problem by a factor of 100 in this case. If that is not sufficient, simply widen sub_group_hash so that it produces more groups.
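One possible way to get more groups, sketched here as an assumption rather than as part of the original answer, is to replace the "last two characters" rule with a checksum taken modulo a chosen bucket count. The name `sub_group_key` and the `n_buckets` parameter are hypothetical.

```python
import zlib


def sub_group_key(user_id, n_buckets=1000):
    """Map a user id string to one of n_buckets stable bucket labels."""
    checksum = zlib.crc32(user_id.encode("utf-8"))
    return "group_%04d" % (checksum % n_buckets)


# The same id always maps to the same label, so a user never spans two buckets.
print(sub_group_key("20140512081646222000004-927168801"))
```

A checksum also spreads ids more evenly than trailing characters when the tail of the id is not uniformly distributed.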

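You should strive for a smaller number, because HDF5 works better that way (e.g. don't make 10M sub-groups, which defeats the purpose; 100, 1000, or even 10k is OK). But I think 100 should probably work for you, unless you have a very wild group density (e.g. a massive number of rows in a single group and very few in the others).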
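Whether the group density is "wild" can be estimated on a sample before writing anything to disk. The following is a small sketch, assuming the sample ids fit in memory; `bucket_sizes` and `width` are hypothetical names.

```python
from collections import Counter


def bucket_sizes(user_ids, width=2):
    """Count how many rows fall into each bucket (last `width` characters)."""
    return Counter(uid[-width:] for uid in user_ids)


# Ids of the five sample rows from the question.
sample_ids = [
    "20140512081646222000004-927168801",
    "20140512081714121000004-383009763",
    "201405011348508050000041009490586",
    "20140512081646222000004-927168801",
    "20140501135024818000004-1623130763",
]

sizes = bucket_sizes(sample_ids)
largest = max(sizes.values())
smallest = min(sizes.values())
print(sizes, largest, smallest)
```

A large gap between the biggest and the smallest bucket on a representative sample would suggest using more buckets or a different hash.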

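Note that the problem then scales easily; you could store the sub-groups in separate files if you want, and/or work on them separately (in parallel) if necessary.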
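The separate-files variant could look roughly like the sketch below. It uses plain CSV files and a thread pool from the standard library instead of HDF5, purely to stay self-contained; `partition_rows` and `process_file` are hypothetical helpers, and `process_file` only collects clicks where the real scoring would go.

```python
import csv
import os
import tempfile
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, out_dir, width=2):
    """Write each (user_id, timestamp, category) row to its bucket's file."""
    handles, writers = {}, {}
    try:
        for user_id, timestamp, category in rows:
            bucket = user_id[-width:]
            if bucket not in writers:
                path = os.path.join(out_dir, "group_%s.csv" % bucket)
                handles[bucket] = open(path, "a", newline="")
                writers[bucket] = csv.writer(handles[bucket])
            writers[bucket].writerow([user_id, timestamp, category])
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(os.path.join(out_dir, "group_%s.csv" % b) for b in handles)


def process_file(path):
    """Placeholder for the scoring step: gather every click of each user."""
    clicks = defaultdict(list)
    with open(path, newline="") as handle:
        for user_id, timestamp, category in csv.reader(handle):
            clicks[user_id].append((timestamp, category))
    return dict(clicks)


rows = [
    ("20140512081646222000004-927168801", "20140722", "7"),
    ("20140512081714121000004-383009763", "20140727", "4"),
    ("201405011348508050000041009490586", "20140728", "1"),
    ("20140512081646222000004-927168801", "20140724", "1"),
    ("20140501135024818000004-1623130763", "20140728", "3"),
]

out_dir = tempfile.mkdtemp()
paths = partition_rows(rows, out_dir)

# Each file is independent, so the files can be handed to separate workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_file, paths))
```

Threads are used here only to keep the sketch simple; for a CPU-bound score, separate processes or machines would be the more likely choice.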

This should make your solution time approximately O(number_of_sub_groups).

Trouble with groupby on millions of keys on a chunked file in python pandas