Question

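Context: I am generating synthetic data in Python, which I then store as a shelve object.

Beyond that, I build my scoring model using individual scores PER category, frequent itemset mining, and collaborative filtering. In the latter two, I need to be able to scan the categories of other, similar users. This is why I chose a dict data structure, for faster access. If you see a better data structure for this use case, please point it out.

My main thought is that once this prototype is finished, it will run against real users, so np.random.choice will no longer consume the roughly 98% of the time it consumes now. Apart from that, how else can I make this faster?

I have also mentioned the range of the number of items in the lists, to give you the context that # users >> # epoch times / footprints per user.

The data structure is as follows: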

    {
        'U1': {
            u'Vegetarian/Vegan': [1401572571,
            7.850284461542393e-06],
            u'Dumplings': [1402051698,
            1.408092980963553e-05],
            u'Gluten-free': [1400490069,
            2.096967508708229e-06],
            u'Lawyer': [1406549628,
            0.0033942020684690575],
            u'Falafel': [1409365870,
            0.10524975778067258]
        },
        'U0': {
            u'GasStation/Garage': [1394649440,
            1.1279761215136923e-09],
            u'MusicFestival': [1398401459,
            1.0948922055743449e-07],
            u'Chinese': [1408116633,
            0.015294426605167652]
}}

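The float you see after each epoch time is that user's score in that category. I write it back after the score calculation (the scoring code is not shown in this post).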
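To make the "scan other users with similar categories" access pattern concrete, here is a minimal sketch of an inverted index from category to users, built on top of the nested dict shown above. The function names `build_category_index` and `users_sharing_categories`, and the user `U2` in the demo data, are hypothetical and only for illustration; this is one possible layout, not necessarily the best one for this use case.

```python
from collections import defaultdict


def build_category_index(user_data):
    """Map each category to the set of user ids that have it."""
    index = defaultdict(set)
    for user_id, categories in user_data.items():
        for category in categories:
            index[category].add(user_id)
    return index


def users_sharing_categories(user_data, index, user_id):
    """Return the other users that share at least one category with user_id."""
    others = set()
    for category in user_data[user_id]:
        others |= index[category]
    others.discard(user_id)
    return others


# Demo data in the same shape as above: {user: {category: [epoch_time, score]}}.
# U2 is made up for this example.
demo = {
    "U0": {"Chinese": [1408116633, 0.015294426605167652]},
    "U1": {"Falafel": [1409365870, 0.10524975778067258],
           "Dumplings": [1402051698, 1.408092980963553e-05]},
    "U2": {"Chinese": [1400000000, 0.5], "Falafel": [1400000001, 0.25]},
}
category_index = build_category_index(demo)
similar_to_u2 = users_sharing_categories(demo, category_index, "U2")
```

With an index like this, finding candidate neighbours for one user touches only the users that share a category, instead of scanning every user's dict.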

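More about the data: I have a primary key named after the user (U0, U1, etc.) and a secondary key called "category", here 'Vegetarian/Vegan'. Each of these secondary keys has a list of 1 or more items. Hence, I need to draw 2 random numbers (without replacement, within the range of the low and high indices). These items are in turn epoch times. Conceptually, this says that user U1 interacted with Vegetarian/Vegan at multiple epoch times, which I store as a list as the value of the category key.

Say it is u'Vegetarian/Vegan': [1401572571]; then for each category I compute the score and write it back to the same shelve object after the synthetic data has been generated. Here is a trimmed-down version of the code.

Problem: I noticed that on a dataset of 5000 users, it takes over 6 hours to create the shelve object. What am I doing wrong? I need to be able to scale this to about 50,000 users or more. I also did some preliminary line and memory profiling, and I am attaching the profiling results for a set of 5 users.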

import json,math,codecs,pickle
import numpy as np
from collections import defaultdict
import shelve
from contextlib import closing


global low,high,no_categories,low_epoch_time,high_epoch_time,epoch_time_range,no_users
basepath="/home/ekta/LatLongPrototype/FINALDUMP/"
low,high=6,15
no_categories=xrange(low,high+1)
low_epoch_time,high_epoch_time=1393200055,1409400055
epoch_time_range=xrange(low_epoch_time,high_epoch_time+1)
no_users=5000
global users
users=[]
global shelf_filehandle
shelf_filehandle=basepath+"shelf_contents"




def Synthetic_data_shelve(path, list_cat,list_epoch_time):
    for j in xrange(len(list_cat)):
        if not list_cat[j] in path.keys():
            path[list_cat[j]] = [list_epoch_time[j]]
        else  :
            path[list_cat[j]] = path[list_cat[j]]+[list_epoch_time[j]]
    return path

def shelving():
    dict_user = shelve.open(shelf_filehandle)
    for i in xrange(no_users):
        each_footprint=int(np.random.choice(no_categories, 1,replace=False))
        list_cat=np.random.choice(sub_categories,each_footprint,replace=True)
        list_epoch_time=np.random.choice(epoch_time_range,each_footprint,replace=False)
        path =dict_user.get("U"+str(i), dict(defaultdict(dict)))
        path=Synthetic_data_shelve(path, list_cat,list_epoch_time)
        dict_user["U"+str(i)] = path
    dict_user.close()


#To test this quickly consider, categories as, 
sub_categories=["C"+str(i) for i in xrange(50)] 
shelving()
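As a side note on `Synthetic_data_shelve`: in Python 2, `list_cat[j] in path.keys()` builds a list of keys on every check, and `path[...] + [...]` copies the list on every append. A minimal sketch of the same grouping using `dict.setdefault` follows; the name `group_epoch_times` is hypothetical, and the sketch assumes the two inputs have equal length, as they do in the code above.

```python
def group_epoch_times(path, list_cat, list_epoch_time):
    """Append each epoch time to the list stored under its category key."""
    for category, epoch_time in zip(list_cat, list_epoch_time):
        # setdefault creates the list on first sight of a category,
        # then append mutates it in place instead of copying.
        path.setdefault(category, []).append(epoch_time)
    return path


grouped = group_epoch_times({}, ["C1", "C2", "C1"], [1401572571, 1402051698, 1400490069])
```

According to the line profile in this post, this function is not the bottleneck, so the change is about readability more than speed.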

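What I have tried so far:

Profiling the program:

Below are the line_profiling results. I see that list_epoch_time=np.random.choice(epoch_time_range,each_footprint,replace=False) takes up 99.8% of the time!

I could superficially try binding it as choice=np.random.choice, but that does not give a noticeably lower % time.

As mentioned earlier, the results below are for no_users = 5.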

ekta@superwomen:~$ kernprof.py -l -v  LatLong_shelving.py
Wrote profile results to LatLong_shelving.py.lprof
Timer unit: 1e-06 s

File: LatLong_shelving.py
Function: Synthetic_data_shelve at line 22
Total time: 0.000213 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    22                                           @profile
    23                                           def Synthetic_data_shelve(path, list_cat,list_epoch_time):
    24        46           49      1.1     23.0      for j in xrange(len(list_cat)):
    25        41           88      2.1     41.3          if not list_cat[j] in path.keys():
    26        19           28      1.5     13.1              path[list_cat[j]] = [list_epoch_time[j]]
    27                                                   else  :
    28        22           44      2.0     20.7              path[list_cat[j]] = path[list_cat[j]]+[list_epoch_time[j]]
    29         5            4      0.8      1.9      return path

File: LatLong_shelving.py
Function: shelving at line 31
Total time: 32.008 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    31                                           @profile
    32                                           def shelving():
    33         1         4020   4020.0      0.0      dict_user = shelve.open(shelf_filehandle)
    34         6           13      2.2      0.0      for i in xrange(no_users):
    35         5          541    108.2      0.0          each_footprint=int(np.random.choice(no_categories, 1,replace=False))
    36         5          226     45.2      0.0          list_cat=np.random.choice(sub_categories,each_footprint,replace=True)
    37         5     31942152 6388430.4     99.8          list_epoch_time=np.random.choice(epoch_time_range,each_footprint,replace=False)
    38         5         1074    214.8      0.0          path =dict_user.get("U"+str(i), dict(defaultdict(dict)))
    39         5          360     72.0      0.0          path=Synthetic_data_shelve(path, list_cat,list_epoch_time)
    40         5         3302    660.4      0.0          dict_user["U"+str(i)] = path
    41         1        56352  56352.0      0.2      dict_user.close()
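The hot line in the profile above draws a handful of values from a range of about 16.2 million epoch times. With `replace=False`, the legacy `np.random.choice` works on the whole population (it converts it to an array and permutes it) even when only 6 to 15 values are needed, which is a likely explanation for the roughly 6 seconds per call. A minimal sketch of an alternative that avoids materialising the range uses the standard library's `random.sample`. The function name `sample_epoch_times` is hypothetical, and the code is written for Python 3; it has not been timed against the code in this post.

```python
import random

LOW_EPOCH_TIME, HIGH_EPOCH_TIME = 1393200055, 1409400055


def sample_epoch_times(k, low=LOW_EPOCH_TIME, high=HIGH_EPOCH_TIME):
    """Draw k distinct epoch times from [low, high] without building the full range."""
    # range() is a lazy sequence, so random.sample only touches the k picked items.
    return random.sample(range(low, high + 1), k)


epoch_times = sample_epoch_times(10)
```

In Python 2, `xrange` would take the place of `range` here. The result is a plain list of Python ints rather than a NumPy array.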

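And here are the memory profiling results.

How can we reduce the calls to "Synthetic_data_shelve"? Would it be faster if the whole logic of checking whether list_cat[j] is in path.keys() were moved into "shelving"? I evidently cannot do reduce(Synthetic_data_shelve, path), because path is a dictionary and reducing over a dictionary is not allowed. Also, the i, j loops in "Synthetic_data_shelve" and "shelving" would be good candidates for a reduce, since they are independent properties PER user. How can I take advantage of this fact, and more?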

ekta@superwomen:~$ python -m memory_profiler LatLong_shelving.py

Line #    Mem usage    Increment   Line Contents
================================================
    23     15.2 MiB      0.0 MiB   @profile
    24                             def Synthetic_data_shelve(path, list_cat,list_epoch_time):
    25     15.2 MiB      0.0 MiB       for j in xrange(len(list_cat)):
    26     15.2 MiB      0.0 MiB           if not list_cat[j] in path.keys():
    27     15.2 MiB      0.0 MiB               path[list_cat[j]] = [list_epoch_time[j]]
    28                                     else  :
    29     15.2 MiB      0.0 MiB               path[list_cat[j]] = path[list_cat[j]]+[list_epoch_time[j]]
    30     15.2 MiB      0.0 MiB       return path


Filename: LatLong_shelving.py

Line #    Mem usage    Increment   Line Contents
================================================
    32     14.5 MiB      0.0 MiB   @profile
    33                             def shelving():
    34     14.6 MiB      0.1 MiB       dict_user = shelve.open(shelf_filehandle)
    35     15.2 MiB      0.7 MiB       for i in xrange(no_users):
    36     15.2 MiB      0.0 MiB           each_footprint=int(np.random.choice(no_categories, 1,replace=False))
    37     15.2 MiB      0.0 MiB           list_cat=np.random.choice(sub_categories,each_footprint,replace=True)
    38     15.2 MiB      0.0 MiB           list_epoch_time=choice(epoch_time_range,each_footprint,replace=False)
    39     15.2 MiB      0.0 MiB           path =dict_user.get("U"+str(i), dict(defaultdict(dict)))
    40     15.2 MiB      0.0 MiB           path=Synthetic_data_shelve(path, list_cat,list_epoch_time)
    41     15.2 MiB      0.0 MiB           dict_user["U"+str(i)] = path
    42     15.2 MiB      0.0 MiB       dict_user.close()
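For the shelf writes themselves, which the profiles above show to be a small share of the time, here is a minimal sketch that uses `contextlib.closing` (imported but unused in the code above) and an explicit pickle protocol. The function names `write_users` and `read_user`, the demo data, and the temporary path are hypothetical. In Python 2 the default shelve protocol is the older text-based one, so passing `pickle.HIGHEST_PROTOCOL` may give smaller and faster pickles; that has not been measured here.

```python
import os
import pickle
import shelve
import tempfile
from contextlib import closing


def write_users(shelf_path, users):
    """Store each user's {category: [epoch times]} dict under its user key."""
    # closing() guarantees the shelf is closed even if a write raises.
    with closing(shelve.open(shelf_path, protocol=pickle.HIGHEST_PROTOCOL)) as shelf:
        for user_id, categories in users.items():
            shelf[user_id] = categories


def read_user(shelf_path, user_id):
    """Load one user's dict back from the shelf."""
    with closing(shelve.open(shelf_path)) as shelf:
        return shelf[user_id]


# Hypothetical demo data written to a temporary location, not the path from the post.
demo_path = os.path.join(tempfile.mkdtemp(), "shelf_contents")
demo_users = {
    "U0": {"C1": [1394649440]},
    "U1": {"C2": [1401572571, 1402051698]},
}
write_users(demo_path, demo_users)
loaded = read_user(demo_path, "U1")
```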

Related: python populate a shelve object/dictionary with multiple keys

Optimizing reads and writes to a shelve object to scale to larger datasets

0 answers: