Loading big data into a list with Python

Date: 2019-07-02 16:23:47

Tags: python bigdata

I am running reproduction code from a paper. The Yahoo dataset has 699,640,226 lines. When I run the code, I get this error:

```
2nd pass training: 359000000
2nd pass training: 360000000
2nd pass training: 361000000
Traceback (most recent call last):
  File "/usit/abel/u1/cnphuong/.local/opt/nomad/Scripts/convert.py", line 80, in <module>
    train_values.append(float(tokens[2]))
MemoryError
```

I ran it on servers with 32 GB and 60 GB of RAM, but got the same error both times. This is the code:
> 
```python
# now parse the data
train_user_indices = list()
train_item_indices = list()
train_values = list()
for index, line in enumerate(open(train_filename)):
    if index % 1000000 == 0:
        print "2nd pass training:", index
    tokens = line.split(" ")
    train_user_indices.append(user_indexer[tokens[0]])
    train_item_indices.append(item_indexer[tokens[1]])
    train_values.append(float(tokens[2]))
```
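One reason this runs out of memory is that a Python list stores each float as a separate boxed object (tens of bytes per entry), which adds up fast over 700 million rows. A minimal Python 3 sketch of a lower-memory variant using the stdlib `array` module, which stores raw numeric values instead of objects; the sample input and field layout here are assumptions mirroring the question's `user item rating` format:

```python
from array import array
import io

# Typed arrays store raw machine values: 4 bytes per 32-bit float
# versus roughly 8 bytes of pointer plus a ~24-byte object per list entry.
train_user_indices = array("l")  # signed long per user index
train_item_indices = array("l")
train_values = array("f")        # 32-bit float per rating

# Stand-in for the ~11 GB file; real code would use open(train_filename).
sample = io.StringIO("0 5 3.5\n1 2 4.0\n")
for line in sample:
    tokens = line.split(" ")
    train_user_indices.append(int(tokens[0]))
    train_item_indices.append(int(tokens[1]))
    train_values.append(float(tokens[2]))

print(len(train_values))  # 2
```

This keeps the same three-parallel-arrays layout as the original code, just in a denser representation; whether it fits in 60 GB still depends on what else the script holds.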

Please tell me the best way to load all of this data into lists, since the paper's author was able to run it on this file (~11 GB, 699,640,226 lines).

1 Answer:

Answer 0: (score: 0)

If you are using TensorFlow, there are built-in tools for this, so you can train directly from files on disk without loading the whole dataset into RAM. See the documentation.
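The core idea behind such input pipelines is to stream fixed-size batches from the file so that only one batch lives in memory at a time. A framework-agnostic sketch of that streaming pattern with a plain Python generator (this is an illustration of the idea, not TensorFlow's actual API; the batch size and sample data are assumptions):

```python
import io

def batched_triples(fileobj, batch_size):
    """Yield (users, items, values) batches, holding only one batch in RAM."""
    users, items, values = [], [], []
    for line in fileobj:
        tokens = line.split(" ")
        users.append(tokens[0])
        items.append(tokens[1])
        values.append(float(tokens[2]))
        if len(values) == batch_size:
            yield users, items, values
            users, items, values = [], [], []
    if values:  # flush the final partial batch
        yield users, items, values

# Stand-in for the real training file.
sample = io.StringIO("u0 i5 3.5\nu1 i2 4.0\nu2 i9 1.0\n")
batches = list(batched_triples(sample, batch_size=2))
print(len(batches))  # 2: one full batch and one partial batch
```

A training loop would consume the generator directly instead of materializing `batches`, so peak memory is bounded by `batch_size` regardless of the file's 699 million lines.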