Question

I am new to pandas (DataFrames).

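I have about 43,000 documents on disk, and I process them one at a time. While processing each document, I extract its entities and keywords, and I store the document ID and the entity frequencies in a pandas dataframe. The total number of distinct entities will likely exceed one million, so the full dataframe would be roughly 43,000 × 1,000,000 in size. As a result, I cannot hold the entire dataframe in memory.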

I need an efficient way to store the dataframe on disk and to append each row (document ID and entity frequencies) to it one at a time.

My dataframe looks like this:

         India   America   Sam   New York    Las Vegas    Football .............
doc1       3        1       6      1            0             0    ..........
doc2       0        0       0      0            0             2    ..........

What I want to achieve:

for documents in disk:
    read document  (Example: doc1, doc2....)
    find entities and their frequency (Example: India, America , New York, Las Vegas....)
    insert the entities into a dataframe stored in the disk.

In short, I am looking for an efficient storage format that lets me append rows to a dataframe stored on disk.
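To make the loop above concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes the counts are stored in a "long" format (one `doc_id, entity, freq` row per non-zero count) in a CSV file that is opened in append mode, so the mostly-zero wide table is never built in memory. The names `append_entity_counts`, `extract_entities`, and the column names are hypothetical, and `extract_entities` is only a stand-in for the real entity extractor.

```python
import csv
import os
from collections import Counter


def append_entity_counts(csv_path, doc_id, entity_counts):
    """Append one document's non-zero counts as (doc_id, entity, freq) rows."""
    # Write the header only when the file is new or empty.
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["doc_id", "entity", "freq"])
        for entity, freq in entity_counts.items():
            if freq:  # skip zeros, since the wide table would be almost all zeros
                writer.writerow([doc_id, entity, freq])


def extract_entities(text):
    """Hypothetical placeholder: counts whitespace-separated tokens."""
    return Counter(text.split())
```

Usage would be one call per document inside the processing loop, for example `append_entity_counts("entities.csv", "doc1", extract_entities(text))`. The resulting file can later be read back in pieces with `pandas.read_csv(..., chunksize=...)` instead of loading everything at once.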

0 Answers: