Question

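I am inserting documents into MongoDB using the PyMongo library in Python. The pandas DataFrame has 37 fields and 60k records (link to the dataset: https://drive.google.com/open?id=119T4uhvHc7CAwJgZRselWXpstAQhkj90). All fields in the DataFrame have been converted to str type. I get the following error: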

OverflowError: MongoDB can only handle up to 8-byte ints

The error persists when I use a for loop to insert the documents in chunks of 2,500.

Code snippet:

import pandas as pd
import pymongo

client = pymongo.MongoClient()
db = client['patenting_in_psi']
collection = db['sample5']

df=pd.read_excel(r"C:\Users\mazin\1-601.xlsx")

collection.insert_many((df.to_dict('records')))

Answer 1

Before converting the DataFrame to a dictionary, some fields with missing data need to be normalized.

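The values of DWPIAccessionNumber must be normalized. Record number 2524, for example, holds an integer with the value 20100000000000001078890512051682672902079220850980264522702989250781417512482524046970942628331980756243236345447307144055181790035144112662138043858072629370129477827567049201927634798584141270252235498775249725404749823022689297835494826055102466304887343437187655164225642338109880434082104977849399115776. That is far larger than a signed 64-bit (8-byte) integer can hold, which would explain the OverflowError. A value this large cannot be stored as bson.int64.Int64 (that type only helps for values that fit in 64 bits), so the convenient option is to store the column as str. Some values in the column are already str (see record number 23) and others are nan.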

df['DWPIAccessionNumber'] = df['DWPIAccessionNumber'].astype(str)
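To find out which records carry such values before deciding how to normalize them, something like the following could be run on the output of df.to_dict('records'). This is a minimal sketch: the helper name find_oversized_ints and the sample records are made up for illustration and are not part of PyMongo or pandas.

```python
# BSON stores integers in at most 8 bytes, i.e. the signed 64-bit range.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def find_oversized_ints(records):
    """Return (record_index, field, value) for every int outside the int64 range.

    `records` is a list of dicts, such as the result of df.to_dict('records').
    This helper is hypothetical and only inspects plain Python ints.
    """
    hits = []
    for index, record in enumerate(records):
        for field, value in record.items():
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, int) and not isinstance(value, bool):
                if not INT64_MIN <= value <= INT64_MAX:
                    hits.append((index, field, value))
    return hits


# Made-up sample data standing in for the real dataset.
sample = [
    {"DWPIAccessionNumber": 2010123456, "Title": "a"},
    {"DWPIAccessionNumber": 2**70, "Title": "b"},
]
oversized = find_oversized_ints(sample)
for index, field, value in oversized:
    print(f"record {index}: field {field!r} holds an int that needs more than 8 bytes")
```

Any field that shows up in the result is a candidate for the astype(str) conversion shown above.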

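The PublicationDate field also needs to be normalized. In record number 24696, for example, its value is missing. You can drop the field, set some date, or fill it with zero.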

df['PublicationDate'] = df['PublicationDate'].fillna(0)

Your data is now ready to be converted to a dictionary and inserted.
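The same normalization can also be applied to the records after the conversion to dictionaries, just before insertion. The sketch below is one possible way to do that, under stated assumptions: sanitize_record and chunked are hypothetical helpers, the sample records are invented, and missing values are replaced with None (stored by MongoDB as null) instead of the 0 fill suggested above. The insert_many call is left as a comment because it needs a live MongoDB connection.

```python
import math

# Signed 64-bit range, the largest integer BSON can encode.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def sanitize_record(record):
    """Return a copy of `record` with NaN floats and oversized ints replaced."""
    clean = {}
    for field, value in record.items():
        if isinstance(value, float) and math.isnan(value):
            # Assumption: a missing value becomes None rather than 0.
            clean[field] = None
        elif (
            isinstance(value, int)
            and not isinstance(value, bool)
            and not INT64_MIN <= value <= INT64_MAX
        ):
            # Too large for 8 bytes: keep the digits, but as text.
            clean[field] = str(value)
        else:
            clean[field] = value
    return clean


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up records standing in for df.to_dict('records').
records = [
    {"DWPIAccessionNumber": 2**70, "PublicationDate": float("nan")},
    {"DWPIAccessionNumber": "2010-A12345", "PublicationDate": "2010-05-01"},
]
cleaned = [sanitize_record(record) for record in records]

# With the `collection` object from the question, insertion in chunks
# of 2,500 would then look like this:
# for chunk in chunked(cleaned, 2500):
#     collection.insert_many(chunk)
```

Chunking only limits the size of each batch. It does not avoid the OverflowError on its own, which matches the behavior described in the question, so the cleaning step has to run first.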
