Assigning UUIDs for a web scraper in MongoDB and checking for duplicates

Time: 2017-03-02 21:06:21

Tags: python mongodb web-scraping scrapy pymongo

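I'm building a web scraper and trying to assign a UUID to each entity.

Since the same entity may be scraped at different times, I want to store the initial UUID together with the ID extracted from the web page.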

// example document
{
 "ent_eid_type": "ABC-123", 
 "ent_uid_type": "123e4567-aaa-123e456" 
}
The code below runs for each id field found in the scraped item:

# if the current ent_eid_type is a key in mongo...
if db_coll.find({ent_eid_type: ent_eid}).count() > 0:
    # return the uid value
    ent_uid = db_coll.find({ent_uid_type: ent_uid })
else:
    # create a fresh uid
    ent_uid = uuid.uuid4()
    # store it with the current entity eid as key, and uid as value
    db_coll.insert({ent_eid_type: ent_eid, ent_uid_type: ent_uid})

# update the current item with the stored uid for later use
item[ent_uid_type] = ent_uid

The console returns KeyError: <pymongo.cursor.Cursor object at 0x104d41710>. I'm not sure how to get ent_uid out of the cursor.

Any tips/suggestions appreciated!

1 Answer:

Answer 0 (score: 1)

PyMongo's find method returns a Cursor object, not a document. You need to iterate over the cursor or index into it to get the actual document.

Index the first result (you have already checked that one exists) and read its UID field.

Presumably you want to search by EID, using ent_eid rather than ent_uid. There is no reason to search for the UID if you already have it.

ent_uid = db_coll.find({ent_eid_type: ent_eid})[0][ent_uid_type]

Or skip the cursor entirely and use find_one instead, which returns the matching document directly, or None if nothing matches (http://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.find_one)

ent_uid = db_coll.find_one({ent_eid_type: ent_eid})[ent_uid_type]
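
Putting the pieces together, the whole lookup-or-create step can be written around a single find_one call. This is a minimal sketch, not a definitive implementation. The helper name get_or_create_uid is hypothetical, and the field names are taken from the example document in the question. It stores the UUID as a string, which is an assumption made here so the value does not depend on the driver's UUID representation settings. It also uses insert_one, because Cursor.count() and Collection.insert() were removed in PyMongo 4.0. The InMemoryCollection class is only a stand-in so the example runs without a MongoDB server.

```python
import uuid


def get_or_create_uid(coll, eid_field, eid, uid_field):
    """Return the UID stored for `eid`, creating and storing one if absent."""
    # find_one returns the matching document as a dict, or None if no match
    doc = coll.find_one({eid_field: eid})
    if doc is not None:
        return doc[uid_field]
    # Assumption of this sketch: keep the UUID as a plain string
    uid = str(uuid.uuid4())
    coll.insert_one({eid_field: eid, uid_field: uid})
    return uid


class InMemoryCollection:
    """Hypothetical stand-in for a pymongo Collection, for demonstration only."""

    def __init__(self):
        self.docs = []

    def find_one(self, query):
        # Return the first document whose fields equal every query value
        for doc in self.docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        self.docs.append(dict(doc))


coll = InMemoryCollection()
first = get_or_create_uid(coll, "ent_eid_type", "ABC-123", "ent_uid_type")
second = get_or_create_uid(coll, "ent_eid_type", "ABC-123", "ent_uid_type")
```

With a real database, db_coll from the question would be passed in place of the stand-in, and the result assigned to item[ent_uid_type]. Note that the check and the insert are two separate operations, so two scrapers running at the same time could both insert a document for the same EID. A unique index on the EID field is one way to guard against that.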