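Please see the material below for background on the problem I'm running into. I need my Python parser script to parse the file so that every "orgAction" and "orgActionDate" in the MongoDB document has a corresponding "orgName", "orgExpertise", and "orgType". ET's findall() method does find all the element text I need, but the XML file (which I can't change) is written in a way that makes findall() produce uneven results. Take a look at the last two documents I've posted to see the difference between what I'm parsing now and what I want to parse.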
import xml.etree.ElementTree as ET
from pymongo import MongoClient
import pymongo.errors

client = MongoClient('localhost:27017')
db = client.orgsdb

with open('globalOrgs.xml', 'r', encoding='utf8') as globalOrgsData:
    # MONGO COLLECTION: 'globalOrgs'
    data = {
        'orgName': [],
        'orgExpertise': [],
        'orgType': [],
        'orgAction': [],
        'orgActionDate': [],
        'subDivName': [],
        'subDivAction': [],
        'subDivActionDate': []
    }

    globalOrgsRead = globalOrgsData.read()
    root = ET.fromstring(globalOrgsRead)

    # Each findall() below walks the whole tree independently, so the
    # resulting lists are not aligned with one another.
    for r in root.findall('./document1/orgs/item/name'):
        data['orgName'].append(r.text)
    for r in root.findall('./document1/orgs/item/expertise'):
        data['orgExpertise'].append(r.text)
    for r in root.findall('./document1/orgs/item/type'):
        data['orgType'].append(r.text)
    for r in root.findall('./document1/orgs/item/actions/item/name'):
        data['orgAction'].append(r.text)
    for r in root.findall('./document1/orgs/item/actions/item/date'):
        data['orgActionDate'].append(r.text)
    for r in root.findall('./document1/orgs/item/subDivisions/item/name'):
        data['subDivName'].append(r.text)
    for r in root.findall('./document1/orgs/item/subDivisions/item/actions/item/name'):
        data['subDivAction'].append(r.text)
    for r in root.findall('./document1/orgs/item/subDivisions/item/actions/item/date'):
        data['subDivActionDate'].append(r.text)

    try:
        # Upsert the collected lists as a single document.
        db.globalOrgs.update_one(
            data,
            {'$set': {
                'orgName': data['orgName'],
                'orgExpertise': data['orgExpertise'],
                'orgType': data['orgType'],
                'orgAction': data['orgAction'],
                'orgActionDate': data['orgActionDate'],
                'subDivName': data['subDivName'],
                'subDivAction': data['subDivAction'],
                'subDivActionDate': data['subDivActionDate']
            }},
            upsert=True
        )
    except pymongo.errors.ConnectionFailure as e:
        print(e)
And the XML file:
<?xml version="1.0" encoding="UTF-8"?>
<globalOrgs>
  <document1>
    <orgs>
      <item>
        <name>Amnesty International</name>
        <expertise>Human Rights</expertise>
        <type>NonProfit</type>
        <actions>
          <item>
            <name>Issued statement on Syria</name>
            <date>2017-11-16</date>
          </item>
        </actions>
        <subDivisions>
          <item>
            <name>Europe and Eurasia Division</name>
            <actions>
              <item>
                <name>Speech at The Hague</name>
                <date>2017-06-27</date>
              </item>
            </actions>
          </item>
        </subDivisions>
      </item>
      <item>
        <name>Goldman Sachs</name>
        <expertise>Finance</expertise>
        <type>Profit</type>
        <subDivisions/>
        <actions>
          <item>
            <name>2017 Q4 Shareholder Meeting</name>
            <date>2017-11-15</date>
          </item>
          <item>
            <name>Investor Summit</name>
            <date>2017-11-13T17:01:15Z</date>
          </item>
        </actions>
      </item>
    </orgs>
The output in my Mongo collection "globalOrgs":
{
    "_id" : ObjectId("5b33b4ad6a32f62792924c36"),
    "orgAction" : [
        "Issued statement on Syria",
        "2017 Q4 Shareholder Meeting",
        "Investor Summit"
    ],
    "orgActionDate" : [
        "2017-11-16T16:40:54Z",
        "2017-11-15T21:50:16Z",
        "2017-11-13T17:01:15Z"
    ],
    "orgExpertise" : [
        "Human Rights",
        "Finance"
    ],
    "orgName" : [
        "Amnesty International",
        "Goldman Sachs"
    ],
    "orgType" : [
        "NonProfit",
        "Profit"
    ],
    "subDivAction" : [
        "Speech at The Hague"
    ],
    "subDivActionDate" : [
        "2017-06-27T16:45:35Z"
    ],
    "subDivName" : [
        "Europe and Eurasia Division"
    ]
}
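To make the mismatch concrete, here is a minimal sketch of why the lists come out uneven. It assumes a trimmed-down inline XML string (SAMPLE_XML is hypothetical and only mimics the org/action nesting of my real file, with <orgs> as the root); the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample: two orgs, three org-level actions.
SAMPLE_XML = """<orgs>
  <item>
    <name>Amnesty International</name>
    <actions>
      <item><name>Issued statement on Syria</name></item>
    </actions>
  </item>
  <item>
    <name>Goldman Sachs</name>
    <actions>
      <item><name>2017 Q4 Shareholder Meeting</name></item>
      <item><name>Investor Summit</name></item>
    </actions>
  </item>
</orgs>"""

root = ET.fromstring(SAMPLE_XML)

# Two independent tree walks: one name per org, one name per action.
org_names = [e.text for e in root.findall('./item/name')]
org_actions = [e.text for e in root.findall('./item/actions/item/name')]

# The lengths differ because nothing ties an action back to its org.
print(len(org_names), len(org_actions))
```

Each findall() call flattens its matches across the whole tree, so the link between an action and the org that contains it is lost.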
Finally, here is what I would like the output to look like:
{
    "_id" : ObjectId("5b33b4ad6a32f62792924c36"),
    "orgAction" : [
        "Issued statement on Syria",
        "2017 Q4 Shareholder Meeting",
        "Investor Summit"
    ],
    "orgActionDate" : [
        "2017-11-16T16:40:54Z",
        "2017-11-15T21:50:16Z",
        "2017-11-13T17:01:15Z"
    ],
    "orgExpertise" : [
        "Human Rights",
        "Finance",
        "Finance"
    ],
    "orgName" : [
        "Amnesty International",
        "Goldman Sachs",
        "Goldman Sachs"
    ],
    "orgType" : [
        "NonProfit",
        "Profit",
        "Profit"
    ],
    "subDivAction" : [
        "Speech at The Hague"
    ],
    "subDivActionDate" : [
        "2017-06-27T16:45:35Z"
    ],
    "subDivName" : [
        "Europe and Eurasia Division"
    ]
}
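Note the additional entries in "orgExpertise", "orgName", and "orgType", which match the number of entries associated with "orgAction" and "orgActionDate". Also note that, for our purposes here, the "subDiv" fields don't need to be modified, since their entries already correspond to one another.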
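For reference, the alignment I'm describing could be sketched roughly as follows. This is only an assumption about one possible approach, not a tested solution against my real file: it loops over each org <item> first and then over that org's own actions, repeating the org fields once per action. SAMPLE_XML and aligned_org_fields are hypothetical names, and the sample omits the subDivision contents since those fields are already aligned.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline sample mirroring the structure of globalOrgs.xml.
SAMPLE_XML = """<globalOrgs>
  <document1>
    <orgs>
      <item>
        <name>Amnesty International</name>
        <expertise>Human Rights</expertise>
        <type>NonProfit</type>
        <actions>
          <item><name>Issued statement on Syria</name><date>2017-11-16</date></item>
        </actions>
      </item>
      <item>
        <name>Goldman Sachs</name>
        <expertise>Finance</expertise>
        <type>Profit</type>
        <subDivisions/>
        <actions>
          <item><name>2017 Q4 Shareholder Meeting</name><date>2017-11-15</date></item>
          <item><name>Investor Summit</name><date>2017-11-13T17:01:15Z</date></item>
        </actions>
      </item>
    </orgs>
  </document1>
</globalOrgs>"""


def aligned_org_fields(root):
    """Repeat each org's name/expertise/type once per org-level action."""
    keys = ('orgName', 'orgExpertise', 'orgType', 'orgAction', 'orgActionDate')
    data = {key: [] for key in keys}
    for org in root.findall('./document1/orgs/item'):
        # './actions/item' only matches the org's direct <actions> child,
        # so actions nested under <subDivisions> are not picked up here.
        for action in org.findall('./actions/item'):
            data['orgName'].append(org.findtext('name'))
            data['orgExpertise'].append(org.findtext('expertise'))
            data['orgType'].append(org.findtext('type'))
            data['orgAction'].append(action.findtext('name'))
            data['orgActionDate'].append(action.findtext('date'))
    return data


result = aligned_org_fields(ET.fromstring(SAMPLE_XML))
print(result['orgName'])
```

One caveat with this sketch: an org that has no org-level actions contributes no entries at all, which may or may not be what is wanted.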