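How to create new string output beyond ET's findall() method

Time: 2018-07-08 15:26:56

Tags: xml python-3.x mongodb pymongo elementtree

See the material below for background on the problem I'm having. I need my Python parser script to parse the XML so that every "orgAction" and "orgActionDate" in the MongoDB document has a corresponding "orgName", "orgExpertise", and "orgType". ET's findall() method does find all the element text I need, but the way the XML file is written (which I cannot change) means that findall() produces lists of uneven length. Compare the last two documents I posted to see the difference between what I parse now and what I want to parse.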

import xml.etree.ElementTree as ET
from pymongo import MongoClient
import pymongo.errors


client = MongoClient('localhost:27017')
db = client.orgsdb


with open('globalOrgs.xml', 'r', encoding='utf8') as globalOrgsData:
    #  MONGO COLLECTION: 'globalOrgs'
    data = {
        'orgName': [],
        'orgExpertise': [],
        'orgType': [],
        'orgAction': [],
        'orgActionDate': [],
        'subDivName': [],
        'subDivAction': [],
        'subDivActionDate': []
    }
    # Read and parse while the file is still open
    globalOrgsRead = globalOrgsData.read()
    root = ET.fromstring(globalOrgsRead)
    # Each findall() walks the whole tree, so every list is filled independently
    for r in root.findall('./document1/orgs/item/name'):
        data['orgName'].append(r.text)
    for r in root.findall('./document1/orgs/item/expertise'):
        data['orgExpertise'].append(r.text)
    for r in root.findall('./document1/orgs/item/type'):
        data['orgType'].append(r.text)
    for r in root.findall('./document1/orgs/item/actions/item/name'):
        data['orgAction'].append(r.text)
    for r in root.findall('./document1/orgs/item/actions/item/date'):
        data['orgActionDate'].append(r.text)
    for r in root.findall('./document1/orgs/item/subDivisions/item/name'):
        data['subDivName'].append(r.text)
    for r in root.findall('./document1/orgs/item/subDivisions/item/actions/item/name'):
        data['subDivAction'].append(r.text)
    for r in root.findall('./document1/orgs/item/subDivisions/item/actions/item/date'):
        data['subDivActionDate'].append(r.text)
    try:
        db.globalOrgs.update_one(data, {'$set': {'orgName': data['orgName'], 'orgExpertise': data['orgExpertise'], 'orgType': data['orgType'], 'orgAction': data['orgAction'], 'orgActionDate': data['orgActionDate'], 'subDivName': data['subDivName'], 'subDivAction': data['subDivAction'], 'subDivActionDate': data['subDivActionDate']}}, upsert=True)
    except pymongo.errors.ConnectionFailure as e:
        print(e)

And the XML file:

<?xml version="1.0" encoding="UTF-8"?>
<globalOrgs>
  <document1>
    <orgs>
      <item>
        <name>Amnesty International</name>
        <expertise>Human Rights</expertise>
        <type>NonProfit</type>
        <actions>
          <item>
            <name>Issued statement on Syria</name>
            <date>2017-11-16</date>
          </item>
        </actions>
        <subDivisions>
          <item>
            <name>Europe and Eurasia Division</name>
            <actions>
              <item>
                <name>Speech at The Hague</name>
                <date>2017-06-27</date>
              </item>
            </actions>
          </item>
        </subDivisions>
      </item>
      <item>
        <name>Goldman Sachs</name>
        <expertise>Finance</expertise>
        <type>Profit</type>
        <subDivisions/>
        <actions>
          <item>
            <name>2017 Q4 Shareholder Meeting</name>
            <date>2017-11-15</date>
          </item>
          <item>
            <name>Investor Summit</name>
            <date>2017-11-13T17:01:15Z</date>
          </item>
        </actions>
      </item>
    </orgs>
  </document1>
</globalOrgs>

The output in my mongo collection "globalOrgs":

{
    "_id" : ObjectId("5b33b4ad6a32f62792924c36"),
    "orgAction" : [
        "Issued statement on Syria",
        "2017 Q4 Shareholder Meeting",
        "Investor Summit"
    ],
    "orgActionDate" : [
        "2017-11-16T16:40:54Z",
        "2017-11-15T21:50:16Z",
        "2017-11-13T17:01:15Z"
    ],
    "orgExpertise" : [
        "Human Rights",
        "Finance"
    ],
    "orgName" : [
        "Amnesty International",
        "Goldman Sachs"
    ],
    "orgType" : [
        "NonProfit",
        "Profit"
    ],
    "subDivAction" : [
        "Speech at The Hague"
    ],
    "subDivActionDate" : [
         "2017-06-27T16:45:35Z"
    ],
    "subDivName" : [
        "Europe and Eurasia Division"
    ]
}

Finally, this is what I want the output to look like:

{
    "_id" : ObjectId("5b33b4ad6a32f62792924c36"),
    "orgAction" : [
        "Issued statement on Syria",
        "2017 Q4 Shareholder Meeting",
        "Investor Summit"
    ],
    "orgActionDate" : [
        "2017-11-16T16:40:54Z",
        "2017-11-15T21:50:16Z",
        "2017-11-13T17:01:15Z"
    ],
    "orgExpertise" : [
        "Human Rights",
        "Finance",
        "Finance"
    ],
    "orgName" : [
        "Amnesty International",
        "Goldman Sachs",
        "Goldman Sachs"
    ],
    "orgType" : [
        "NonProfit",
        "Profit",
        "Profit"
    ],
    "subDivAction" : [
        "Speech at The Hague"
    ],
    "subDivActionDate" : [
         "2017-06-27T16:45:35Z"
    ],
    "subDivName" : [
        "Europe and Eurasia Division"
    ]
}

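Note the additional entries in "orgExpertise", "orgName", and "orgType", which bring them up to the number of entries in "orgAction" and "orgActionDate". Also note that, for our purposes here, the "subDiv" fields need no modification, because their entries already correspond to one another.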
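
One possible way to get lists of matching length is to loop over each org `item` first and then over that org's own actions, appending the org's name, expertise, and type once per action. The following is only a minimal sketch under stated assumptions: the helper name `collect_org_fields` and the inline `XML` string are hypothetical, the XML is a trimmed copy of the sample above (closed so that it parses), and an org with no actions would contribute no entries at all, which may or may not be what is wanted.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample XML from the question (assumption: closing
# tags added so the snippet parses on its own).
XML = """<globalOrgs>
  <document1>
    <orgs>
      <item>
        <name>Amnesty International</name>
        <expertise>Human Rights</expertise>
        <type>NonProfit</type>
        <actions>
          <item>
            <name>Issued statement on Syria</name>
            <date>2017-11-16</date>
          </item>
        </actions>
        <subDivisions>
          <item>
            <name>Europe and Eurasia Division</name>
            <actions>
              <item>
                <name>Speech at The Hague</name>
                <date>2017-06-27</date>
              </item>
            </actions>
          </item>
        </subDivisions>
      </item>
      <item>
        <name>Goldman Sachs</name>
        <expertise>Finance</expertise>
        <type>Profit</type>
        <subDivisions/>
        <actions>
          <item>
            <name>2017 Q4 Shareholder Meeting</name>
            <date>2017-11-15</date>
          </item>
          <item>
            <name>Investor Summit</name>
            <date>2017-11-13T17:01:15Z</date>
          </item>
        </actions>
      </item>
    </orgs>
  </document1>
</globalOrgs>"""


def collect_org_fields(root):
    """Hypothetical helper: repeat org fields once per org-level action."""
    keys = ('orgName', 'orgExpertise', 'orgType', 'orgAction', 'orgActionDate')
    data = {key: [] for key in keys}
    for org in root.findall('./document1/orgs/item'):
        # './actions/item' matches only the org's direct actions, not the
        # actions nested under subDivisions.
        for action in org.findall('./actions/item'):
            data['orgName'].append(org.findtext('name'))
            data['orgExpertise'].append(org.findtext('expertise'))
            data['orgType'].append(org.findtext('type'))
            data['orgAction'].append(action.findtext('name'))
            data['orgActionDate'].append(action.findtext('date'))
    return data


result = collect_org_fields(ET.fromstring(XML))
for key, values in result.items():
    print(key, values)
```

The resulting lists could then be merged into the existing `data` dict before the `update_one` call, leaving the "subDiv" lists to be filled by the original `findall()` loops.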
