Parsing repeated sub-root elements in XML with BeautifulSoup in Python

Asked: 2017-11-18 05:35:51

Tags: python xml parsing beautifulsoup

I've run into a problem with an XML file I'm parsing:

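import csv
from bs4 import BeautifulSoup

# xml_string holds the raw XML text of one patent document
soup = BeautifulSoup(xml_string, "lxml")
pub_ref = soup.findAll("publication-reference")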

with open('./output.csv', 'a', newline='') as f:  # text mode, as the csv module expects on Python 3
    writer = csv.writer(f, dialect = 'excel')

    for info in pub_ref:  
        assign = soup.findAll("assignee")
        pat_cite = soup.findAll("patcit")

        for item1 in assign:
            if item1.find("orgname"):
                org_name = item1.find("orgname").text

        for item2 in pat_cite:
            if item2.find("name"):
                name = item2.find("name").text


        for inv_name, pat_num, cpc_num, class_num, subclass_num, date_num, country, city, state in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("section"), soup.findAll("class"), soup.findAll("subclass"), soup.findAll("date"), soup.findAll("country"), soup.findAll("city"), soup.findAll("state")):

            writer.writerow([inv_name.text, pat_num.text, org_name, cpc_num.text, class_num.text, subclass_num.text, date_num.text, country.text, city.text, state.text, name])

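I've limited this to only a few elements (as the final writerow call shows), but I now have about 10 parent elements with 30+ child elements that I need to parse, so explicitly listing them all like this won't really work. On top of that, some elements repeat in the data: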

<us-references-cited>
<us-citation>
<patcit num="00001">
<document-id>
<country>US</country>
<doc-number>1589850</doc-number>
<kind>A</kind>
<name>Haskell</name>
<date>19260600</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
<patcit num="00002">
<document-id>
<country>US</country>
<doc-number>D134414</doc-number>
<kind>S</kind>
<name>Orme, Jr.</name>
<date>19421100</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>

I'd like to be able to parse the repeated sub-roots (e.g. patcit) into my CSV file like this:

invention name  country   city  .... patcit name1  patcit date1....
              white space            patcit name2  patcit date2....
              white space            patcit name3  patcit date3....

And so on, because each invention has more than one citation or reference, while most of its other fields have only a single value.

1 Answer:

Answer 0: (score: 1)

Try the following script. I think this is what you're after.

from bs4 import BeautifulSoup

xml_content='''
<us-references-cited>
<us-citation>
<patcit num="00001">
<document-id>
<country>US</country>
<doc-number>1589850</doc-number>
<kind>A</kind>
<name>Haskell</name>
<date>19260600</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
<patcit num="00002">
<document-id>
<country>US</country>
<doc-number>D134414</doc-number>
<kind>S</kind>
<name>Orme, Jr.</name>
<date>19421100</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
'''
soup = BeautifulSoup(xml_content,"lxml")
# Quote the attribute value: a bare value starting with a digit is not a valid CSS identifier
for item in soup.select('patcit[num^="000"]'):
    name = item.select("name")[0].text
    date = item.select("date")[0].text
    kind = item.select("kind")[0].text
    doc_number = item.select("doc-number")[0].text
    country = item.select("country")[0].text
    print(name,date,kind,doc_number,country)

Result:

Haskell 19260600 A 1589850 US
Orme, Jr. 19421100 S D134414 US
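
To get from the printed output above to the CSV layout asked for in the question, one option is to emit one row per patcit and fill the invention-level cells only on the first row. The following is a minimal sketch, not a definitive implementation. It assumes a hypothetical wrapper document with a single invention-title (the sample title and the us-patent-grant wrapper are made up for illustration), and it uses Python's built-in "html.parser" so that it runs without lxml installed. The helper name citation_rows is also just illustrative.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample: one invention with two citations (title is made up).
SAMPLE_XML = """
<us-patent-grant>
<invention-title>Example widget</invention-title>
<us-references-cited>
<us-citation>
<patcit num="00001">
<document-id>
<country>US</country>
<doc-number>1589850</doc-number>
<kind>A</kind>
<name>Haskell</name>
<date>19260600</date>
</document-id>
</patcit>
</us-citation>
<us-citation>
<patcit num="00002">
<document-id>
<country>US</country>
<doc-number>D134414</doc-number>
<kind>S</kind>
<name>Orme, Jr.</name>
<date>19421100</date>
</document-id>
</patcit>
</us-citation>
</us-references-cited>
</us-patent-grant>
"""


def text_of(tag):
    """Return the stripped text of a tag, or '' if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else ""


def citation_rows(xml_text):
    """Build one row per citation; invention cells appear on the first row only."""
    soup = BeautifulSoup(xml_text, "html.parser")
    invention_cells = [text_of(soup.find("invention-title"))]
    blank_cells = [""] * len(invention_cells)

    rows = []
    for index, patcit in enumerate(soup.find_all("patcit")):
        citation_cells = [text_of(patcit.find("name")), text_of(patcit.find("date"))]
        lead = invention_cells if index == 0 else blank_cells
        rows.append(lead + citation_cells)

    # Keep the invention even when it cites nothing.
    if not rows:
        rows.append(invention_cells + ["", ""])
    return rows


# Written to an in-memory buffer here; a real script would use
# open("output.csv", "a", newline="") instead.
buffer = io.StringIO()
writer = csv.writer(buffer, dialect="excel")
writer.writerow(["invention name", "patcit name", "patcit date"])
writer.writerows(citation_rows(SAMPLE_XML))
print(buffer.getvalue())
```

Further invention-level columns (country, city and so on) would go into invention_cells, and further citation columns into citation_cells. The blank padding adjusts to the length of invention_cells.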

This solution works for the link you provided later:

import requests
from bs4 import BeautifulSoup

res = requests.get("https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2017/")
soup = BeautifulSoup(res.text,"lxml")
table = soup.select("table")[1]
for items in table.select("tr"):
    data = ' '.join([item.text for item in items.select("td")])
    print(data)
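
Separately, the question notes that spelling out 30+ child elements one by one does not scale. A possible way around that, sketched here under the assumption that each field of interest is a direct child of a known parent such as document-id, is to collect every child tag generically instead of naming it. The helper name child_fields is hypothetical, and "html.parser" is used only so that the sketch runs without lxml.

```python
from bs4 import BeautifulSoup

# Same citation structure as in the question, trimmed to two entries.
SAMPLE_XML = """
<us-references-cited>
<us-citation>
<patcit num="00001">
<document-id>
<country>US</country>
<doc-number>1589850</doc-number>
<kind>A</kind>
<name>Haskell</name>
<date>19260600</date>
</document-id>
</patcit>
</us-citation>
<us-citation>
<patcit num="00002">
<document-id>
<country>US</country>
<doc-number>D134414</doc-number>
<kind>S</kind>
<name>Orme, Jr.</name>
<date>19421100</date>
</document-id>
</patcit>
</us-citation>
</us-references-cited>
"""


def child_fields(parent):
    """Map each direct child tag name to its stripped text.

    If a tag name repeats under the same parent, the last one wins.
    """
    return {
        child.name: child.get_text(strip=True)
        for child in parent.find_all(True, recursive=False)
    }


soup = BeautifulSoup(SAMPLE_XML, "html.parser")

records = []
for patcit in soup.find_all("patcit"):
    doc_id = patcit.find("document-id")
    if doc_id is not None:
        records.append(child_fields(doc_id))

for record in records:
    print(record)
```

Each record is a plain dict, so the rows could be handed to csv.DictWriter with whatever column list is needed, rather than unpacking every field by hand.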