So I've run into a problem with an XML file I'm parsing:
import csv
from bs4 import BeautifulSoup

# xml_string holds the raw XML text being parsed
soup = BeautifulSoup(xml_string, "lxml")
pub_ref = soup.findAll("publication-reference")
# Text mode with newline='' is what the csv module expects on Python 3
with open('./output.csv', 'a', newline='') as f:
    writer = csv.writer(f, dialect='excel')
    for info in pub_ref:
        assign = soup.findAll("assignee")
        pat_cite = soup.findAll("patcit")
        # Keeps only the last orgname found
        for item1 in assign:
            if item1.find("orgname"):
                org_name = item1.find("orgname").text
        # Keeps only the last cited name found
        for item2 in pat_cite:
            if item2.find("name"):
                name = item2.find("name").text
        for inv_name, pat_num, cpc_num, class_num, subclass_num, date_num, country, city, state in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("section"), soup.findAll("class"), soup.findAll("subclass"), soup.findAll("date"), soup.findAll("country"), soup.findAll("city"), soup.findAll("state")):
            writer.writerow([inv_name.text, pat_num.text, org_name, cpc_num.text, class_num.text, subclass_num.text, date_num.text, country.text, city.text, state.text, name])
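I've limited this to a few elements (as the .text entries at the end show), but I now have about 10 parent elements with 30+ child elements to parse, so spelling them all out explicitly like this won't really work well. Also, I have repeating elements in the data: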
<us-references-cited>
  <us-citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>1589850</doc-number>
        <kind>A</kind>
        <name>Haskell</name>
        <date>19260600</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
  <us-citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D134414</doc-number>
        <kind>S</kind>
        <name>Orme, Jr.</name>
        <date>19421100</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
  <us-citation>
I'd like to be able to parse the repeating child elements (e.g. patcit) into my CSV file like this:
invention name   country   city   ....   patcit name1   patcit date1 ....
white space                              patcit name2   patcit date2 ....
white space                              patcit name3   patcit date3 ....
and so on, because each invention has more than one citation or reference but only a single value for most of the other fields.
Answer 0 (score: 1)
Try the following script. I think this is what you want.
from bs4 import BeautifulSoup
xml_content='''
<us-references-cited>
<us-citation>
<patcit num="00001">
<document-id>
<country>US</country>
<doc-number>1589850</doc-number>
<kind>A</kind>
<name>Haskell</name>
<date>19260600</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
<patcit num="00002">
<document-id>
<country>US</country>
<doc-number>D134414</doc-number>
<kind>S</kind>
<name>Orme, Jr.</name>
<date>19421100</date>
</document-id>
</patcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
'''
soup = BeautifulSoup(xml_content,"lxml")
# Quote the attribute value: an unquoted value starting with a digit is not a valid CSS identifier
for item in soup.select('patcit[num^="000"]'):
    name = item.select("name")[0].text
    date = item.select("date")[0].text
    kind = item.select("kind")[0].text
    doc_number = item.select("doc-number")[0].text
    country = item.select("country")[0].text
    print(name,date,kind,doc_number,country)
Result:
Haskell 19260600 A 1589850 US
Orme, Jr. 19421100 S D134414 US
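To get from those printed values to the CSV layout asked for in the question (invention-level fields on the first row only, one row per citation), something along these lines might work. This is only a minimal sketch: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so it runs without extra packages, and the cut-down sample record, the "Example title" value, and the citation_rows helper are all made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample record; not the full USPTO schema.
SAMPLE = """
<us-patent-grant>
  <invention-title>Example title</invention-title>
  <us-references-cited>
    <us-citation>
      <patcit num="00001">
        <document-id>
          <name>Haskell</name>
          <date>19260600</date>
        </document-id>
      </patcit>
    </us-citation>
    <us-citation>
      <patcit num="00002">
        <document-id>
          <name>Orme, Jr.</name>
          <date>19421100</date>
        </document-id>
      </patcit>
    </us-citation>
  </us-references-cited>
</us-patent-grant>
"""


def citation_rows(xml_text):
    """Build one row per patcit; only the first row carries the invention title."""
    root = ET.fromstring(xml_text.strip())
    title = root.findtext(".//invention-title", default="")
    rows = []
    for index, patcit in enumerate(root.iter("patcit")):
        # Leave the invention column blank on every row after the first
        lead = title if index == 0 else ""
        rows.append([
            lead,
            patcit.findtext(".//name", default=""),
            patcit.findtext(".//date", default=""),
        ])
    return rows


# Write to an in-memory buffer here; swap in open('output.csv', 'a', newline='') for a real file
buffer = io.StringIO()
writer = csv.writer(buffer, dialect="excel")
writer.writerow(["invention name", "patcit name", "patcit date"])
writer.writerows(citation_rows(SAMPLE))
print(buffer.getvalue())
```

Note that a record with no patcit elements produces no rows at all in this sketch, so you would need to handle that case separately if such records should still appear in the CSV.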
This solution works for the link you provided later:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2017/")
soup = BeautifulSoup(res.text,"lxml")
table = soup.select("table")[1]
for items in table.select("tr"):
    data = ' '.join([item.text for item in items.select("td")])
    print(data)
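As for the 30+ child elements mentioned in the question, one option is to drive the extraction from a list of tag names instead of writing one line per field. The following is a rough sketch under that assumption, again using the standard library's xml.etree.ElementTree rather than BeautifulSoup; the FIELDS list, the extract_fields helper, and the small DOC sample are hypothetical and would need to be adapted to the real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical field list; extend it with the tag names you actually need.
FIELDS = ["country", "doc-number", "kind", "name", "date"]

# Hypothetical sample; the second patcit deliberately has no <kind> element.
DOC = """
<us-references-cited>
  <patcit num="00001">
    <document-id>
      <country>US</country>
      <doc-number>1589850</doc-number>
      <kind>A</kind>
      <name>Haskell</name>
      <date>19260600</date>
    </document-id>
  </patcit>
  <patcit num="00002">
    <document-id>
      <country>US</country>
      <doc-number>D134414</doc-number>
      <name>Orme, Jr.</name>
      <date>19421100</date>
    </document-id>
  </patcit>
</us-references-cited>
"""


def extract_fields(element, fields=FIELDS):
    """Map each tag name to the text of its first descendant, or '' if absent."""
    return {tag: (element.findtext(f".//{tag}") or "").strip() for tag in fields}


root = ET.fromstring(DOC.strip())
records = [extract_fields(patcit) for patcit in root.iter("patcit")]
for record in records:
    print(record)
```

Each record is a plain dict keyed by tag name, so it could be passed to csv.DictWriter with fieldnames=FIELDS, and adding another element to parse only means adding its tag name to the list.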