Question

I have been stuck on this problem for a while without finding a solution. Here is a snippet of my Python script:

pub_ref = soup.findAll("publication-reference") 

with open('./output.csv', 'ab+') as f:
    writer = csv.writer(f, dialect = 'excel')

    for info in pub_ref:  
        pat_cite = soup.findAll("patcit")
        for item in pat_cite:
            if item.find("name"):
                name = item.find("name").text

            writer.writerow([name])

In this part of the script I am parsing the children of the "patcit" sub-element under the parent "publication-reference". "patcit" occurs multiple times in the XML file, as shown below:

. . .
<us-references-cited>
  <us-citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>1589850</doc-number>
        <kind>A</kind>
        <name>Haskell</name>
        <date>19260600</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
  <us-citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D134414</doc-number>
        <kind>S</kind>
        <name>Orme, Jr.</name>
        <date>19421100</date>
      </document-id>
    </patcit>
    <category>cited by applicant</category>
  </us-citation>
  <us-citation>
. . .

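The dots indicate that the file is larger than this excerpt, and the parent element "publication-reference" is not shown. The problem is that my script only picks up one of the many children of patcit, the "name" element, as you can see. That works for elements that have a single entry per invention, but not for those with multiple entries.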

I would also like to store these in a CSV file, as you can see from the writer, with the multiple patcit references laid out in columns like this:

invention name country city .... patcit name1 patcit date1....
white space                      patcit name2 patcit date2....
white space                      patcit name2 patcit date3....

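The XML files I am using can be found at https://bulkdata.uspto.gov/data/patent/grant/redbook/fulltext/2017/
Any help would be appreciated. I have tried several approaches, and I have a feeling this is a beginner's question.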

Answer 1

First, I downloaded one of the zip files, "ipg170103.zip", and found that it contains multiple XML documents. So I ran (on Linux)

csplit ipg170103.xml '/xml version/' '{*}'

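to split the file into multiple single documents. Using one of those files, "xx995", I managed to see what you are working with. Running "grep" on the file for "country" turned up many instances of the word, so I am guessing that you want the "country" under "publication-reference" (if not, you will have to change the script), and likewise that "invention" comes from "invention-title". I also found multiple instances of "date" under "patcit", and not all of them have a name, so my script omits those. I found too many "city" elements to know which one you want. In any case, I could not work out exactly what you are after, so you may need to adjust this to your exact needs.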
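For readers without csplit (for example on Windows), the splitting step could be approximated in Python. This is only a minimal sketch: the function name `split_concatenated_xml` is made up for illustration, and it assumes, as the csplit pattern above does, that every document in the bulk file begins with its own `<?xml ...?>` declaration. Unlike csplit, it discards anything that precedes the first declaration.

```python
import re


def split_concatenated_xml(text):
    """Split a string holding several XML documents into one string each.

    Assumption: each document starts with its own '<?xml ...?>' declaration.
    """
    # Offsets at which a new XML declaration (and so a new document) begins.
    starts = [m.start() for m in re.finditer(r"<\?xml\s", text)]
    if not starts:
        # No declaration at all: treat non-empty input as a single document.
        return [text] if text.strip() else []
    bounds = starts + [len(text)]
    # Slice from each declaration up to the next one (or the end of the text).
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


# Possible usage (file name taken from the answer above):
#     with open("ipg170103.xml", encoding="utf-8") as fh:
#         documents = split_concatenated_xml(fh.read())
```

Each returned string can then be passed to BeautifulSoup in place of the contents of a split file such as "xx995".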

from bs4 import BeautifulSoup
import csv

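# Read one split file; the with-block closes it once the text is loaded.
with open("xx995", 'r') as xml_file:
    xml = xml_file.read()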
soup = BeautifulSoup(xml, 'lxml')
pat = soup.find("us-patent-grant")

country = pat.find("publication-reference").find("country").text
invention = pat.find("invention-title").text

data = []
pat_cite = pat.findAll("patcit")
for item in pat_cite:
    name = None
    date = None
    if item.find("name"):
        name = item.find("name").text
        # Only get date if name
        if item.find("date"):
            date = item.find("date").text
        data.append((name,date))

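# newline='' keeps the csv module from writing blank lines between rows on Windows.
with open('./output.csv', 'wt', newline='') as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(('invention', 'country', 'patcit name', 'patcit date'))
    for d in data:
        writer.writerow((invention, country, d[0], d[1]))
        # Show invention and country on the first row only; later rows leave them blank.
        invention = None
        country = None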
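The script above handles a single document. As a rough sketch of how the same idea might be extended to many documents, the extraction can be wrapped in a function that is called once per patent. The helper names `text_of`, `patent_rows` and `write_csv` are hypothetical, and the built-in "html.parser" is used here instead of 'lxml' only so the sketch does not depend on lxml being installed; that choice is an assumption, not something the answer tested.

```python
import csv

from bs4 import BeautifulSoup


def text_of(parent, tag):
    """Return the stripped text of the first <tag> under parent, or ''."""
    if parent is None:
        return ""
    node = parent.find(tag)
    return node.get_text(strip=True) if node else ""


def patent_rows(xml_text):
    """Build the CSV rows for one patent document."""
    soup = BeautifulSoup(xml_text, "html.parser")
    pat = soup.find("us-patent-grant")
    if pat is None:
        return []
    country = text_of(pat.find("publication-reference"), "country")
    invention = text_of(pat, "invention-title")
    rows = []
    # Search inside this patent only, so citations are not mixed across patents.
    for cite in pat.find_all("patcit"):
        name = text_of(cite, "name")
        if not name:
            continue  # same choice as the answer: skip citations without a name
        rows.append([invention, country, name, text_of(cite, "date")])
        # Only the first row of each patent carries invention and country.
        invention = country = ""
    return rows


def write_csv(documents, out):
    """Write a header plus the rows of every document to an open text file."""
    writer = csv.writer(out, dialect="excel")
    writer.writerow(["invention", "country", "patcit name", "patcit date"])
    for xml_text in documents:
        writer.writerows(patent_rows(xml_text))


# Possible usage, given a list of XML strings called documents:
#     with open("output.csv", "w", newline="") as f:
#         write_csv(documents, f)
```

Keeping the per-patent logic in its own function means the blanking of the invention and country columns restarts for every patent, which matches the layout asked for in the question.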

Output:
