I am trying to parse patent data from, for example, this page. The final output should be a CSV file with one row per patent, containing the assignee, family ID, and filing date. I am using BeautifulSoup, and I can retrieve most of the information and write the CSV file just fine.
My problem is that I have noticed the structure of the table changes over time; not all fields are always present. In the example given, for instance, there is no Family ID. So I cannot assign each cell to a fixed variable (as done here). The number of rows/columns also changes depending on how many fields are reported.
I want to write the code flexibly enough that it does the following: if header == "Assignee", grab the text from that cell; otherwise leave it blank. If header == "Family ID", grab the text from that cell; otherwise leave it blank. (A rough sketch of this idea follows the sample output below.)
So that the final output looks something like this:
Assignee, Family ID, filing date
"Potomac Aviation", , "June 11, 2002"
"Anonymous Co", 40432687, "June 5, 2016"
etc.
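In pseudo-terms, something like the following sketch, where extract_fields and WANTED are just illustrative names and the HTML fragment is made up:

from bs4 import BeautifulSoup

# Labels I care about; anything the page does not report should come back blank.
WANTED = ["Assignee:", "Family ID:", "Filed:"]

def extract_fields(table):
    # Map each header cell's text to the text of the <td> that follows it.
    fields = {th.get_text(strip=True): th.find_next("td").get_text(strip=True)
              for th in table.find_all("th")}
    # dict.get with a default turns any missing header into an empty string.
    return [fields.get(label, "") for label in WANTED]

# Made-up fragment: this patent reports no Family ID row.
html = """<table>
<tr><th>Assignee:</th><td>Potomac Aviation</td></tr>
<tr><th>Filed:</th><td>June 11, 2002</td></tr>
</table>"""
print(extract_fields(BeautifulSoup(html, "lxml").table))
# ['Potomac Aviation', '', 'June 11, 2002']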
This code has gotten me the closest, but I am still far from where I want to end up:
fourth_table = table.find_next("table")

header_1 = fourth_table.find('th')
if header_1.get_text() == "Inventors:":
    inventors = fourth_table.find('td').get_text()

header1 = fourth_table.th
header_2 = header1.find_next('th')
cell1 = fourth_table.td
cell2 = cell1.find_next('td')
if header_2.get_text() == "Applicant:":
    applicant = cell2.get_text()  # cell2 is already the <td>, so no inner find('td') is needed
Obviously it is wordy; once I am sure I understand how each bit works, I will try to make the code more efficient.
EDIT: here is an alternative that I think gets me closer. However, while it works for "Assignee", Python returns None for the print(family_id) line. I have double-checked the spelling.
fourth_table = table.find_next("table")
assignee = fourth_table.find(text="Assignee:").find_next('td').get_text().replace("\n", "").strip()
#family_id = fourth_table.find(text="Family ID:").find_next('td').get_text().replace("\n", "").strip()
family_id = fourth_table.find(text="Family ID:")
print(family_id)
Apologies if I am missing something obvious. TIA!
Answer 0 (score: 0)
It is a bit more involved; you really need to find the rows you want using a regular expression, because the text contains newlines, so you cannot just pass text="Family ID:" etc. The following gets all the links from the site and writes the three fields of interest; if a field is not there, the csv gets an empty entry for the missing data:
import csv
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def get_nxt_link(soup):
    # The "next document" arrow image links to the following result, if there is one.
    nxt = soup.select_one("img[src='/netaicon/PTO/nextdoc.gif']")
    if nxt:
        return urljoin(base, nxt.parent["href"])
    return False

start = 'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=PTXT&s1=boston&s2=2005$.PD.&OS=boston%20AND%20ISD/2005&RS=boston%20AND%20ISD/2005'
base = "http://patft.uspto.gov/"

def get_text(soup, txt):
    # For each wanted header pattern, yield the matching row's <td> text, or "" if absent.
    for t in txt:
        _tag = soup.find("th", text=t)
        yield _tag.parent.td.text.replace("\n", "") if _tag else ""

with open("out.csv", "w") as o:
    wr = csv.writer(o)
    soup = BeautifulSoup(requests.get(start).content, "lxml")
    # The fields of interest all live in the table that contains the "Assignee:" header.
    table = soup.find("th", text="Assignee:").find_previous("table")
    nxt = get_nxt_link(soup)
    wr.writerow(["Assignee", "Family ID", "filing date"])
    text = [re.compile("Assignee:"), re.compile("Family ID:"), re.compile("Filed:")]
    wr.writerow(tuple(get_text(table, text)))
    while nxt:
        soup = BeautifulSoup(requests.get(nxt).content, "lxml")
        table = soup.find("th", text="Assignee:").find_previous("table")
        wr.writerow(tuple(get_text(table, text)))
        nxt = get_nxt_link(soup)
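To make the newline point above concrete: an exact string passed as text= must match the node text character for character, while a compiled regex is searched, so it still hits when the label has a trailing line break. A self-contained check on a made-up fragment:

import re
from bs4 import BeautifulSoup

# Made-up fragment with a line break inside the header label, as on the real pages.
soup = BeautifulSoup("<table><tr><th>Family ID:\n</th><td>40432687</td></tr></table>", "lxml")

print(soup.find("th", text="Family ID:"))              # None -- the exact match fails
print(soup.find("th", text=re.compile("Family ID:")))  # finds the <th> despite the newline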
Running it on the first few URLs, we get:
Assignee,Family ID,filing date
"Potomac Aviation Technology Corp. (Boston, MA)",,"June 11, 2002"
"Microsoft Corp. (Redmond, WA)",33563549,"December 31, 1998"
"Teradyne, Inc. (Boston, MA)",32029308,"September 27, 2002"
"First Data Corporation (Greenwood Village, CO)",22696181,"December 5, 2001"
"Micron Technology, Inc. (Boise, ID)",35482789,"July 8, 2003"
"Digital River, Inc. (Eden Prairie, MN)",38226542,"December 11, 2000"
"Oracle International Corporation (Redwood Shores, CA)",35482753,"October 1, 2002"
"The United States of America as represented by the Secretary of the Navy (Washington, DC)",35482734,"October 6, 2003"
Garmin Ltd. (KY),21836040,"November 19, 2004"
"Lucent Technologies Inc. (Murray Hill, NJ)",25439780,"July 30, 2001"
"The Chamberlain Group, Inc. (Elmhurst, IL)",23942291,"October 17, 2001"
"Inplane Photonics, Inc. (South Plainfield, NJ)",32824042,"February 7, 2003"
"Motorola, Inc (Horsham, PA)",22314958,"November 17, 2000"
"Xerox Corporation (Stamford, CT)",35482631,"July 7, 2004"
"General Hospital Corporation (Boston, MA)",35480219,"October 16, 2002"