I have a spreadsheet of patent numbers, and I'm getting extra data on each patent by scraping Google Patents, the USPTO site, and a few other sources. It's mostly running, but one thing has had me stuck all day. When I go to the USPTO site and grab the page source, it sometimes gives me the whole thing and works perfectly, but other times it only gives me roughly the bottom half (and what I'm looking for is in the top part).
I've searched around here a lot and haven't seen anyone with this problem. Here's the relevant piece of code (it has some redundancy since I've spent all day trying to fix this, but I'm sure that's the least of its problems):
    from bs4 import BeautifulSoup
    import html5lib
    import re
    import csv
    import urllib.request
    import requests

    # This is the base URL for Google Patents
    gpatbase = "https://www.google.com/patents/US"
    ptobase = "http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/"

    # Bring in the patent numbers and define the writer we'll use to add the new info we get
    with open(r'C:\Users\Filepathblahblahblah\Patent Data\scrapeThese.csv', newline='') as csvfile:
        patreader = csv.reader(csvfile)
        writer = csv.writer(csvfile)
        for row in patreader:
            patnum = row[0]
            #print(row)
            print(patnum)
            # Take each patent and append it to the base URL to get the actual one
            gpaturl = gpatbase + patnum
            ptourl = ptobase + patnum
            gpatreq = requests.get(gpaturl)
            gpatsource = gpatreq.text
            soup = BeautifulSoup(gpatsource, "html5lib")
            # Find the number of academic citations on that patent
            # From the Google Patents page, find the link labeled USPTO and extract the url
            for tag in soup.find_all("a"):
                if tag.next_element == "USPTO":
                    uspto_link = tag.get('href')
            #uspto_link = ptourl
            requested = urllib.request.urlopen(uspto_link)
            source = requested.read()
            pto_soup = BeautifulSoup(source, "html5lib")
            print(uspto_link)
            # From the USPTO page, find the examiner's name and save it
            for italics in pto_soup.find_all("i"):
                if italics.next_element == "Primary Examiner:":
                    prim = italics.next_element
                else:
                    prim = "Not found"
            if prim != "Not found":
                examiner = prim.next_element
            else:
                examiner = "Not found"
            print(examiner)
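One way to check whether the USPTO response itself is arriving truncated is to compare the body you actually received against the `Content-Length` header, when the server sends one. This is a diagnostic sketch, not part of the original post: the helper name `looks_truncated` is made up, and a server using chunked transfer encoding may not send `Content-Length` at all.

```python
def looks_truncated(headers, body):
    """Return True if the server declared a Content-Length that is
    larger than the body we actually received."""
    declared = headers.get("Content-Length")
    if declared is None:
        return False  # no declared length, so we can't tell
    return len(body) < int(declared)

# With requests this would be used roughly as:
#   r = requests.get(ptourl)
#   if looks_truncated(r.headers, r.content):
#       ...retry the request...
print(looks_truncated({"Content-Length": "100"}, b"x" * 60))  # True
```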
As of right now, it's about 50-50 whether I get the examiner's name or "Not found", and I can't see anything the patents in either group have in common with each other, so I'm completely out of ideas.
Answer (score: 1)
I still don't know what's causing the problem, but I was able to find a workaround in case anyone runs into something similar. If you write the source code to a text file instead of trying to work with it directly, it won't be cut off. I think the problem happens after the data downloads but before it's imported into the "workspace". Here's the snippet I wrote into the scraper:
    if examiner == "Examiner not found":
        filename = r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.html'
        sys.stdout = open(filename, 'w')
        print(patnum)
        print(pto_soup.prettify())
        sys.stdout = console_out  # console_out was saved from sys.stdout earlier
        # Take that logged code and find the examiner name
        sec = "Not found"
        prim = "Not found"
        scraped_code = open(r'C:\Users\pathblahblahblah\Code and Output\Scraped Source Code\scraper_errors_' + patnum + '.html')
        scrapedsoup = BeautifulSoup(scraped_code.read(), 'html5lib')
        # Find all italics (<i>) tags
        for italics in scrapedsoup.find_all("i"):
            for desc in italics.descendants:
                # Check whether any of them contain the words "Primary Examiner"
                if "Primary Examiner:" in desc:
                    prim = desc.next_element.strip()
                    #print("Primary found: ", prim)
                # Same for "Assistant Examiner"
                if "Assistant Examiner:" in desc:
                    sec = desc.next_element.strip()
                    #print("Assistant found: ", sec)
        # If there is an assistant examiner, set 'examiner' to that name
        # If there is no assistant examiner, use the primary examiner
        if sec != "Not found":
            examiner = sec
        elif prim != "Not found":
            examiner = prim
        else:
            examiner = "Examiner not found"
        # Show new results in the console
        print(examiner)
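The examiner-extraction logic above can also be factored into a small function that's easy to test against a snippet of HTML on its own. This is only a sketch: `find_examiner` is a made-up name, and the markup in `sample` is a minimal stand-in for the real USPTO page, which may be structured differently. It reads the text that follows each `<i>` label via `next_sibling`, mirroring the `next_element` walk above, and applies the same assistant-before-primary fallback order.

```python
from bs4 import BeautifulSoup

def find_examiner(html):
    """Return the assistant examiner if present, else the primary
    examiner, else 'Examiner not found' (same fallback order as above)."""
    prim = sec = None
    soup = BeautifulSoup(html, "html.parser")
    for italics in soup.find_all("i"):
        label = italics.get_text()
        sibling = italics.next_sibling  # text node right after the </i>
        if sibling is None:
            continue
        if "Primary Examiner" in label:
            prim = sibling.strip()
        elif "Assistant Examiner" in label:
            sec = sibling.strip()
    return sec or prim or "Examiner not found"

# Hypothetical stand-in markup with made-up names:
sample = "<i>Primary Examiner:</i> Jane Doe<br><i>Assistant Examiner:</i> John Roe"
print(find_examiner(sample))  # John Roe
```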