我一般对python和网络抓取还很陌生。下面的代码可以工作,但是对于它实际通过的信息量来说似乎太慢了。有什么方法可以轻松减少执行时间。我不确定,但看来我输入了很多文字/使其变得比我实际需要的更加困难,我们将不胜感激。
当前代码从站点地图开始,然后遍历其他站点地图列表。在新的站点地图中,它提取数据信息以构造网页的json数据的url。从json数据中,我提取了一个用于搜索字符串的xml链接。如果找到该字符串,则会将其附加到文本文件中。
#global variable
start = 'https://www.govinfo.gov/wssearch/getContentDetail?packageId='
dash = '-'
urlSitemap="https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml"
old_xml=requests.get(urlSitemap)
print (old_xml)
new_xml= io.BytesIO(old_xml.content).read()
final_xml=BeautifulSoup(new_xml)
linkToBeFound = final_xml.findAll('loc')
for loc in linkToBeFound:
urlPLmap=loc.text
old_xmlPLmap=requests.get(urlPLmap)
print(old_xmlPLmap)
new_xmlPLmap= io.BytesIO(old_xmlPLmap.content).read()
final_xmlPLmap=BeautifulSoup(new_xmlPLmap)
linkToBeFound2 = final_xmlPLmap.findAll('loc')
for pls in linkToBeFound2:
argh = pls.text.find('PLAW')
theWanted = pls.text[argh:]
thisShallWork =eval(requests.get(start + theWanted).text)
print(requests.get(start + theWanted))
dict1 = (thisShallWork['download'])
finaldict = (dict1['modslink'])[2:]
print(finaldict)
url2='https://' + finaldict
try:
old_xml4=requests.get(url2)
print(old_xml4)
new_xml4= io.BytesIO(old_xml4.content).read()
final_xml4=BeautifulSoup(new_xml4)
references = final_xml4.findAll('identifier',{'type': 'Statute citation'})
for sec in references:
if sec.text == "106 Stat. 4845":
Print(dash * 20)
print(sec.text)
Print(dash * 20)
sec313 = open('sec313info.txt','a')
sec313.write("\n")
sec313.write(pls.text + '\n')
sec313.close()
except:
print('error at: ' + url2)
答案 0 :(得分:0)
不知道为什么我花了这么长时间,但是我做到了。您的代码确实很难浏览。因此,我首先将其分为两部分,从站点地图中获取链接,然后再获取其他内容。我也将一些内容分解为单独的功能。 这正在我的机器上每秒检查大约2个url,这似乎是正确的。 效果如何(您可以就这一部分与我争论)。
# returns sitemap links
def get_links(s):
old_xml = requests.get(s)
new_xml = old_xml.text
final_xml = BeautifulSoup(new_xml, "lxml")
return final_xml.findAll('loc')
# gets the final url from your middle url and looks through it for the thing you are looking for
def scrapey(link):
link_id = link[link.find("PLAW"):]
r = requests.get('https://www.govinfo.gov/wssearch/getContentDetail?packageId={}'.format(link_id))
print(r.url)
try:
r = requests.get("https://{}".format(r.json()["download"]["modslink"][2:]))
print(r.url)
soup = BeautifulSoup(r.text, "lxml")
references = soup.findAll('identifier', {'type': 'Statute citation'})
for ref in references:
if ref.text == "106 Stat. 4845":
return r.url
else:
return False
except:
print("bah" + r.url)
return False
sitemap_links_el = get_links("https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml")
sitemap_links = map(lambda x: x.text, sitemap_links_el)
nlinks_el = map(get_links, sitemap_links)
links = [num.text for elem in nlinks_el for num in elem]
with open("output.txt", "a") as f:
for link in links:
url = scrapey(link)
if url is False:
print("no find")
else:
print("found on: {}".format(url))
f.write("{}\n".format(url))