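I would like to collect certain links from a website using a script written in Python 3. I tried to write it myself, but I failed (being a beginner).
I would like the script to be able to do the following:
This is my code so far: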
#!/usr/bin/env python3
import urllib2
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
url = input('Please, enter url: ')
try:
    keyword = string(input('Type keyword(s): '))
except ValueError:
    print('You must enter a string value.')
driver = webdriver.Firefox()
urls = driver.find_elements_by_xpath('keyword')
for url in urls:
    print url.get_attribute("href")
file = open('urls.txt', 'w')
f.write(url)
f.close()
Answer 0 (score: 1)
The general answer is:
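import requests
from bs4 import BeautifulSoup

url = 'https://familysearch.org/search/image/index?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210'
keyword = 'matrimonios 2000'
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
# Only consider <a> elements that actually carry an href attribute.
for link in soup.select('a[href]'):
    # Lowercase both sides so the match is case-insensitive.
    text = link.get_text().lower()
    if keyword.lower() in text:
        print(link['href'])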
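This lists the URL of every link in a plain HTML file whose link text contains the keyword. The match is case-insensitive but whitespace-sensitive.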
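To make the matching behaviour concrete, here is a minimal sketch that swaps BeautifulSoup for `html.parser` from the standard library, so it runs without extra packages. The `LinkCollector` class, the `find_links` helper, and the sample HTML are all made up for illustration; they are not part of the site or of any library.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((self._href, ''.join(self._text)))
            self._href = None


def find_links(html, keyword):
    """Return hrefs whose link text contains keyword, ignoring case."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    needle = keyword.lower()
    return [href for href, text in collector.links if needle in text.lower()]


# Made-up sample page, only for illustration.
sample = (
    '<a href="/a">Matrimonios 2000</a>'
    '<a href="/b">MATRIMONIOS  2000</a>'   # two spaces: will not match
    '<a href="/c">Bautismos 1999</a>'
)
matches = find_links(sample, 'matrimonios 2000')
print(matches)
```

Only `/a` is reported: the second link differs from the keyword by an extra space, which is what "whitespace-sensitive" means here.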
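However, if you try to parse the site you listed, you will find that it loads the actual content using AJAX. The URL you linked does not actually contain the data you are looking for. The page only sends a POST request to https://familysearch.org/search/filmdatainfo
with this payload: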
{"type":"browse-data","args":{"waypointURL":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210","state":{"owc":"Q69L-N6T:116559001,116559002,116559003?cc=1601210","imageOrFilmUrl":"/search/image/index","viewMode":"i","selectedImageIndex":-1,"openWaypointContext":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210"},"locale":"en"}}
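That request returns a JSON document you can parse. They seem to be trying to prevent you from doing this, so the easiest way is to use Chrome's "Copy as cURL" feature, which gives you this: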
curl 'https://familysearch.org/search/filmdatainfo' -H 'Origin: https://familysearch.org' -H 'Accept-Encoding: gzip, deflate, br' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36' -H 'Content-Type: application/json' -H 'accept: application/json' -H 'Referer: https://familysearch.org/search/image/index?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210' -H 'Cookie: fssessionid=USYS45D0C1B6E2A42A66B9E4C9F1D0935D2F_idses-prod05.a.fsglobal.net; fs_experiments=u%3D-anon-%2Ca%3Dshared-ui%2Cs%3D23d64fb841c59b75c0737db6b5dd47d0%2Cv%3D11111011110000000000000000000000000000000000000000000000000001011000%2Cb%3D49%26a%3Dsearch%2Cs%3D47d3688c3fc1adc06dc151194bb6e298%2Cv%3D110000001011001110100%2Cb%3D50; fs-tf=1' -H 'Connection: keep-alive' --data-binary '{"type":"browse-data","args":{"waypointURL":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210","state":{"owc":"Q69L-N6T:116559001,116559002,116559003?cc=1601210","imageOrFilmUrl":"/search/image/index","viewMode":"i","selectedImageIndex":-1,"openWaypointContext":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210"},"locale":"en"}}' --compressed
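If you would rather stay in Python than shell out to curl, the same POST could be sketched with `urllib.request`. This is only a sketch under assumptions: `build_payload`, `build_request`, and `fetch` are hypothetical helper names, the payload layout is copied from the one shown above, and the server may well reject the request without the session cookie and other headers that the curl command carries.

```python
import json
import urllib.request

ENDPOINT = 'https://familysearch.org/search/filmdatainfo'


def build_payload(owc):
    """Build the browse-data payload shown above for one waypoint id."""
    waypoint = '/recapi/waypoints/' + owc
    return {
        'type': 'browse-data',
        'args': {
            'waypointURL': waypoint,
            'state': {
                'owc': owc,
                'imageOrFilmUrl': '/search/image/index',
                'viewMode': 'i',
                'selectedImageIndex': -1,
                'openWaypointContext': waypoint,
            },
            'locale': 'en',
        },
    }


def build_request(owc, referer):
    """Prepare (but do not send) the POST request."""
    body = json.dumps(build_payload(owc)).encode('utf-8')
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Referer': referer,
    }
    return urllib.request.Request(ENDPOINT, data=body, headers=headers,
                                  method='POST')


def fetch(request):
    """Send the request and decode the JSON reply (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


owc = 'Q69L-N6T:116559001,116559002,116559003?cc=1601210'
referer = ('https://familysearch.org/search/image/index'
           '?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210')
request = build_request(owc, referer)
print(request.get_method(), request.full_url)
```

Calling `fetch(request)` would perform the actual POST. It is left uncalled here because whether the site accepts a request without the browser's cookies is not something the curl output tells us.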
You can pipe the curl output to a file (for example data.json) and then load it:
import json
with open('data.json') as f:
    x = json.load(f)
x will be a dictionary with a key named containers, which holds a list of dictionaries containing all the URLs and titles. Each one looks like this:
{"url":"https://www.familysearch.org/recapi/waypoints/Q69G-SJC:116559001,116559002,116559003,122762601?cc=1601210","title":"Matrimonios 1879-1888"}
You can loop over that list at your leisure.
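As a rough sketch of that last step, the loop below filters the `containers` list by title. The `matching_urls` helper is a hypothetical name, and the inline `raw` string is a stand-in for the saved response: its first entry is the one shown above, while the second entry is made up purely so the filter has something to exclude.

```python
import json

# Stand-in for the contents of data.json. The first entry is the example
# shown above; the second entry is invented for illustration only.
raw = '''
{"containers": [
  {"url": "https://www.familysearch.org/recapi/waypoints/Q69G-SJC:116559001,116559002,116559003,122762601?cc=1601210",
   "title": "Matrimonios 1879-1888"},
  {"url": "https://example.invalid/made-up-entry",
   "title": "Bautismos 1900-1910"}
]}
'''


def matching_urls(data, keyword):
    """Return the url of every container whose title contains keyword."""
    needle = keyword.lower()
    return [entry['url'] for entry in data['containers']
            if needle in entry['title'].lower()]


data = json.loads(raw)
urls = matching_urls(data, 'matrimonios')
for url in urls:
    print(url)
```

From here, writing the results to a file such as urls.txt, as the question intended, is one `f.write(url + '\n')` per entry inside a `with open('urls.txt', 'w') as f:` block.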