Small Python 3 script to fetch URLs from a given website

Date: 2017-07-24 13:07:56

Tags: python url fetch

I would like to fetch certain links from a website using Python 3. I tried to write the script myself but failed (being a beginner).

I would like the script to do the following:

  1. Ask me for a URL (e.g. https://familysearch.org/search/image/index?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210).
  2. Ask me for keyword(s) (case-insensitive, but space-sensitive!), such as "matrimonios 2000", to look for the matching links on the given site.
  3. Fetch all URLs whose link names contain "matrimonios 2000" (in this example that would be 27 URLs, named "Matrimonios 2000 vol 1" up to "Matrimonios 2000 vol 14").
  4. Save the matching URLs, one per line, to a file named "urls.txt" in the same folder the script runs from.
  5. This is my code so far:

    #!/usr/bin/env python3

    from selenium import webdriver

    url = input('Please, enter url: ')
    keyword = input('Type keyword(s): ')  # input() already returns a string in Python 3

    driver = webdriver.Firefox()
    driver.get(url)

    # Collect every link on the page and keep the ones whose text contains the keyword
    links = driver.find_elements_by_xpath('//a')

    with open('urls.txt', 'w') as f:
        for link in links:
            if keyword.lower() in link.text.lower():
                href = link.get_attribute('href')
                print(href)
                f.write(href + '\n')

    driver.quit()

1 Answer:

Answer (score: 1):

The general answer is:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://familysearch.org/search/image/index?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210'
    keyword = 'matrimonios 2000'

    html = requests.get(url).content
    soup = BeautifulSoup(html, 'html.parser')

    # Print every link whose text contains the keyword (case-insensitive)
    for link in soup.select('a'):
        text = link.getText().lower()
        if keyword in text:
            print(link['href'])

This will list all URLs from the links in a plain HTML page, matching the keyword in a case-insensitive but space-sensitive way.
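To also cover the prompting and the urls.txt output from the question (points 1, 2 and 4), the same idea can be wrapped up like this. This is only a minimal, untested sketch for the plain-HTML case, not something the site above will actually work with (see below):

    import requests
    from bs4 import BeautifulSoup

    url = input('Please, enter url: ')
    keyword = input('Type keyword(s): ').lower()

    html = requests.get(url).content
    soup = BeautifulSoup(html, 'html.parser')

    # Collect matching hrefs and write them, one per line, to urls.txt
    with open('urls.txt', 'w') as f:
        for link in soup.select('a'):
            if keyword in link.getText().lower():
                f.write(link['href'] + '\n')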

However, if you try to parse the site you mentioned, you will find that it loads the actual content via AJAX, so the URL you linked does not actually contain the data you are looking for. Instead, the page sends a POST request to https://familysearch.org/search/filmdatainfo with this payload:

{"type":"browse-data","args":{"waypointURL":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210","state":{"owc":"Q69L-N6T:116559001,116559002,116559003?cc=1601210","imageOrFilmUrl":"/search/image/index","viewMode":"i","selectedImageIndex":-1,"openWaypointContext":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210"},"locale":"en"}}

which returns a JSON document you can parse. They seem to be trying to prevent you from doing this, so the easiest approach is to use Chrome's "Copy as cURL" feature to obtain a working request:

    curl 'https://familysearch.org/search/filmdatainfo' -H 'Origin: https://familysearch.org' -H 'Accept-Encoding: gzip, deflate, br' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36' -H 'Content-Type: application/json' -H 'accept: application/json' -H 'Referer: https://familysearch.org/search/image/index?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210' -H 'Cookie: fssessionid=USYS45D0C1B6E2A42A66B9E4C9F1D0935D2F_idses-prod05.a.fsglobal.net; fs_experiments=u%3D-anon-%2Ca%3Dshared-ui%2Cs%3D23d64fb841c59b75c0737db6b5dd47d0%2Cv%3D11111011110000000000000000000000000000000000000000000000000001011000%2Cb%3D49%26a%3Dsearch%2Cs%3D47d3688c3fc1adc06dc151194bb6e298%2Cv%3D110000001011001110100%2Cb%3D50; fs-tf=1' -H 'Connection: keep-alive' --data-binary '{"type":"browse-data","args":{"waypointURL":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210","state":{"owc":"Q69L-N6T:116559001,116559002,116559003?cc=1601210","imageOrFilmUrl":"/search/image/index","viewMode":"i","selectedImageIndex":-1,"openWaypointContext":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210"},"locale":"en"}}' --compressed

You can pipe the output into a file and then load it:

    import json

    # Load the JSON document saved from the cURL output
    with open('data.json') as f:
        x = json.load(f)
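If you would rather skip the curl-and-file step, the same POST can in principle be reproduced directly from Python with requests. This is an untested sketch: the payload mirrors the one shown above, the headers and the fssessionid cookie placeholder would have to be copied from the cURL command, and the site may still reject scripted requests:

    import requests

    payload = {
        "type": "browse-data",
        "args": {
            "waypointURL": "/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210",
            "state": {
                "owc": "Q69L-N6T:116559001,116559002,116559003?cc=1601210",
                "imageOrFilmUrl": "/search/image/index",
                "viewMode": "i",
                "selectedImageIndex": -1,
                "openWaypointContext": "/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210",
            },
            "locale": "en",
        },
    }

    headers = {
        "accept": "application/json",
        "Origin": "https://familysearch.org",
        # A valid session cookie from your browser is likely required;
        # copy it from the cURL command above (placeholder shown here).
        "Cookie": "fssessionid=...",
    }

    # Send the same POST request the page itself makes and parse the JSON response
    r = requests.post('https://familysearch.org/search/filmdatainfo', json=payload, headers=headers)
    x = r.json()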

Either way, x will be a dictionary with a key containers, whose value is a list of dictionaries holding all the URLs and titles; each entry looks like this:

{"url":"https://www.familysearch.org/recapi/waypoints/Q69G-SJC:116559001,116559002,116559003,122762601?cc=1601210","title":"Matrimonios 1879-1888"}

You can then go through these at your leisure.
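To tie this back to the original question, here is a minimal, untested sketch that filters the containers list by the keyword and writes the matching URLs to urls.txt; the url and title keys are taken from the example entry above:

    keyword = 'matrimonios 2000'

    # Keep only the entries whose title contains the keyword (case-insensitive)
    matches = [c for c in x['containers'] if keyword in c['title'].lower()]

    # Write the matching URLs, one per line, to urls.txt
    with open('urls.txt', 'w') as f:
        for c in matches:
            f.write(c['url'] + '\n')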