How to scrape the date links on this website: https://flight-data.adsbexchange.com/activity?inputSelect=registration&registration=N12345

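Asked: 2017-11-15 02:41:18

Tags: python web-scraping screen-scraping

I am trying to print the dates from the list of links at the bottom of this website. I don't know what is going wrong, because no error is raised. I first tried simpler approaches that work on the New York Times website for retrieving every href, but they did not work here, so I looked into setting a user agent.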

# Python 2: FancyURLopener lives in urllib (in Python 3 it moved to urllib.request and is deprecated)
from urllib import FancyURLopener

from bs4 import BeautifulSoup


class MyOpener(FancyURLopener):
    # Override the default User-Agent string sent with each request
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'


myopener = MyOpener()
page = myopener.open('https://flight-data.adsbexchange.com/activity?inputSelect=registration&registration=N12345')
# Read the body once and keep it; a second read() on the same response returns nothing
html = page.read()
soup = BeautifulSoup(html, "lxml")
# Print the href of every <a> element on the page
for line in soup.find_all('a'):
    print(line.get('href'))
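
For readers on Python 3, where urllib2 no longer exists and FancyURLopener is deprecated, the User-Agent override from the code above could be sketched with urllib.request.Request instead. This is only a minimal sketch: build_request and fetch_html are hypothetical helper names, the UTF-8 decoding and the 30 second timeout are assumptions, and whether this site actually requires a browser-like User-Agent is not confirmed here.

```python
from urllib.request import Request, urlopen

PAGE_URL = "https://flight-data.adsbexchange.com/activity?inputSelect=registration&registration=N12345"
# User-Agent string copied from the question; any other value is equally hypothetical.
USER_AGENT = "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object carrying a custom User-Agent header (hypothetical helper)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Download the page once and return its text (assumes the body is UTF-8)."""
    with urlopen(build_request(url), timeout=30) as response:
        # read() is called exactly once and its result is kept
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_html(PAGE_URL) would return the page source as a string, which can then be handed to BeautifulSoup.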

1 Answer:

Answer 0 (score: 0)

Run the following script. It should give you all the links you want:

from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

page_url = "https://flight-data.adsbexchange.com/activity?inputSelect=registration&registration=N12345"
page = requests.get(page_url).text
soup = BeautifulSoup(page, "lxml")
for items in soup.select(".dates"):
    print(urljoin(page_url,items['href']))

Partial output:

https://flight-data.adsbexchange.com/map?icao=A061D9&date=2017-11-14
https://flight-data.adsbexchange.com/map?icao=A061D9&date=2017-11-09
https://flight-data.adsbexchange.com/map?icao=A061D9&date=2017-11-08
https://flight-data.adsbexchange.com/map?icao=A061D9&date=2017-11-05
https://flight-data.adsbexchange.com/map?icao=A061D9&date=2017-10-31
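
To see what the two key steps in the script do (picking out elements with the class "dates" and resolving each href against the page URL with urljoin), here is an offline sketch that needs no network access. It swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and unlike soup.select(".dates") it only looks at <a> tags. SAMPLE_HTML is made-up markup, so the real page structure is an assumption, and DateLinkParser and extract_date_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE_URL = "https://flight-data.adsbexchange.com/activity?inputSelect=registration&registration=N12345"

# Hypothetical markup standing in for the real page, whose structure is an assumption.
SAMPLE_HTML = """
<a href="/about">About</a>
<a class="dates" href="map?icao=A061D9&amp;date=2017-11-14">2017-11-14</a>
<a class="dates" href="map?icao=A061D9&amp;date=2017-11-09">2017-11-09</a>
"""


class DateLinkParser(HTMLParser):
    """Collect absolute URLs from <a> tags whose class list contains "dates"."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "dates" in classes and attributes.get("href"):
            # urljoin turns a relative href into an absolute URL based on the page URL
            self.links.append(urljoin(self.base_url, attributes["href"]))


def extract_date_links(html, base_url):
    """Return the absolute URLs of all date links found in the given HTML."""
    parser = DateLinkParser(base_url)
    parser.feed(html)
    return parser.links


for link in extract_date_links(SAMPLE_HTML, PAGE_URL):
    print(link)
```

The plain "/about" link is skipped because it has no "dates" class, which is why the script prints only the date links rather than every href on the page.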