Why am I having trouble scraping this website with Python?

Asked: 2021-04-12 13:28:21

Tags: python web-scraping

I am new to Python, and I am trying to scrape this website. What I want to do is get the dates and the article titles from it. I followed a procedure I found on SO, shown below:

from bs4 import BeautifulSoup
import requests


url = "https://www.ecb.europa.eu/press/inter/html/index.en.html"
res = requests.get(url)
soup = BeautifulSoup(res.text)

movies = soup.select(".title a , .date")
print(movies)

movies_titles = [title.text for title in movies]
movies_links = ["http://www.ecb.europa.eu"+ title["href"] for title in movies]
print(movies_titles)
print(movies_links)

I got .title a , .date by using SelectorGadget on the URL I shared. However, print(movies) is empty. What am I doing wrong?

Can anyone help me?

Thanks!

2 Answers:

Answer 0 (score: 1)

The content is not part of index.en.html; it is loaded by JavaScript from:

https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html

There you cannot select title and date as a pair, so you need to select the titles and the dates separately:

titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))

Then you can print them like this:

movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)

movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)

Result:

['Christine Lagarde:\xa0Interview with CNBC', 'Fabio Panetta:\xa0Interview with El País ', 'Isabel Schnabel:\xa0Interview with Der Spiegel', 'Philip R. Lane:\xa0Interview with CNBC', 'Frank Elderson:\xa0Q&A on Twitter', 'Isabel Schnabel:\xa0Interview with Les Echos ', 'Philip R. Lane:\xa0Interview with the Financial Times', 'Luis de Guindos:\xa0Interview with Público', 'Philip R. Lane:\xa0Interview with Expansión', 'Isabel Schnabel:\xa0Interview with LETA', 'Fabio Panetta:\xa0Interview with Der Spiegel', 'Christine Lagarde:\xa0Interview with Le Journal du Dimanche ', 'Philip R. Lane:\xa0Interview with Süddeutsche Zeitung', 'Isabel Schnabel:\xa0Interview with Deutschlandfunk', 'Philip R. Lane:\xa0Interview with SKAI TV', 'Isabel Schnabel:\xa0Interview with Der Standard']
['http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210412~ccd1b7c9bf.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210411~44ade9c3b5.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210409~c8c348a12c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210323~e4026c61d1.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317_1~1d81212506.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317~458636d643.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210316~930d09ce3c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210302~c793ad7b68.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210226~79eba6f9fb.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210225~5f1be75a9f.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210209~af9c628e30.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210207~f6e34f3b90.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131_1~650f5ce5f7.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131~13d84cb9b2.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210127~9ad88eb038.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210112~1c3f989acd.en.html']

Full code:

from bs4 import BeautifulSoup
import requests

url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")  # pass a parser explicitly to avoid bs4's warning

titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))

movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)

movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)
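Since the dates are selected too, the same zip can carry them along, so each interview ends up as one record of date, title, and link. Here is a minimal sketch of that idea, run against a small hypothetical HTML snippet (the tag layout is only assumed to mirror index_include.en.html, not copied from it):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real index_include.en.html markup
html = """
<dt class="date">12 April 2021</dt>
<dd><div class="title"><a href="/press/inter/date/2021/html/a.en.html">Interview A</a></div></dd>
<dt class="date">11 April 2021</dt>
<dd><div class="title"><a href="/press/inter/date/2021/html/b.en.html">Interview B</a></div></dd>
"""

soup = BeautifulSoup(html, "html.parser")
titles = soup.select(".title a")
dates = soup.select(".date")

# One dict per interview: date, title, and absolute link together
records = [
    {
        "date": date.text,
        "title": title.text,
        "link": "http://www.ecb.europa.eu" + title["href"],
    }
    for title, date in zip(titles, dates)
]

for rec in records:
    print(rec["date"], "-", rec["title"], "-", rec["link"])
```

Replacing the inline html string with res.text from the real request should give the same structure for the live page.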

Answer 1 (score: 0)

I suggest using Selenium for Python.

Try something like this:

from selenium.webdriver import Chrome
from selenium.common.exceptions import NoSuchElementException

url = "https://www.ecb.europa.eu/press/inter/html/index.en.html"
browser = Chrome()
browser.get(url)
interviews = browser.find_elements_by_class_name('title')

links = []
for interview in interviews:
    try:
        anchor = interview.find_element_by_tag_name('a')
        link = anchor.get_attribute('href')
        links.append(link)
    except NoSuchElementException:
        pass

links will contain the links to all the interviews. You can do something similar for the dates.
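Once the dates are collected the same way, zipping the two lists gives date/link pairs. A minimal sketch of that last step, using placeholder lists standing in for what the Selenium calls would return:

```python
# Placeholder data standing in for the results of
# browser.find_elements_by_class_name('date') / the links loop above
dates = ["12 April 2021", "11 April 2021"]
links = [
    "http://www.ecb.europa.eu/press/inter/date/2021/html/a.en.html",
    "http://www.ecb.europa.eu/press/inter/date/2021/html/b.en.html",
]

# Pair each date with its interview link
interviews = dict(zip(dates, links))
for date, link in interviews.items():
    print(date, "->", link)
```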
