Question

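I am trying to scrape only certain articles from this main page. More specifically, I am trying to scrape articles only from the sub-page Media and its sub-sub-pages Press releases; Governing Council decisions; Press conferences; Monetary policy accounts; Speeches; Interviews, and only those that are in English.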

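Based on some tutorials and other SE:overflow answers, I managed to put together code that scrapes absolutely everything from the website. My original idea was to scrape everything first and clean the output later in a data frame, but the website contains so much that the script always freezes after a while.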

Getting the sub-links:

import requests
import re
from bs4 import BeautifulSoup
master_request = requests.get("https://www.ecb.europa.eu/")
base_url = "https://www.ecb.europa.eu"
master_soup = BeautifulSoup(master_request.content, 'html.parser')
master_atags = master_soup.find_all("a", href=True)
master_links = [ ] 
sub_links = {}
for master_atag in master_atags:
    master_href = master_atag.get('href')
    master_href = base_url + master_href
    print(master_href)
    master_links.append(master_href)
    sub_request = requests.get(master_href)
    sub_soup = BeautifulSoup(sub_request.content, 'html.parser')
    sub_atags = sub_soup.find_all("a", href=True)
    sub_links[master_href] = []
    for sub_atag in sub_atags:
        sub_href = sub_atag.get('href')
        sub_links[master_href].append(sub_href)
        print("\t"+sub_href)

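One thing I tried was changing the base link to the sub-links. My idea was that maybe I could process each sub-page separately and put the links together afterwards, but that did not work. The other thing I tried was replacing line 17 with the following: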

sub_atags = sub_soup.find_all("a", {'class': ['doc-title']}, href=True)

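This seemed to partially solve my problem: even though it did not get links only from the sub-pages, it at least ignored links that are not "doc-title", which are all the links with text on the website. It was still too much, though, and some links were not retrieved correctly.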

I also tried the following:

for master_atag in master_atags:
    master_href = master_atag.get('href')
    for href in master_href:
        master_href = [base_url + master_href if str(master_href).find(".en") in master_herf
    print(master_href)

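I thought that, because all hrefs pointing to English documents contain .en, this would give me only the links where .en occurs somewhere in the href. Instead, this code gives me a syntax error for print(master_href) that I don't understand, because print(master_href) worked before.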

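As a next step, I would like to extract the following information from the sub-links. This part of the code works when I test it on a single link, but I never had the chance to try it together with the code above, because that code never finishes running. Will this work once I manage to get a correct list of all the links?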

# sublinks: the list of article URLs collected in the previous step
for link in sublinks:
    resp = requests.get(link)
    soup = BeautifulSoup(resp.content, 'html5lib')
    article = soup.find('article')
    title = soup.find('title')
    textdate = soup.find('h2')
    paragraphs = article.find_all('p')
    matches = re.findall(r'(\d{2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})', str(textdate))
    for match in matches:
        print(match[0])
        datadate = match[0]
import pandas as pd
# builds a one-row frame from the last article processed
ecbdf = pd.DataFrame({"Article": [article], "Title": [title], "Text": [paragraphs], "date": datadate})

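Also, taking a step back: since the first approach with Beautiful Soup did not work for me, I tried to approach the problem differently. The website has RSS feeds, so I wanted to use the following code: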

import feedparser
import pandas as pd
import requests
rss_url='https://www.ecb.europa.eu/home/html/rss.en.html'
ecb_feed = feedparser.parse(rss_url)
# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize import
df_ecb_feed = pd.json_normalize(ecb_feed.entries)
df_ecb_feed.head()

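Here I ran into the problem of not even being able to find the RSS feed URL in the first place. I viewed the page source, searched for "RSS", and tried every URL I could find that way, but I always get an empty dataframe.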

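I am a beginner at web scraping and at this point I don't know how to proceed or how to tackle this problem. In the end, all I want to accomplish is to collect all articles from the sub-pages with their titles, dates and authors and put them into one dataframe.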

Answer 1

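The biggest problem you have with scraping this site is probably the lazy loading: using JavaScript, the site loads the articles from several HTML pages and merges them into the list. For details, look for index_include in the source code. This is a problem when scraping with only requests and BeautifulSoup, because what your soup instance gets from the request content is just the basic skeleton without the list of articles. You have two options: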

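Instead of the main article list pages (Press releases, Interviews, etc.), use the lazy-loaded article lists, e.g. /press/pr/date/2019/html/index_include.en.html. This is probably the easier option, but you have to do it for each year you are interested in.
Use a client that can execute JavaScript, such as Selenium, to obtain the HTML instead of requests.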
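
The first option can be sketched roughly as follows. This is a minimal sketch rather than a tested scraper: the helper name index_include_urls is hypothetical, and only the /press/pr path is taken from the example above. Whether the other sections follow the same /date/<year>/html/ layout is an assumption that would need to be checked in the page source.

```python
BASE_URL = "https://www.ecb.europa.eu"


def index_include_urls(section_path, years):
    """Build one lazy-loaded list URL per year for a section (hypothetical helper)."""
    return [
        f"{BASE_URL}{section_path}/date/{year}/html/index_include.en.html"
        for year in years
    ]


# Press releases for 2018 and 2019; the path pattern follows the example above.
press_release_urls = index_include_urls("/press/pr", range(2018, 2020))
for url in press_release_urls:
    print(url)
```

Each of these URLs could then be fetched with requests.get and parsed with BeautifulSoup as in the question, since the article list should already be present in that HTML.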

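Beyond that, I would suggest using CSS selectors to extract information from the HTML code. That way you only need a few lines for the article part. Also, I don't think you have to filter for English articles if you scrape the index.en.html pages, because they show English by default and list other languages in addition where available.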
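
If a language filter is still wanted, as attempted in the question, the condition belongs at the end of the list comprehension, after the for clause. The unclosed bracket and incomplete conditional in the question's attempt are most likely why the error is reported on the following print line. A minimal sketch, assuming English pages are marked by .en. in the href; the hrefs below are made-up examples and english_links is a hypothetical helper. urljoin is used instead of string concatenation so that relative and absolute hrefs are both handled:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ecb.europa.eu"


def english_links(hrefs, base_url=BASE_URL):
    """Return absolute URLs for hrefs that contain the '.en.' language marker."""
    return [urljoin(base_url, href) for href in hrefs if ".en." in href]


# Made-up hrefs for illustration only.
sample_hrefs = [
    "/press/pr/html/index.en.html",
    "/press/pr/html/index.de.html",
    "https://www.ecb.europa.eu/press/key/html/index.en.html",
]
links = english_links(sample_hrefs)
print(links)
```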

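Here is an example I put together quickly. It can certainly be optimized, but it shows how to load the pages with Selenium and extract the article URLs and article contents: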

from bs4 import BeautifulSoup
from selenium import webdriver

base_url = 'https://www.ecb.europa.eu'
urls = [
    f'{base_url}/press/pr/html/index.en.html',
    f'{base_url}/press/govcdec/html/index.en.html'
]
driver = webdriver.Chrome()

for url in urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')

    for anchor in soup.select('span.doc-title > a[href]'):
        driver.get(f'{base_url}{anchor["href"]}')
        article_soup = BeautifulSoup(driver.page_source, 'html.parser')

        title = article_soup.select_one('h1.ecb-pressContentTitle').text
        date = article_soup.select_one('p.ecb-publicationDate').text
        paragraphs = article_soup.select('div.ecb-pressContent > article > p:not([class])')
        content = '\n\n'.join(p.text for p in paragraphs)

        print(f'title: {title}')
        print(f'date: {date}')
        print(f'content: {content[0:80]}...')

I get the following output for the Press releases page:

title: ECB appoints Petra Senkovic as Director General Secretariat and Pedro Gustavo Teixeira as Director General Secretariat to the Supervisory Board                         
date: 20 December 2019                                    
content: The European Central Bank (ECB) today announced the appointments of Petra Senkov...

title: Monetary policy decisions                          
date: 12 December 2019                                    
content: At today’s meeting the Governing Council of the European Central Bank (ECB) deci...
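
Since the goal in the question is a data frame, the printed values could instead be collected as one record per article. A minimal sketch, assuming dates always look like 20 December 2019 as in the output above and that month names are parsed under an English locale; make_record is a hypothetical helper and the sample values are shortened from the output above:

```python
from datetime import datetime


def make_record(title, date_text, content):
    """Strip surrounding whitespace and parse a date like '20 December 2019'."""
    published = datetime.strptime(date_text.strip(), "%d %B %Y").date()
    return {"title": title.strip(), "date": published, "content": content}


records = [
    make_record(
        "Monetary policy decisions   ",
        "12 December 2019   ",
        "At today's meeting ...",
    ),
]
print(records[0]["date"].isoformat())
```

The resulting list of dicts could then be passed to pd.DataFrame to get one row per article.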

Finding the correct elements for scraping a website
