Parsing a list of URLs with BeautifulSoup

Asked: 2019-05-12 16:47:15

Tags: python beautifulsoup

I need to extract all headlines about autism from the archives of the newspaper Le Monde (going back to 1980). I am not a programmer, but someone trying to become a "digital" humanist...

I managed to get a list of all the (daily) issues, and, separately, to parse a single URL with soup and extract the headlines. But I cannot get the two to work together. I think my problem lies in the parsing + iteration step, but I cannot solve it.

from bs4 import BeautifulSoup
import requests
import re
from datetime import date, timedelta

start = date(2018, 1, 1)
end = date.today()
all_url =[]

#this chunk is working and returns a nice list of all url of all issues
day = timedelta(days=1)
one_url = "https://www.lemonde.fr/archives-du-monde/"
mydate = start

while mydate < end:
    mydate += day
    if one_url not in all_url:
        all_url.append(one_url + "{date.day:02}/{date.month:02}/{date.year}".format(date=mydate) + '/')

#this function is working as well when applied with one single url
def titles(all_url):

    for url in all_url:
        page = BeautifulSoup(requests.get(url).text, "lxml")

        regexp = re.compile(r'^.*\b(autisme|Autisme)\b.*$')

        for headlines in page.find_all("h3"):
            h = headlines.text

            for m in regexp.finditer(h):
                print(m.group())

titles(all_url)

The script just hangs...

2 Answers:

Answer 0 (score: 1)

The script is not stuck. I added print statements so you can see visually that it is working. At first, though, I thought the problem might be with your regex pattern.

When I actually opened one of those links (https://www.lemonde.fr/archives-du-monde/25/03/2018/), the server responded with a 404 because the page does not exist. Since you generate the page URLs in code, it is quite possible that many of them have no corresponding page on the server side.
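One quick way to confirm this, independent of parsing, is to look at the HTTP status code of the response; a minimal sketch (using requests, which the script already imports):

import requests

test_url = "https://www.lemonde.fr/archives-du-monde/25/03/2018/"
response = requests.get(test_url)
print(response.status_code)  # prints 404 -- no archive page exists at this URL format

Here is the full script again, with print statements added so you can follow its progress: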

from bs4 import BeautifulSoup
import requests
import re
from datetime import date, timedelta

start = date(2018, 1, 1)
end = date.today()
all_url =[]

#this chunk is working and returns a nice list of all url of all issues
day = timedelta(days=1)
one_url = "https://www.lemonde.fr/archives-du-monde/"
mydate = start

while mydate < end:
    mydate += day
    if one_url not in all_url:
        all_url.append(one_url + "{date.day:02}/{date.month:02}/{date.year}".format(date=mydate) + '/')

#this function is working as well when applied with one single url
def titles(all_url):

    counter = 0
    for url in all_url:
        print("[+] (" + str(counter) + ") Fetching URL " + url)
        counter += 1
        page = BeautifulSoup(requests.get(url).text, "lxml")

        regexp = re.compile(r'^.*\b(autisme|Autisme)\b.*$')

        found = False
        for headlines in page.find_all("h3"):
            h = headlines.text

            for m in regexp.finditer(h):
                found = True
                print(m.group())

        if not found:
            print("[-] Can't Find any thing relevant this page....")
            print()

titles(all_url)

Script output:

[+] (0) Fetching URL https://www.lemonde.fr/archives-du-monde/02/01/2018/
[-] Can't Find any thing relevant this page....

[+] (1) Fetching URL https://www.lemonde.fr/archives-du-monde/03/01/2018/
[-] Can't Find any thing relevant this page....

[+] (2) Fetching URL https://www.lemonde.fr/archives-du-monde/04/01/2018/
[-] Can't Find any thing relevant this page....

[+] (3) Fetching URL https://www.lemonde.fr/archives-du-monde/05/01/2018/
[-] Can't Find any thing relevant this page....

[+] (4) Fetching URL https://www.lemonde.fr/archives-du-monde/06/01/2018/
[-] Can't Find any thing relevant this page....

[+] (5) Fetching URL https://www.lemonde.fr/archives-du-monde/07/01/2018/
[-] Can't Find any thing relevant this page....

[+] (6) Fetching URL https://www.lemonde.fr/archives-du-monde/08/01/2018/
[-] Can't Find any thing relevant this page....

[+] (7) Fetching URL https://www.lemonde.fr/archives-du-monde/09/01/2018/
[-] Can't Find any thing relevant this page....

[+] (8) Fetching URL https://www.lemonde.fr/archives-du-monde/10/01/2018/
[-] Can't Find any thing relevant this page....

[+] (9) Fetching URL https://www.lemonde.fr/archives-du-monde/11/01/2018/
[-] Can't Find any thing relevant this page....

[+] (10) Fetching URL https://www.lemonde.fr/archives-du-monde/12/01/2018/
[-] Can't Find any thing relevant this page....

You can verify each URL yourself by checking it in a web browser. Let me know if you need any more help.

Answer 1 (score: 0)

The main problem is that the dates in Le Monde's archive URLs use the format day-month-year, not day/month/year. To fix it, change:

all_url.append(one_url + "{date.day:02}/{date.month:02}/{date.year}".format(date=mydate) + '/')

to:

all_url.append(one_url + "{date.day:02}-{date.month:02}-{date.year}".format(date=mydate) + '/')
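Equivalently, you could build the date part with strftime, which makes the separator explicit; a small sketch of that variant (not part of the original script):

from datetime import date

mydate = date(2018, 3, 25)
url = "https://www.lemonde.fr/archives-du-monde/" + mydate.strftime("%d-%m-%Y") + "/"
print(url)  # https://www.lemonde.fr/archives-du-monde/25-03-2018/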

The impression that the program is stuck comes simply from the lack of feedback. @Zaid's answer shows how to handle that in an elegant way.

If you need a faster way to issue that many HTTP requests, you should consider an asynchronous approach. I suggest using Scrapy, a framework built for exactly this kind of task (web scraping).

I wrote a simple spider that fetches all headlines containing 'autism' from the archive (from the beginning of 2018 until today):

import re
from datetime import date
from datetime import timedelta

import scrapy

BASE_URL = 'https://www.lemonde.fr/archives-du-monde/'


def date_range(start, stop):
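    # Yield every date from start (inclusive) up to stop (exclusive).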
    for d in range((stop - start).days):
        yield start + timedelta(days=d)


class LeMonde(scrapy.Spider):
    name = 'LeMonde'

    def start_requests(self):
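        # Schedule one request per day of the archive, using the day-month-year URL format.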
        for day in date_range(date(2018, 1, 1), date.today()):
            url = BASE_URL + '{d.day:02}-{d.month:02}-{d.year}'.format(d=day) + '/'
            yield scrapy.Request(url)

    def parse(self, response):
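        # Archive headlines are <h3><a> links; keep those whose text mentions autism.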
        for headline in response.xpath('//h3/a/text()').getall():
            headline = headline.strip()

            if 'autism' in headline.lower():
                yield { 'headline': headline }

With the code above I was able to scrape the headlines in 47 seconds. If you are interested, you can run it with:

scrapy runspider spider_file.py -o headlines.csv

This produces a CSV file (headlines.csv) containing the headlines.
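If you then want to post-process the results in Python, a minimal sketch like the following should work (it assumes headlines.csv has the single 'headline' column that the spider yields):

import csv

with open("headlines.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["headline"])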