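I need to scrape all headlines about autism from the archives of the newspaper Le Monde (from 1980 onward). I am not a programmer, but a humanist trying to become "digital"...
I managed to get the list of all (daily) issues and, separately, parsing one URL at a time with BeautifulSoup and extracting the headlines works. But the two together do not. I think my problem is in the parsing + iteration step, but I can't solve it.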
from bs4 import BeautifulSoup
import requests
import re
from datetime import date, timedelta

start = date(2018, 1, 1)
end = date.today()
all_url = []

# this chunk is working and returns a nice list of all url of all issues
day = timedelta(days=1)
one_url = "https://www.lemonde.fr/archives-du-monde/"
mydate = start
while mydate < end:
    mydate += day
    if one_url not in all_url:
        all_url.append(one_url + "{date.day:02}/{date.month:02}/{date.year}".format(date=mydate) + '/')

# this function is working as well when applied with one single url
def titles(all_url):
    for url in all_url:
        page = BeautifulSoup(requests.get(url).text, "lxml")
        regexp = re.compile(r'^.*\b(autisme|Autisme)\b.*$')
        for headlines in page.find_all("h3"):
            h = headlines.text
            for m in regexp.finditer(h):
                print(h)  # loop body is missing in the source; printing the matching headline is an assumption
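Answer 0 (score: 1)
When I actually open one of the generated links (https://www.lemonde.fr/archives-du-monde/25/03/2018/), the server responds with 404 because this page does not exist on the server. Since you build the page URLs in code, it is quite likely that these links have no corresponding page on the server side.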
from bs4 import BeautifulSoup
import requests
import re
from datetime import date, timedelta

start = date(2018, 1, 1)
end = date.today()
all_url = []

# this chunk is working and returns a nice list of all url of all issues
day = timedelta(days=1)
one_url = "https://www.lemonde.fr/archives-du-monde/"
mydate = start
while mydate < end:
    mydate += day
    if one_url not in all_url:
        all_url.append(one_url + "{date.day:02}/{date.month:02}/{date.year}".format(date=mydate) + '/')

# this function is working as well when applied with one single url
def titles(all_url):
    counter = 0
    for url in all_url:
        # report progress so the script does not look stuck
        print("[+] (" + str(counter) + ") Fetching URL " + url)
        counter += 1
        page = BeautifulSoup(requests.get(url).text, "lxml")
        regexp = re.compile(r'^.*\b(autisme|Autisme)\b.*$')
        found = False
        for headlines in page.find_all("h3"):
            h = headlines.text
            for m in regexp.finditer(h):
                found = True
        if not found:
            print("[-] Can't Find any thing relevant this page....")
[+] (0) Fetching URL https://www.lemonde.fr/archives-du-monde/02/01/2018/
[-] Can't Find any thing relevant this page....
[+] (1) Fetching URL https://www.lemonde.fr/archives-du-monde/03/01/2018/
[-] Can't Find any thing relevant this page....
[+] (2) Fetching URL https://www.lemonde.fr/archives-du-monde/04/01/2018/
[-] Can't Find any thing relevant this page....
[+] (3) Fetching URL https://www.lemonde.fr/archives-du-monde/05/01/2018/
[-] Can't Find any thing relevant this page....
[+] (4) Fetching URL https://www.lemonde.fr/archives-du-monde/06/01/2018/
[-] Can't Find any thing relevant this page....
[+] (5) Fetching URL https://www.lemonde.fr/archives-du-monde/07/01/2018/
[-] Can't Find any thing relevant this page....
[+] (6) Fetching URL https://www.lemonde.fr/archives-du-monde/08/01/2018/
[-] Can't Find any thing relevant this page....
[+] (7) Fetching URL https://www.lemonde.fr/archives-du-monde/09/01/2018/
[-] Can't Find any thing relevant this page....
[+] (8) Fetching URL https://www.lemonde.fr/archives-du-monde/10/01/2018/
[-] Can't Find any thing relevant this page....
[+] (9) Fetching URL https://www.lemonde.fr/archives-du-monde/11/01/2018/
[-] Can't Find any thing relevant this page....
[+] (10) Fetching URL https://www.lemonde.fr/archives-du-monde/12/01/2018/
[-] Can't Find any thing relevant this page....
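The 404 responses described in this answer can be detected before any parsing happens. The following is a minimal sketch and not part of the original answer: it uses the standard library's urllib instead of requests so that it is self-contained, and the `fetch_html` name and its `opener` parameter are hypothetical, introduced only for illustration.

```python
import urllib.error
import urllib.request


def fetch_html(url, opener=urllib.request.urlopen):
    """Return the page body as text, or None if the server answers with an HTTP error.

    `opener` is a hypothetical hook (defaults to urllib.request.urlopen) that
    makes it possible to swap in a different fetcher.
    """
    try:
        with opener(url) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses such as the 404 seen above
        print("[!] {} returned HTTP {}".format(url, err.code))
        return None
```

A caller could skip any URL for which `fetch_html` returns `None` instead of handing an error page to BeautifulSoup. With requests, the equivalent check would be on `response.status_code`.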
Answer 1 (score: 0)
The archive URLs separate day, month and year with hyphens, not slashes. Replace
all_url.append(one_url + "{date.day:02}/{date.month:02}/{date.year}".format(date=mydate) + '/')
with
all_url.append(one_url + "{date.day:02}-{date.month:02}-{date.year}".format(date=mydate) + '/')
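The hyphenated format can also be wrapped in a small helper. This is a sketch under stated assumptions rather than the asker's code: `archive_urls` is a hypothetical name, and it assumes every day in the range has an archive page. Unlike the loop in the question, which increments the date before appending and therefore starts on 2 January, this version includes the start date, and it drops the `one_url not in all_url` check, which is always true.

```python
from datetime import date, timedelta

BASE_URL = "https://www.lemonde.fr/archives-du-monde/"


def archive_urls(start, end):
    """Return one archive URL per day, from `start` (inclusive) to `end` (exclusive)."""
    urls = []
    current = start
    while current < end:
        # day-month-year separated by hyphens, e.g. 25-03-2018
        urls.append(BASE_URL + current.strftime("%d-%m-%Y") + "/")
        current += timedelta(days=1)
    return urls
```

For example, `archive_urls(date(2018, 1, 1), date.today())` would cover the same period as the question.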
The feeling that the program is stuck is simply due to the lack of feedback. @Zaid's answer shows how to solve that in an elegant way.
import re
from datetime import date
from datetime import timedelta

import scrapy

BASE_URL = 'https://www.lemonde.fr/archives-du-monde/'


def date_range(start, stop):
    # yield every date from start (inclusive) to stop (exclusive)
    for d in range((stop - start).days):
        yield start + timedelta(days=d)


class LeMonde(scrapy.Spider):
    name = 'LeMonde'

    def start_requests(self):
        # one request per daily archive page
        for day in date_range(date(2018, 1, 1), date.today()):
            url = BASE_URL + '{d.day:02}-{d.month:02}-{d.year}'.format(d=day) + '/'
            yield scrapy.Request(url)

    def parse(self, response):
        # keep only the headlines that mention autism, case-insensitively
        for headline in response.xpath('//h3/a/text()').getall():
            headline = headline.strip()
            if 'autism' in headline.lower():
                yield { 'headline': headline }
scrapy runspider spider_file.py -o headlines.csv
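Once the command above has written headlines.csv, the results can be read back with the standard csv module. This is a minimal sketch, assuming the file has a header row with a single `headline` column, as the dictionaries yielded by the spider suggest; `load_headlines` is a hypothetical helper name.

```python
import csv


def load_headlines(csv_file):
    """Return the values of the 'headline' column from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return [row["headline"] for row in reader]
```

It could be called on a file opened with `open("headlines.csv", newline="", encoding="utf-8")`.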