I searched for a keyword (cybersecurity) on a newspaper website and it returned about 10 articles. I want my code to grab each article's link, follow it to fetch the full article, and repeat this for all 10 articles on the page. (I don't want summaries; I want the full articles.)
import urllib.request
import ssl
import time
from bs4 import BeautifulSoup

ssl._create_default_https_context = ssl._create_unverified_context

pages = [1]
for page in pages:
    data = urllib.request.urlopen("https://www.japantimes.co.jp/tag/cybersecurity/page/{}".format(page))
    soup = BeautifulSoup(data, 'html.parser')
    for article in soup.find_all('div', class_="content_col"):
        link = article.p.find('a')
        print(link.attrs['href'])
        for link in links:
            headline = link.h1.find('div', class_="padding_block")
            headline = headline.text
            print(headline)
            content = link.p.find_all('div', class_="entry")
            content = content.text
            print(content)
            print()
    time.sleep(3)
This doesn't work. Adding this line:

date = link.li.find('time', class_="post_time")

raises the error:

AttributeError: 'NoneType' object has no attribute 'find'
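That AttributeError means one of the chained lookups returned None: `link` is an `<a>` tag with no `<li>` descendant, so `link.li` evaluates to None, and calling `.find()` on None fails. The usual fix is to search from an element that can actually contain the date and to guard against a missed match. A minimal sketch, assuming the listing page marks dates with a `<time class="post_time">` tag somewhere inside each article block (that selector is an assumption about the markup, not something confirmed by the question):

# Guarded lookup: find() returns None when nothing matches,
# so test it before chaining another attribute access.
# 'time' / "post_time" is a hypothetical selector for this page.
time_tag = article.find('time', class_="post_time")
date = time_tag.text if time_tag is not None else None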
The following code does run and fetches all the article links. I want to add code that follows each article link and extracts its headline and content.
import urllib.request
import ssl
import time
from bs4 import BeautifulSoup

ssl._create_default_https_context = ssl._create_unverified_context

pages = [1]
for page in pages:
    data = urllib.request.urlopen("https://www.japantimes.co.jp/tag/cybersecurity/page/{}".format(page))
    soup = BeautifulSoup(data, 'html.parser')
    for article in soup.find_all('div', class_="content_col"):
        link = article.p.find('a')
        print(link.attrs['href'])
        print()
    time.sleep(3)
Answer 0 (score: 2)
Try the script below. It will fetch all the titles and their content for you. Set `pages` to the maximum number of pages you want to traverse.
import requests
from bs4 import BeautifulSoup

url = 'https://www.japantimes.co.jp/tag/cybersecurity/page/{}'
pages = 4  # maximum number of listing pages to traverse

for page in range(1, pages + 1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    # Each search result links to the full article from within its header.
    for item in soup.select(".content_col header p > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        title = sauce.select_one("header h1").text
        # Collect the text of every paragraph in the article body.
        content = [elem.text for elem in sauce.select("#jtarticle p")]
        print(f'{title}\n{content}\n')
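Note that the CSS selectors `.content_col header p > a` and `#jtarticle p` are tied to The Japan Times' markup at the time of writing and will break if the site changes. If you prefer the article body as a single string rather than a list of paragraph texts, a small variation (same markup assumptions) is:

content = "\n".join(elem.text for elem in sauce.select("#jtarticle p"))

It may also be worth keeping the `time.sleep(3)` between requests from the question's code so the scraper doesn't hammer the server.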