I have successfully scraped the headlines and links.
I want to replace the summary tag with the main article from each link (because the headline and summary are always the same).
link = "https://www.vanglaini.org" + article.a['href']
(for example https://www.vanglaini.org/tualchhung/103834)
Please help me modify my code.
Below is my code.
import pandas as pd
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')

list_with_headlines = []
list_with_summaries = []
list_with_links = []

for article in soup.find_all('article'):
    if article.a is None:
        continue
    headline = article.a.text.strip()
    summary = article.p.text.strip()
    link = "https://www.vanglaini.org" + article.a['href']
    list_with_headlines.append(headline)
    list_with_summaries.append(summary)
    list_with_links.append(link)

news_csv = pd.DataFrame({
    'Headline': list_with_headlines,
    'Summary': list_with_summaries,
    'Link': list_with_links,
})

print(news_csv)
news_csv.to_csv('test.csv')
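One small pandas detail worth noting here: `DataFrame.to_csv` writes the numeric row index as an unnamed first column by default; passing `index=False` keeps only the data columns. A minimal sketch with placeholder rows (the real rows come from the scraping loop above):

```python
import pandas as pd

# Placeholder data standing in for the scraped headlines/summaries/links.
news_csv = pd.DataFrame({
    'Headline': ['Sample headline'],
    'Summary': ['Sample summary'],
    'Link': ['https://www.vanglaini.org/tualchhung/103834'],
})

# index=False omits the extra unnamed index column from the output file.
news_csv.to_csv('test.csv', index=False)

header = open('test.csv').read().splitlines()[0]
print(header)  # Headline,Summary,Link
```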
Answer (score: 1)

Just make another request inside the for loop and extract the text of the content tag on the article page.
import pandas as pd
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.vanglaini.org/').text
soup = BeautifulSoup(source, 'lxml')

list_with_headlines = []
list_with_summaries = []
list_with_links = []

for article in soup.find_all('article'):
    if article.a is None:
        continue
    headline = article.a.text.strip()
    link = "https://www.vanglaini.org" + article.a['href']
    list_with_headlines.append(headline)
    list_with_links.append(link)
    # Fetch the full article page and use its body text as the summary.
    # A separate variable avoids shadowing the front-page soup.
    article_soup = BeautifulSoup(requests.get(link).text, 'lxml')
    list_with_summaries.append(article_soup.select_one(".pagesContent").text.strip())

news_csv = pd.DataFrame({
    'Headline': list_with_headlines,
    'Summary': list_with_summaries,
    'Link': list_with_links,
})

print(news_csv)
news_csv.to_csv('test.csv')
The CSV will look like this.
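To show the `.pagesContent` extraction step without hitting the network, here is a self-contained sketch against a static HTML snippet. The class name `pagesContent` comes from the answer above; the sample markup itself is hypothetical, not the site's real structure:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one article page's HTML; the real pages at
# vanglaini.org are assumed to hold the body in a .pagesContent element.
sample_html = """
<html><body>
  <article>
    <h1>Headline</h1>
    <div class="pagesContent">Full article body text.</div>
  </article>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
content = soup.select_one(".pagesContent")
# Guard against pages where the selector matches nothing.
summary = content.text.strip() if content else ""
print(summary)  # Full article body text.
```

The `if content else ""` guard matters in the real loop too: if any article page lacks a `.pagesContent` element, `select_one` returns `None` and calling `.text` on it would raise an `AttributeError`.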