I am scraping this webpage: https://zh.wikisource.org/wiki/%E8%AE%80%E9%80%9A%E9%91%92%E8%AB%96/%E5%8D%B701
Code:
import requests as r
from bs4 import BeautifulSoup as soup

webpages = ['https://zh.wikisource.org/wiki/%E8%AE%80%E9%80%9A%E9%91%92%E8%AB%96/%E5%8D%B701']
for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    data.encoding = 'utf-8'
    page_soup = soup(data.text, 'html5lib')
    headline = page_soup.find_all(class_='mw-headline')
    for el in headline:
        headline_text = el.get_text()
    p = page_soup.find_all('p')
    for el in p:
        p_text = el.get_text()
    text = headline_text + p_text
    with open(r'sample_srape.txt', 'a', encoding='utf-8') as file:
        file.write(text)
        file.close()
The output txt file only shows the last set of headline_text + p_text in the dataset. It seems that every time new data is retrieved, it overwrites the previous set. How can I stop it from overwriting the previous data, so that every targeted set of data appears?
Answer 0 (score: 2)
You need the 'a' argument so that writes append rather than overwrite.
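As a minimal sketch of the difference between 'w' and 'a' modes (the file name demo.txt is just a placeholder for this demo):

```python
# 'w' truncates the file on every open, so earlier writes are lost.
with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write('first\n')
with open('demo.txt', 'w', encoding='utf-8') as f:
    f.write('second\n')          # 'first' is gone now

# 'a' opens the file positioned at the end, so earlier writes survive.
with open('demo.txt', 'a', encoding='utf-8') as f:
    f.write('third\n')

with open('demo.txt', encoding='utf-8') as f:
    print(f.read())              # → second\nthird\n
```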
I expect your indentation to be different across the two for loops, so that you don't end up using only the last matched item from each. If you are making multiple requests, you can use a Session; re-using the connection is more efficient.
The text written for each headline is then the concatenation of the paragraphs under that headline. Some of the variable naming is also clearer. You don't need close, as that is handled by with. Perhaps something like this:
import requests
from bs4 import BeautifulSoup as soup

webpages = ['https://zh.wikisource.org/wiki/%E8%AE%80%E9%80%9A%E9%91%92%E8%AB%96/%E5%8D%B701']
headers = {'User-Agent': 'Mozilla/5.0'}

with requests.Session() as s:
    for link in webpages:
        data = s.get(link, headers=headers)
        data.encoding = 'utf-8'
        page_soup = soup(data.text, 'html5lib')
        headlines = page_soup.find_all(class_='mw-headline')
        with open(r'sample_scrape.txt', 'a', encoding='utf-8') as file:
            for headline in headlines:
                headline_text = headline.get_text()
                paragraphs = page_soup.find_all('p')
                text = ''
                for paragraph in paragraphs:
                    paragraph_text = paragraph.get_text()
                    text += paragraph_text
                text = headline_text + text
                file.write(text)
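One caveat with the code above: find_all('p') collects every paragraph on the page for every headline, so each headline gets the same full text. A hypothetical variant, sketched here on a toy HTML fragment rather than the live page, walks the siblings of each headline's enclosing heading and keeps only the paragraphs up to the next heading:

```python
from bs4 import BeautifulSoup

# Toy fragment standing in for the real page structure: on Wikisource,
# each section title is a span.mw-headline inside an <h2>.
html = """
<h2><span class="mw-headline">First</span></h2>
<p>alpha</p>
<p>beta</p>
<h2><span class="mw-headline">Second</span></h2>
<p>gamma</p>
"""

page = BeautifulSoup(html, 'html.parser')
sections = {}
for span in page.find_all(class_='mw-headline'):
    title = span.get_text()
    parts = []
    # Walk the elements after the enclosing <h2>, stopping at the next one.
    for sib in span.parent.find_next_siblings():
        if sib.name == 'h2':
            break
        if sib.name == 'p':
            parts.append(sib.get_text())
    sections[title] = ''.join(parts)

print(sections)  # → {'First': 'alphabeta', 'Second': 'gamma'}
```

This keeps each headline paired with its own paragraphs instead of the whole page's text.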