BeautifulSoup4: find_all() overwrites the previous data set instead of showing all targeted data

Asked: 2019-09-11 01:41:38

Tags: python-3.x beautifulsoup

I am scraping this webpage: https://zh.wikisource.org/wiki/%E8%AE%80%E9%80%9A%E9%91%92%E8%AB%96/%E5%8D%B701

Code:

import requests as r
from bs4 import BeautifulSoup as soup

webpages=['https://zh.wikisource.org/wiki/%E8%AE%80%E9%80%9A%E9%91%92%E8%AB%96/%E5%8D%B701']

for item in webpages:
    headers = {'User-Agent': 'Mozilla/5.0'}
    data = r.get(item, headers=headers)
    data.encoding = 'utf-8'
    page_soup = soup(data.text, 'html5lib')
    headline = page_soup.find_all(class_='mw-headline')
    for el in headline:
        headline_text = el.get_text()
    p = page_soup.find_all('p')
    for el in p:
        p_text = el.get_text()
    text = headline_text + p_text
    with open(r'sample_srape.txt', 'a', encoding='utf-8') as file:
        file.write(text)
        file.close()

The output txt file only shows the last set of headline_text + p_text data. It seems that whenever new data is retrieved, it overwrites the previous set. How can I stop it from overwriting the previous data and get every set of targeted data written out?

1 Answer:

Answer 0 (score: 2)

You need `a` as the append mode argument when opening the file.

I expect the indentation of your two for loops is the issue: each iteration rebinds the same variable, so afterwards you are left with only the last match from each loop. Also, if you are making multiple requests you can use a Session; re-using the connection is more efficient.
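The last-match problem can be seen in isolation with a toy list (no scraping involved): assigning inside the loop keeps only the final value, while appending to a list keeps them all.

```python
# Each iteration rebinds the same name, so earlier values are lost.
items = ['a', 'b', 'c']

for el in items:
    last = el          # rebound on every pass; only 'c' survives the loop
print(last)            # -> 'c'

# Collecting into a list (or joining into one string) keeps every item.
collected = []
for el in items:
    collected.append(el)
print(''.join(collected))  # -> 'abc'
```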

Below, the paragraph texts are concatenated under each headline, and some of the variables are given clearer names.

You don't need close, as that is handled by with. Perhaps something like:

import requests
from bs4 import BeautifulSoup as soup

webpages = ['https://zh.wikisource.org/wiki/%E8%AE%80%E9%80%9A%E9%91%92%E8%AB%96/%E5%8D%B701']
headers = {'User-Agent': 'Mozilla/5.0'}

with requests.Session() as s:

    for link in webpages:
        data = s.get(link, headers=headers)
        data.encoding = 'utf-8'
        page_soup = soup(data.text, 'html5lib')
        headlines = page_soup.find_all(class_='mw-headline')

        with open(r'sample_scrape.txt', 'a', encoding='utf-8') as file:

            for headline in headlines:
                headline_text = headline.get_text()
                paragraphs = page_soup.find_all('p')
                text = ''

                for paragraph in paragraphs:
                    paragraph_text = paragraph.get_text()
                    text += paragraph_text

                text = headline_text + text
                file.write(text)
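Note that the code above still writes every paragraph on the page after each headline, because find_all('p') searches the whole document. If the goal is to pair each headline with only the paragraphs of its own section, one option is to walk the headline's following siblings until the next heading. A sketch, using a small inline HTML sample; the h2 > span.mw-headline structure is an assumption based on typical MediaWiki markup and may differ on the live page:

```python
from bs4 import BeautifulSoup

# Toy stand-in for the page: two sections, each an <h2> headline
# followed by its own <p> paragraphs.
html = """
<h2><span class="mw-headline">First</span></h2>
<p>one</p><p>two</p>
<h2><span class="mw-headline">Second</span></h2>
<p>three</p>
"""
page_soup = BeautifulSoup(html, 'html.parser')

sections = {}
for span in page_soup.find_all(class_='mw-headline'):
    heading = span.parent              # the enclosing <h2>
    paragraphs = []
    for sibling in heading.find_next_siblings():
        if sibling.name == 'h2':       # stop at the next section heading
            break
        if sibling.name == 'p':
            paragraphs.append(sibling.get_text())
    sections[span.get_text()] = paragraphs

print(sections)  # -> {'First': ['one', 'two'], 'Second': ['three']}
```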