Question

I am using BeautifulSoup to scrape the body text of several news articles. I found that the body of each article is structured like this:

<div class="article-body resizeFont">
    <p style="font-size: 100%;">
    <p style="font-size: 100%;">
    <p style="font-size: 100%;">
    ....

I wrote this code to first grab all the paragraphs, then combine the paragraphs of each article, and finally put each article into the list `myarticle`.

            for pagelink in pagelinks:
                #get page text
                page = requests.get(pagelink)
                #parse with BeautifulSoup
                soup = bs(page.text, 'lxml')
                containerr = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
                articletext = containerr.find_all('p')
                thearticle = [] # clear from the previous loop
                paragraphtext = [] # clear from the previous loop
                for paragraph in articletext:
                    text = paragraph.get_text()
                    paragraphtext.append(text)
                # put paragraphs into a single article, and put all the articles into a list
                myarticle.append(thearticle.append(paragraphtext))
                print(myarticle)

The output is wrong. It prints:

[None]
[None, None]
[None, None, None]
[None, None, None, None]
[None, None, None, None, None]
[None, None, None, None, None, None]
[None, None, None, None, None, None, None]
[None, None, None, None, None, None, None, None]
[None, None, None, None, None, None, None, None, None]
[None, None, None, None, None, None, None, None, None, None]

Which step of my code is going wrong? (In case it helps, one of the articles I want to scrape is https://www.startribune.com/asian-stocks-mixed-amid-china-tension-with-us-australia/570651732/?refresh=true)

Any suggestions would be greatly appreciated!
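
A note on the likely cause, as a minimal sketch rather than a definitive fix: `list.append` modifies the list in place and returns `None`, so `myarticle.append(thearticle.append(paragraphtext))` appends that `None` on every iteration. The example below assumes the goal is one string per article, and it uses a hypothetical `pages` list in place of the strings that `paragraph.get_text()` would return, so it runs without any network access.

```python
# Hypothetical stand-in for the paragraph texts of each page; in the real
# loop these strings would come from paragraph.get_text().
pages = [
    ["First paragraph of article one.", "Second paragraph of article one."],
    ["Only paragraph of article two."],
]

# list.append mutates the list in place and returns None, so passing its
# return value to another append stores None.
thearticle = []
result = thearticle.append(["some", "paragraphs"])
print(result)  # None

myarticle = []
for paragraphs in pages:
    paragraphtext = []  # reset for each article
    for text in paragraphs:
        paragraphtext.append(text)
    # Merge the paragraphs into a single string, then append that string.
    thearticle = "\n".join(paragraphtext)
    myarticle.append(thearticle)

print(myarticle)
```

In the original loop, the same idea would mean replacing the last `append` line with a `"\n".join(paragraphtext)` followed by `myarticle.append(...)` on the joined string. If a list of paragraphs per article is preferred over a single string, appending `paragraphtext` directly should also work.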

Using BeautifulSoup to find paragraphs and merge them

0 answers: