I am using BeautifulSoup to scrape the body text of several news articles. I found that the body of each article is organized like this:
<div class="article-body resizeFont">
<p style="font-size: 100%;">
<p style="font-size: 100%;">
<p style="font-size: 100%;">
....
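I wrote the code below to first scrape all the paragraphs, then combine the paragraphs into one article, and then put each article into the list "myarticle".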
for pagelink in pagelinks:
    #get page text
    page = requests.get(pagelink)
    #parse with BeautifulSoup
    soup = bs(page.text, 'lxml')
    containerr = soup.find("div", class_=['entry-content', 'entry-content-read-more'])
    articletext = containerr.find_all('p')
    thearticle = [] # clear from the previous loop
    paragraphtext = [] # clear from the previous loop
    for paragraph in articletext:
        text = paragraph.get_text()
        paragraphtext.append(text)
    # put paragraphs into a single article, and put all the articles into a list
    myarticle.append(thearticle.append(paragraphtext))
    print(myarticle)
The output is wrong. It returns:
[None]
[None, None]
[None, None, None]
[None, None, None, None]
[None, None, None, None, None]
[None, None, None, None, None, None]
[None, None, None, None, None, None, None]
[None, None, None, None, None, None, None, None]
[None, None, None, None, None, None, None, None, None]
[None, None, None, None, None, None, None, None, None, None]
Which step of my code went wrong? (In case it helps, one of the articles I want to scrape is https://www.startribune.com/asian-stocks-mixed-amid-china-tension-with-us-australia/570651732/?refresh=true)
Any suggestions would be greatly appreciated!
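A likely cause, sketched without any network access: `list.append` changes the list in place and returns `None`, so `myarticle.append(thearticle.append(paragraphtext))` stores `None` on every pass. The snippet below reproduces this with made-up paragraph strings (`scraped_pages` is a hypothetical stand-in for what `find_all('p')` and `get_text()` would yield) and shows one possible fix, joining the paragraphs into a single string per article. It is only a sketch and assumes one string per article is the desired result.

```python
# Hypothetical stand-in for the paragraph texts scraped from two articles.
scraped_pages = [
    ["First paragraph of article A.", "Second paragraph of article A."],
    ["Only paragraph of article B."],
]

# Reproduces the problem: list.append mutates the list and returns None,
# so the outer append stores None instead of the article.
buggy = []
for paragraphs in scraped_pages:
    thearticle = []
    buggy.append(thearticle.append(paragraphs))

# One possible fix: build the article first, then append the article itself.
myarticle = []
for paragraphs in scraped_pages:
    thearticle = "\n".join(paragraphs)  # assumption: one string per article
    myarticle.append(thearticle)

print(buggy)      # [None, None]
print(myarticle)
```

In the original loop the same idea would be to build the article from `paragraphtext` on its own line and then call `myarticle.append(...)` with that value. If a list of paragraphs per article is preferred over a joined string, `myarticle.append(paragraphtext)` would do.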