How do I fix my Python code for parsing article body text from a website?

Asked: 2019-05-21 23:02:29

Tags: python html parsing beautifulsoup

I am trying to write a program that parses the body text of every article on each page of this news site's archive. Originally my program stopped at line 32: I printed each link and saved it to a CSV file, and that worked. Now I want to open each link and save the article's body text to the CSV file. I tried to reuse the same code pattern I used with BeautifulSoup before, but now my code prints nothing at all. I don't know whether my problem is in how I am using BeautifulSoup or in how I am referencing the tags from the site's HTML. Here is the archive site: https://www.politico.com/newsletters/playbook/archive (it has 408 pages).

import csv

from bs4 import BeautifulSoup
from urllib.request import urlopen

csvFile = 'C:/Users/k/Dropbox/Politico/pol.csv'
with open(csvFile, mode='w') as pol:
    csvwriter = csv.writer(pol, delimiter='|', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    #for each page on Politico archive
    for p in range(0,409):
        url = urlopen("https://www.politico.com/newsletters/playbook/archive/%d" % p)
        content = url.read()

        #Parse article links from page
        soup = BeautifulSoup(content,"lxml")
        articleLinks = soup.findAll('article', attrs={'class':'story-frag format-l'})

        #Each article link on page
        for article in articleLinks:
            link = article.find('a', attrs={'target':'_top'}).get('href')

            #Open and read each article link
            articleURL = urlopen(link)
            articleContent = articleURL.read()

            #Parse body text from article page
            soupArticle = BeautifulSoup(articleContent, "lxml")

            #Limits to div class = story-text tag (where article text is)
            articleText = soup.findAll('div', attrs={'class':'story-text'})
            for div in articleText:
                #Limits to b tag (where the body text seems so exclusively be)
                bodyText = div.find('b')
                print(bodyText)

                #Adds article link to csv file
                csvwriter.writerow([bodyText]) 

I expected the output to print the body text of every article in the archive and save all of it to one CSV file.

1 Answer:

Answer 0 (score: 0)

It prints nothing because you are looking in the wrong place in articleText = soup.findAll('div', attrs={'class':'story-text'})

You stored the parsed article as soupArticle, not soup.

Also, do you want the text or the HTML element? As written, you get back the tag/element. If you only want the text, you need bodyText = div.find('b').text
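The difference matters when you write rows to CSV: `find('b')` returns a `Tag` object (which serializes with the angle brackets), while `.text` gives you just the string. A minimal sketch with a made-up HTML snippet, using the stdlib `html.parser` backend so no lxml install is needed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one article's story-text div
html = '<div class="story-text"><b>Body text here.</b></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find('div', attrs={'class': 'story-text'})
print(div.find('b'))        # the Tag: <b>Body text here.</b>
print(div.find('b').text)   # just the string: Body text here.
```

Writing the `Tag` to the CSV would store `<b>...</b>` markup in the file; `.text` stores only the readable body.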

But the main issue is that you want to change:

articleText = soup.findAll('div', attrs={'class':'story-text'})

to:

articleText = soupArticle.findAll('div', attrs={'class':'story-text'})

To get the full article you have to iterate over the p tags, and work out how to skip the parts you don't want. There is probably a better way, but to get you started, something like this:

for article in articleLinks:
    link = article.find('a', attrs={'target':'_top'}).get('href')

    articleURL = urlopen(link)
    articleContent = articleURL.read()

    soupArticle = BeautifulSoup(articleContent, "lxml")
    articleText = soupArticle.findAll('div', attrs={'class':'story-text'})

    for div in articleText:
        bodyText = div.find_all('p')
        for para in bodyText:
            if 'By ' in para.text:
                continue
            print(para.text.strip())
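To then save each article's body as a single CSV row, the kept paragraphs can be joined into one string before calling writerow. A minimal sketch with hypothetical paragraph strings standing in for the `para.text.strip()` values, writing to an in-memory buffer with the same delimiter settings the question uses:

```python
import csv
import io

# Hypothetical paragraphs, standing in for para.text.strip() values
paragraphs = ["By JOHN DOE", "First paragraph.", "Second paragraph."]

# Skip byline-style paragraphs, then join the rest into one CSV cell
body = ' '.join(p for p in paragraphs if 'By ' not in p)

buf = io.StringIO()  # stands in for the open file handle `pol`
writer = csv.writer(buf, delimiter='|', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow([body])

print(buf.getvalue().strip())
```

Joining first means one row per article rather than one row per paragraph, which is closer to what the question describes wanting.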