Excluding an HTML element (recursive=False has no effect)

Date: 2019-06-26 10:10:37

Tags: python web-scraping beautifulsoup

The idea is to check the last three pages of a German medical news site. Each page contains five links to individual articles. The program checks whether each "href" already exists in data.csv. If not, it adds the "href" to data.csv, follows the link, and saves the article content to an .html file.

The content of each article page looks like this:

<html>
..
..
<div class="newstext">
 <p> article-piece 1</p>
 <p> article-piece 2</p>
 <p> article-piece 3</p>
 <div class="URLkastenWrapper">
  <div class="newsKasten URLkasten newsKastenLinks">
   <p> not wanted stuff</p>
  </div>
 </div>
 <p> article-piece 4</p>
 <p> article-piece 5</p>
</div>

I want to save the article pieces to HTML and exclude the "not wanted stuff".

I tried using recursive=False, as shown in my code. As far as my research goes, that should be the way to achieve this, right?

But for some reason it doesn't work :(

import requests
from bs4 import BeautifulSoup
import mechanicalsoup

# the first 3 news pages to check; each of them contains 5 articles
scan_med_news = ['https://www.aerzteblatt.de/nachrichten/Medizin?page=1', 'https://www.aerzteblatt.de/nachrichten/Medizin?page=2', 'https://www.aerzteblatt.de/nachrichten/Medizin?page=3']

# This function is meant to create an HTML file with the article pieces from the website.
def article_html_create(title, url):
    with open(title+'.html', 'a+') as article:
        article.write('<h1>'+title+'</h1>\n\n')
        subpage = BeautifulSoup(requests.get(url).text, 'html5lib')
        for line in subpage.select('.newstext p', recursive=False):
            # this recursive=False is not working as I wish
            article.write(line.text+'<br><br>')

# this piece of code reads the URLs of already saved articles from data.csv and puts them into a list
contentlist = []
with open('data.csv', "r") as file:
    for line in file:
        for item in line.strip().split(','):
            contentlist.append(item)

# for every article on these pages, check whether the URL is already in contentlist, which was built from data.csv
with open('data.csv', 'a') as file:
    for page in scan_med_news:
        doc = requests.get(page)
        doc.encoding = 'utf-8'
        soup = BeautifulSoup(doc.text, 'html5lib')
        for h2 in soup.find_all('h2'):
            for a in h2.find_all('a'):
                if a['href'] in contentlist:
                    # if the url is already in the list, it prints "Already existing"
                    print('Already existing')
                else:
                    # if the URL is not in the list, add it to data.csv and call article_html_create to save the article content
                    file.write(a['href']+',')
                    article_html_create(a.text, 'https://www.aerzteblatt.de'+a['href'])
                    print('Added to the file!')

2 Answers:

Answer 0 (score: 0)

Try this and see if it works. Just change:

for line in subpage.select('.newstext p', recursive=False):
    # this recursive=False is not working as I wish
    article.write(line.text+'<br><br>')

to:

for line in subpage.select('.newstext > p'):
    article.write(line.text+'<br><br>')

My output (using the HTML snippet above and print instead of article.write) was:

article-piece 1
article-piece 2
article-piece 3
article-piece 4
article-piece 5
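
For reference, recursive is an argument of find() / find_all(), not of select(); depending on the BeautifulSoup version the keyword is either silently ignored or rejected. The CSS child combinator > is the selector-level equivalent: it matches only p tags that are direct children of .newstext, so the <p> nested inside .URLkastenWrapper is skipped. A minimal sketch of the function from the question with that single change (same names and libraries as in the question):

import requests
from bs4 import BeautifulSoup

def article_html_create(title, url):
    # identical to the question's function, only the selector changed:
    # '.newstext > p' keeps direct children of .newstext and skips the
    # <p> nested inside .URLkastenWrapper
    with open(title+'.html', 'a+') as article:
        article.write('<h1>'+title+'</h1>\n\n')
        subpage = BeautifulSoup(requests.get(url).text, 'html5lib')
        for line in subpage.select('.newstext > p'):
            article.write(line.text+'<br><br>')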

Answer 1 (score: 0)

You can select the div that wraps the unwanted p node and set its string attribute to an empty string; that removes the div's children from the soup. Then you can do your regular selection.

Example:

In [17]: soup = BeautifulSoup(html, 'lxml')

In [18]: soup
Out[18]: 
<html><body><div class="newstext">
<p> article-piece 1</p>
<p> article-piece 2</p>
<p> article-piece 3</p>
<div class="URLkastenWrapper">
<div class="newsKasten URLkasten newsKastenLinks">
<p> not wanted stuff</p>
</div>
</div>
<p> article-piece 4</p>
<p> article-piece 5</p>
</div></body></html>

In [19]: soup.select_one('.URLkastenWrapper').string = ''

In [20]: soup
Out[20]: 
<html><body><div class="newstext">
<p> article-piece 1</p>
<p> article-piece 2</p>
<p> article-piece 3</p>
<div class="URLkastenWrapper"></div>
<p> article-piece 4</p>
<p> article-piece 5</p>
</div></body></html>

In [21]: soup.select('.newstext p')
Out[21]: 
[<p> article-piece 1</p>,
 <p> article-piece 2</p>,
 <p> article-piece 3</p>,
 <p> article-piece 4</p>,
 <p> article-piece 5</p>]
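
Applied to the function from the question, the same trick could look roughly like the sketch below (an adaptation for illustration, not the answerer's exact code): blank out the .URLkastenWrapper node first, then keep the original '.newstext p' selector.

import requests
from bs4 import BeautifulSoup

def article_html_create(title, url):
    with open(title+'.html', 'a+') as article:
        article.write('<h1>'+title+'</h1>\n\n')
        subpage = BeautifulSoup(requests.get(url).text, 'html5lib')
        # setting .string replaces the tag's contents, so the nested
        # "not wanted stuff" <p> disappears from the soup
        wrapper = subpage.select_one('.URLkastenWrapper')
        if wrapper is not None:
            wrapper.string = ''
        for line in subpage.select('.newstext p'):
            article.write(line.text+'<br><br>')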