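BeautifulSoup: parsing HTML that contains line breaks

Date: 2018-01-16 19:19:22

Tags: python beautifulsoup

I am using BeautifulSoup to parse some HTML from a text file. The text is read into a dictionary like this: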

import codecs

websites = ["1"]

html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        # splitlines() stores the file as a list of lines, not as one string
        get_raw_html = out.read().splitlines()
        html_dict.update({website_id: get_raw_html})

I then parse the HTML stored in html_dict to find the text inside <p> tags:

from bs4 import BeautifulSoup

scraped = {}

for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    # raw_html is a list here, so each i is a single line of the file
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all('p')

This is the HTML in html_dict:

<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>

The problem is that BeautifulSoup seems to stop at the line break and ignore the second line. So when I print scrape_selected_tags, the output is:

<p>Hey, this should be scraped</p>

whereas I expected the whole paragraph.

How can I avoid this? I tried splitting the lines in html_dict, but that does not seem to work. Thanks in advance.

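1 answer:

Answer 0: (score: 1)

By calling splitlines when you read the HTML document, you break the tags apart across a list of strings. Instead, you should read all of the HTML into a single string.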

import codecs

websites = ["1"]
html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        # read() without splitlines() keeps the whole document as one string
        get_raw_html = out.read()
        html_dict.update({website_id: get_raw_html})
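The difference between the two ways of reading can be checked without the original file. The following is a minimal sketch, assuming a temporary directory that mirrors the question's path layout and a sample string standing in for raw_html.txt (both are made up for illustration). It uses io.open, which takes the same encoding argument as the codecs.open call above.

```python
import io
import os
import tempfile

# Hypothetical sample standing in for the contents of raw_html.txt.
sample = u"<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>"

# Build a throwaway "1/final_output/raw_html.txt" under a temp directory.
base_dir = tempfile.mkdtemp()
out_dir = os.path.join(base_dir, "1", "final_output")
os.makedirs(out_dir)
path = os.path.join(out_dir, "raw_html.txt")

with io.open(path, "w", encoding="utf8") as f:
    f.write(sample)

with io.open(path, encoding="utf8") as f:
    whole = f.read()               # one string, newline kept inside the <p> tag

with io.open(path, encoding="utf8") as f:
    lines = f.read().splitlines()  # list of two fragments

print(type(whole).__name__)
print(type(lines).__name__, len(lines))
```

Only the first form hands BeautifulSoup a string in which the opening and closing tags are still together.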

Then remove the inner for loop, so that you do not iterate over that string.

from bs4 import BeautifulSoup

scraped = {}

for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    # parse the whole document once instead of line by line
    soup = BeautifulSoup(raw_html, 'html.parser')
    scrape_selected_tags = soup.find_all('p')
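The loop above finds the tags but leaves the lists in scraped empty. One possible way to fill them is sketched below. The helper name extract_paragraphs and the inline sample in html_dict are assumptions made for illustration and are not part of the original answer. Only find_all and get_text from BeautifulSoup are used.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(raw_html):
    """Return the text of every <p> element in raw_html (hypothetical helper)."""
    soup = BeautifulSoup(raw_html, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]


# Inline sample standing in for the contents read from the file.
html_dict = {
    "1": "<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>"
}

scraped = {
    website_id: extract_paragraphs(raw_html)
    for website_id, raw_html in html_dict.items()
}

print(scraped)
```

The newline stays inside the extracted text. If a single-line result is preferred, something like " ".join(text.split()) would collapse the whitespace.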

BeautifulSoup can handle line breaks inside a tag. Let me give you an example:

from bs4 import BeautifulSoup

html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('p'))
  

[<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>]

But if you split a single tag across multiple BeautifulSoup objects:

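from bs4 import BeautifulSoup

html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''

# each line is parsed on its own, so neither fragment holds a complete tag
for line in html.splitlines():
    soup = BeautifulSoup(line, 'html.parser')
    print(soup.find_all('p'))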
  

[<p>Hey, this should be scraped</p>]
[]
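The two results are consistent with what each fragment contains. The first line has an opening <p> with no closing tag, and the parser appears to close it on its own. The second line has only a closing </p>, so there is no paragraph element to find. A rough sketch of that check using plain string methods follows. The helper name tag_balance is made up for illustration, and the counting is deliberately crude and is not an HTML parser.

```python
html = """<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>"""


def tag_balance(fragment):
    """Count opening and closing <p> tags in a fragment (illustration only)."""
    return fragment.count("<p>"), fragment.count("</p>")


fragments = html.splitlines()

for fragment in fragments:
    # Neither fragment has both an opening and a closing tag.
    print(tag_balance(fragment), repr(fragment))

# The unsplit string has one of each.
print(tag_balance(html))
```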