Question

I'm using BeautifulSoup to parse some HTML from a text file. The text is read into a dictionary like this:

import codecs

websites = ["1"]

html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:   
        get_raw_html = out.read().splitlines()
        html_dict.update({website_id:get_raw_html})

I then parse the HTML from html_dict to find the text inside <p> tags:

from bs4 import BeautifulSoup

scraped = {}

for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all('p')

This is the HTML in html_dict:

<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>

The problem is that BeautifulSoup seems to treat the line break as significant and ignores the second line. So when I print scrape_selected_tags, the output is...

<p>Hey, this should be scraped</p>

when I expected the whole text.

How can I avoid this? I tried splitting the lines in html_dict, and that doesn't seem to work. Thanks in advance.

Answer 1

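By calling splitlines when you read the HTML document, you break the tags apart across a list of strings. Instead, you should read all of the HTML into a single string.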

import codecs

websites = ["1"]
html_dict = {}

for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        # Read the whole file as one string; do not call splitlines() here.
        html_dict[website_id] = out.read()
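The difference between the two ways of reading can be checked without BeautifulSoup at all. The following is a minimal sketch, assuming Python 3, where the built-in open() accepts an encoding argument; the temporary file and the sample string are stand-ins for the real raw_html.txt, not part of the original code:

```python
import os
import tempfile

# Hypothetical sample standing in for the contents of raw_html.txt.
sample = "<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "raw_html.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write(sample)

    # read() returns one string with the markup intact.
    with open(path, encoding="utf8") as f:
        whole = f.read()

    # read().splitlines() returns a list, cutting the <p> tag in two.
    with open(path, encoding="utf8") as f:
        pieces = f.read().splitlines()

print(type(whole).__name__, len(pieces))
```

Here whole is a single str, while pieces holds two separate strings, neither of which is a complete <p> element.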

Then remove the inner for loop, so that you don't iterate over that string.

from bs4 import BeautifulSoup

scraped = {}

for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    # Parse the complete document once instead of line by line.
    soup = BeautifulSoup(raw_html, 'html.parser')
    scrape_selected_tags = soup.find_all('p')
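Note that the loop above leaves scraped[website_id] as an empty list. A minimal sketch of filling it, assuming the goal is the text of every <p> tag; the html_dict literal is a hypothetical stand-in for what out.read() would return:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the file contents of website "1".
html_dict = {
    "1": "<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>"
}

scraped = {}
for website_id, raw_html in html_dict.items():
    # Parse the whole document once so tags spanning several lines stay intact.
    soup = BeautifulSoup(raw_html, "html.parser")
    # Keep the text of each <p> tag, newlines included.
    scraped[website_id] = [p.get_text() for p in soup.find_all("p")]

print(scraped)
```

Storing the extracted text rather than the Tag objects is a design choice for this sketch; keep soup.find_all("p") itself if you need the tags later.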

BeautifulSoup can handle newlines inside a tag. Here is an example:

html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('p'))

[<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>]
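The newline stays in the parsed text. If you would rather have the paragraph on a single line, one option (a sketch, not part of the original answer) is to collapse the whitespace after extracting the text with get_text():

```python
from bs4 import BeautifulSoup

html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''

soup = BeautifulSoup(html, 'html.parser')
# str.split() with no arguments splits on any run of whitespace, including "\n".
one_line = [" ".join(p.get_text().split()) for p in soup.find_all('p')]
print(one_line)
```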

But if you split a single tag across multiple BeautifulSoup objects:

html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''

for line in html.splitlines():
    soup = BeautifulSoup(line, 'html.parser')
    print(soup.find_all('p'))

[<p>Hey, this should be scraped</p>]
[]
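If the lines have already been split (for example, because html_dict was built with splitlines elsewhere), one workaround is to join them back into a single string before parsing. A minimal sketch, where the lines list is a hypothetical copy of what the question's code stored:

```python
from bs4 import BeautifulSoup

# Hypothetical: the list of lines as the question's code stored it.
lines = [
    "<p>Hey, this should be scraped",
    "but this part gets ignored for some reason.</p>",
]

# Rejoin with the newline that splitlines() removed, then parse once.
soup = BeautifulSoup("\n".join(lines), "html.parser")
paragraphs = soup.find_all("p")
print(len(paragraphs))
print(paragraphs[0].get_text())
```

This yields a single <p> element containing both lines of text.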

BeautifulSoup parsing HTML line breaks
