I'm using BeautifulSoup to parse some HTML from a text file. The text is read into a dictionary like this:
import codecs

websites = ["1"]
html_dict = {}
for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        get_raw_html = out.read().splitlines()
    html_dict.update({website_id: get_raw_html})
I then parse the HTML from html_dict to find the text inside <p> tags:
from bs4 import BeautifulSoup

scraped = {}
for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    for i in raw_html:
        soup = BeautifulSoup(i, 'html.parser')
        scrape_selected_tags = soup.find_all('p')
This is the HTML in html_dict:
<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>
The problem is that BeautifulSoup seems to treat the line break as a boundary and ignores the second line. So when I print scrape_selected_tags, the output is:
<p>Hey, this should be scraped</p>
when I was expecting the whole text.
How can I avoid this? I tried splitting the lines in html_dict, but that doesn't seem to work. Thanks in advance.
Answer 0 (score: 1)
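By calling splitlines when you read the HTML document, you break the tags apart across a list of strings. Instead, you should read all of the HTML into a single string.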
websites = ["1"]
html_dict = {}
for website_id in websites:
    with codecs.open("{}/final_output/raw_html.txt".format(website_id), encoding="utf8") as out:
        get_raw_html = out.read()
    html_dict.update({website_id: get_raw_html})
Then remove the inner for loop, so that you don't iterate over that string.
scraped = {}
for website_id in html_dict.keys():
    scraped[website_id] = []
    raw_html = html_dict[website_id]
    soup = BeautifulSoup(raw_html, 'html.parser')
    scrape_selected_tags = soup.find_all('p')
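The two steps above can be combined into one small helper. This is only a minimal sketch, assuming Python 3 and the bs4 package; the function name scrape_paragraphs and the temporary demo file are made up for illustration and are not part of the original code.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def scrape_paragraphs(path):
    """Return the text of every <p> tag found in the file at `path`."""
    # Read the whole file as ONE string; no splitlines() here.
    with open(path, encoding="utf8") as fh:
        raw_html = fh.read()
    soup = BeautifulSoup(raw_html, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]


# Demo with a throwaway file standing in for raw_html.txt (hypothetical path).
sample = "<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "raw_html.txt")
    with open(demo_path, "w", encoding="utf8") as fh:
        fh.write(sample)
    paragraphs = scrape_paragraphs(demo_path)

print(paragraphs)
```

Calling get_text() returns plain strings rather than Tag objects, which may be more convenient if the results are meant to be stored in the scraped dictionary.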
BeautifulSoup can handle line breaks inside a tag. Let me show you an example:
html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('p'))
[<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>]
But if you split a single tag across multiple BeautifulSoup objects, each fragment is parsed in isolation:
html = '''<p>Hey, this should be scraped
but this part gets ignored for some reason.</p>'''
for line in html.splitlines():
    soup = BeautifulSoup(line, 'html.parser')
    print(soup.find_all('p'))
[<p>Hey, this should be scraped</p>]
[]
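To see why the two fragments behave differently, it can help to look at the lines without any parser involved. The sketch below uses only plain string operations; the names tag_report and rejoined are illustrative. The first fragment contains only the opening tag, which the parser apparently closes on its own at the end of the input, while the second contains only a closing tag with nothing to close, so no <p> element comes out of it.

```python
html = "<p>Hey, this should be scraped\nbut this part gets ignored for some reason.</p>"
lines = html.splitlines()

# For each fragment, record whether it holds the opening and the closing tag.
tag_report = [("<p>" in line, "</p>" in line) for line in lines]
print(tag_report)

# Joining the fragments with newlines restores the original markup,
# which can then be parsed as a single document.
rejoined = "\n".join(lines)
print(rejoined == html)
```

If the data already exists as a list of lines, joining it back into one string like this before handing it to BeautifulSoup should give the same result as reading the file in one piece.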