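When parsing an HTML document with BeautifulSoup, the markup sometimes implies a line break even though the source contains no newline character, for example: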
<div style="line-height:120%;text-align:left;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"><br></font></div><div style="line-height:120%;text-align:left;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">
As a result, when I extract the text, I lose a line break:
import sys
from bs4 import BeautifulSoup

# Excerpt from inside a function that receives fname; clean_str is not shown here.
page = open(fname)
try:
    soup = BeautifulSoup(page, 'html.parser')
except Exception:
    sys.exit("cannot parse %s" % fname)
soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))
for script in soup(["script", "style"]):
    script.extract()  # rip it out
if not soup.body:
    return
text = soup.body.get_text(separator=' ')
lines = (clean_str(line) for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
text = '\n'.join(chunk for chunk in chunks if chunk)
Is there a tweak I can add so that the text is split into lines correctly?
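One possible tweak, sketched here under some assumptions, is to make the implied line breaks explicit before calling get_text: replace each <br> with a newline string and append a newline to every block-level element. The function name html_to_lines and the BLOCK_TAGS list are hypothetical choices for illustration, and the whitespace normalization stands in for clean_str, which is not shown in the question.

```python
from bs4 import BeautifulSoup

# Assumption: tags that should end a line. Adjust the list to fit your documents.
BLOCK_TAGS = ["div", "p", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"]


def html_to_lines(html):
    """Return the visible text of html, one line per <br> or block element.

    html_to_lines is a hypothetical helper name, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Drop content that is never displayed as text.
    for tag in soup(["script", "style"]):
        tag.extract()

    # Fall back to the whole document when there is no <body>.
    root = soup.body or soup

    # Turn explicit breaks into real newline characters.
    for br in root.find_all("br"):
        br.replace_with("\n")

    # End each block-level element with a newline so adjacent blocks
    # do not run together on a single line.
    for block in root.find_all(BLOCK_TAGS):
        block.append("\n")

    text = root.get_text(separator=" ").replace("\xa0", " ")

    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    sample = (
        "<body><div><font>First</font></div>"
        "<div><font><br></font></div>"
        '<div><font style="font-weight:bold;">Second</font> line</div></body>'
    )
    print(html_to_lines(sample))
```

With this approach the text from each div ends up on its own line even when the HTML source itself is a single line. Empty lines are dropped, as in the original code; keep them if blank lines between paragraphs matter.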