Parsing new lines with BeautifulSoup

Time: 2019-05-22 20:30:45

Tags: python html beautifulsoup

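When I parse an HTML document with BeautifulSoup, the markup sometimes starts a new line visually without any newline character in the source, for example: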

<div style="line-height:120%;text-align:left;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"><br></font></div><div style="line-height:120%;text-align:left;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">

As a result, when I extract the text, I lose a line break:

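import sys
from bs4 import BeautifulSoup

# The excerpt contains a bare `return`, so it is shown inside a function here.
# The name extract_text is illustrative; clean_str is my own helper (not shown).
def extract_text(fname):
    with open(fname) as page:
        try:
            soup = BeautifulSoup(page, 'html.parser')
        except Exception:
            sys.exit("cannot parse %s" % fname)
    # prettify() returns a string; the result is discarded, so soup is unchanged.
    soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))

    for script in soup(["script", "style"]):
        script.extract()    # remove script and style elements
    if not soup.body:
        return
    # Every text node is joined with a single space, including across <div> and <br>.
    text = soup.body.get_text(separator=' ')
    lines = (clean_str(line) for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text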

Is there a tweak I can add so that the text is split into lines correctly?
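
One possible tweak, offered as a minimal sketch rather than a confirmed fix: get_text(separator=' ') joins every text node with a space, so the boundary between two <div> blocks never becomes a newline. The sketch below replaces each <br> with a newline and appends a newline to each block-level element before extracting the text. The function name html_to_lines and the BLOCK_TAGS list are assumptions made for illustration; they are not part of the original code or of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Assumed list of tags that should end a line; adjust it to the documents at hand.
BLOCK_TAGS = ["div", "p", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"]


def html_to_lines(html):
    """Return the visible text of an HTML string as a list of non-empty lines."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop script and style elements, as in the original code.
    for tag in soup(["script", "style"]):
        tag.extract()

    # Turn explicit line breaks into newline characters.
    for br in soup.find_all("br"):
        br.replace_with("\n")

    # Make the end of every block-level element end a line as well.
    for block in soup.find_all(BLOCK_TAGS):
        block.append("\n")

    # Fall back to the whole document when there is no <body>.
    root = soup.body or soup

    # No separator: inline tags such as <font> or <b> must not split a line.
    text = root.get_text()

    # str.split() with no arguments treats U+00A0 as whitespace too,
    # so this collapses runs of spaces and non-breaking spaces.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


if __name__ == "__main__":
    sample = (
        "<body><div><font>First line</font></div>"
        "<div><font><br></font></div>"
        "<div><font>Second&nbsp;line</font></div></body>"
    )
    print("\n".join(html_to_lines(sample)))
```

Empty lines are filtered out at the end, so nested blocks that emit several newlines in a row do not produce blank entries. If blank lines carry meaning in the documents being parsed, that final filter would need to change.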

0 answers:

No answers yet