Question

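When parsing HTML documents with BeautifulSoup, line breaks are sometimes produced by the HTML markup itself rather than by newline characters in the source, for example: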

<div style="line-height:120%;text-align:left;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"><br></font></div><div style="line-height:120%;text-align:left;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">

As a result, when I extract the text, the line break is lost:

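import sys
from bs4 import BeautifulSoup

# clean_str is my own helper and is not shown here.
def extract_text(fname):  # wrapper name is illustrative; only the body was posted
    with open(fname) as page:
        try:
            soup = BeautifulSoup(page, 'html.parser')
        except Exception:
            sys.exit("cannot parse %s" % fname)
    # prettify() returns a string and does not modify the tree,
    # so with the result discarded this call has no effect.
    soup.prettify(formatter=lambda s: s.replace(u'\xa0', ' '))

    for script in soup(["script", "style"]):
        script.extract()    # rip it out
    if not soup.body:
        return
    # Text nodes are joined with a single space, so adjacent <div>s share a line.
    text = soup.body.get_text(separator=' ')
    lines = (clean_str(line) for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text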

Is there a tweak I can add so that the text is split into lines correctly?
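
One possible tweak, offered as a rough sketch rather than a confirmed fix: before calling get_text(), turn every <br> into a newline text node and append a newline to each block-level element, so that the breaks implied by the markup survive extraction. The function name html_to_lines and the BLOCK_TAGS list are assumptions made for this illustration; the list is not exhaustive and would probably need adjusting for real documents.

```python
from bs4 import BeautifulSoup

# Hypothetical, non-exhaustive list of tags treated as line-breaking blocks.
BLOCK_TAGS = ["div", "p", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"]


def html_to_lines(html):
    """Return the visible text of `html` as a list of non-empty lines."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements so their contents are not extracted.
    for tag in soup(["script", "style"]):
        tag.extract()

    # Replace each <br> with an explicit newline text node.
    for br in soup.find_all("br"):
        br.replace_with("\n")

    # Append a newline to the end of every block-level element.
    for block in soup.find_all(BLOCK_TAGS):
        block.append("\n")

    # Fall back to the whole document when there is no <body>.
    root = soup.body if soup.body is not None else soup

    # No separator here: the only breaks are the newlines inserted above
    # plus any newlines already present in the source text.
    text = root.get_text().replace("\xa0", " ")

    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


if __name__ == "__main__":
    sample = "<body><div>First</div><div>Second<br>Third</div></body>"
    print("\n".join(html_to_lines(sample)))
```

Because empty lines are dropped at the end, a <div> that contains only a <br> acts as a separator but does not show up as a blank line. If blank lines matter, the final filter would have to be relaxed.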

Parsing new lines with BeautifulSoup

0 answers: