Question

I am trying to convert an online HTML page to text.

I have a problem with this structure:

<div align="justify"><b>Available in  
<a href="http://www.example.com.be/book.php?number=1">
French</a> and 
<a href="http://www.example.com.be/book.php?number=5">
English</a>.
</div>

Here is its representation as a Python string:

'<div align="justify"><b>Available in  \r\n<a href="http://www.example.com.be/book.php?number=1">\r\nFrench</a> and \r\n<a href="http://www.example.com.be/book.php?number=5">\r\nEnglish</a>.\r\n</div>'

When I use:

html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
txt = para.text

BeautifulSoup converts it (in the 'txt' variable) to:

u'Available inFrenchandEnglish.'

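It probably strips every line of the original HTML string, so the whitespace that separated the words is lost.

Do you have a clean solution to this problem?

Thanks.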

Answer 1

I finally found a good solution:

import re

def clean_line(line):
    # Remove CR/LF characters, then collapse runs of two or more spaces into one.
    return re.sub(r'[ ]{2,}', ' ', re.sub(r'[\r\n]', '', line))

html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
''.join([clean_line(line) for line in para.findAll(text=True)])

Which outputs:

u'Available in French and English.  '
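To see why removing only the line breaks helps, here is a minimal sketch that runs without BeautifulSoup. The `text_nodes` list is an assumption: it is written out by hand to match what `findAll(text=True)` would be expected to yield for the snippet in the question (with " and " between the two links, as in the HTML). The comparison with `strip()` is likewise only an illustration of how the words could end up glued together, not a confirmed description of the library's internals.

```python
import re

# Assumed text nodes for the question's snippet, written out by hand
# so the sketch does not depend on BeautifulSoup being installed.
text_nodes = [
    'Available in  \r\n',
    '\r\nFrench',
    ' and \r\n',
    '\r\nEnglish',
    '.\r\n',
]

def clean_line(line):
    # Remove CR/LF characters, then collapse runs of two or more spaces into one.
    return re.sub(r'[ ]{2,}', ' ', re.sub(r'[\r\n]', '', line))

# Stripping each node also discards the spaces that separate the words.
stripped = ''.join(node.strip() for node in text_nodes)

# Removing only the line breaks keeps those spaces.
cleaned = ''.join(clean_line(node) for node in text_nodes)

print(repr(stripped))  # 'Available inFrenchandEnglish.'
print(repr(cleaned))   # 'Available in French and English.'
```

The spaces that matter sit inside the text nodes, next to the line breaks, so any approach that trims each node before joining will lose them.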

Answer 2

I found a solution:

html_content = get_html_div_from_above()
para = BeautifulSoup(html_content)
txt = para.getText(separator=' ')

But it is not optimal, because it puts a space between every pair of tags:

u'Available in French and English .  '

Note the space before the period.
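One way to avoid the stray space is to join the text nodes with no separator and only afterwards collapse every run of whitespace into a single space. The sketch below is an illustration under stated assumptions: it swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without third-party packages, and `TextCollector` and `html_to_text` are hypothetical names introduced here. With BeautifulSoup, the same split-and-join step could presumably be applied to the joined result of `para.findAll(text=True)`.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every text node exactly as parsed."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep the raw text, including spaces and line breaks.
        self.chunks.append(data)

def html_to_text(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    # Join with no separator, then collapse all whitespace runs
    # (spaces, \r, \n) into single spaces and trim both ends.
    return ' '.join(''.join(collector.chunks).split())

html_content = (
    '<div align="justify"><b>Available in  \r\n'
    '<a href="http://www.example.com.be/book.php?number=1">\r\nFrench</a> and \r\n'
    '<a href="http://www.example.com.be/book.php?number=5">\r\nEnglish</a>.\r\n</div>'
)

print(html_to_text(html_content))  # Available in French and English.
```

Because no separator is inserted at tag boundaries, the period stays attached to "English"; the trade-off is that a line break between words is treated as a single space, which matches how a browser would render this snippet.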

How to prevent BeautifulSoup from stripping lines
