我正在尝试使用Python将html块转换为文本。
输入:
<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>
期望的输出:
Lorem ipsum dolor sit amet,consectetuer adipiscing elit。 Aenean commodo ligula eget dolor。 Aenean massa
Consectetuer adipiscing elit。 一些 链接Aenean commodo ligula eget dolor。 Aenean massa
Aenean massa.Lorem ipsum dolor sit amet,consectetuer adipiscing elit。 Aenean Philao ligula eget dolor。 Aenean massa
Lorem ipsum dolor坐 amet,consectetuer adipiscing elit。 Aenean commodo ligula eget dolor。 Aenean massa
Consectetuer adipiscing elit。 Aenean commodo ligula eget dolor。 Aenean massa
我尝试过使用html2text模块但没有太大的成功(我对python来说很新):)
这是我尝试过的:
#!/usr/bin/env python
import urllib2
import html2text
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://example.com/page.html').read())
txt = soup.find('div', {'class' : 'body'})
print html2text.html2text(txt)
“txt”对象生成上面的html块。我想将其转换为文本并在屏幕上打印。
非常感谢您对这段代码的任何帮助。
答案 0 :(得分:49)
我错过了什么? soup.get_text()
提供了您想要的完全相同的输出...
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
print soup.get_text()
输出
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
PS!确切地说,您可以用双倍替换换行符 - 然后它与您的示例相同:)
soup.get_text().replace('\n','\n\n')
答案 1 :(得分:2)
您可以使用正则表达式......但不建议使用...
以下代码只删除数据中的所有HTML标记,为您提供文字。
import re
data = """<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>"""
data = re.sub(r'<.*?>', '', data)
print data
<强>输出强>
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
答案 2 :(得分:2)
'\n'
在段落之间添加换行符。
from bs4 import Beautifulsoup
soup = Beautifulsoup(text)
print(soup.get_text('\n'))
答案 3 :(得分:1)
我非常喜欢@FrBrGeorge的 答案,以至于我将其扩展为仅提取body
标记,并添加了一种便捷方法,使得HTML到文本只需一行:>
from abc import ABC
from html.parser import HTMLParser
class HTMLFilter(HTMLParser, ABC):
"""
A simple no dependency HTML -> TEXT converter.
Usage:
str_output = HTMLFilter.convert_html_to_text(html_input)
"""
def __init__(self, *args, **kwargs):
self.text = ''
self.in_body = False
super().__init__(*args, **kwargs)
def handle_starttag(self, tag: str, attrs):
if tag.lower() == "body":
self.in_body = True
def handle_endtag(self, tag):
if tag.lower() == "body":
self.in_body = False
def handle_data(self, data):
if self.in_body:
self.text += data
@classmethod
def convert_html_to_text(cls, html: str) -> str:
f = cls()
f.feed(html)
return f.text.strip()
查看用法说明。
这将转换body
内的所有文本,理论上可以包含style
和script
标签。可以通过扩展body
所示的模式来实现进一步过滤-即设置实例变量in_style
或in_script
。
答案 4 :(得分:1)
主要问题是您如何保留一些基本格式。这是我自己保留新行和项目符号的最小方法。我确信这不是您想要保留的所有内容的解决方案,但它是一个起点:
from bs4 import BeautifulSoup
def parse_html(html):
elem = BeautifulSoup(html, features="html.parser")
text = ''
for e in elem.descendants:
if isinstance(e, str):
text += e.strip()
elif e.name in ['br', 'p', 'h1', 'h2', 'h3', 'h4','tr', 'th']:
text += '\n'
elif e.name == 'li':
text += '\n- '
return text
上面为 'br', 'p', 'h1', 'h2', 'h3', 'h4','tr', 'th'
添加了一个新行
并在 -
元素的文本前添加一个带有 li
的新行
答案 5 :(得分:0)
我需要一种在客户端系统上执行此操作的方法,而无需下载其他库。我从来没有找到一个好的解决方案,所以我自己创造了。如果您愿意,请随意使用。
import urllib
def html2text(strText):
str1 = strText
int2 = str1.lower().find("<body")
if int2>0:
str1 = str1[int2:]
int2 = str1.lower().find("</body>")
if int2>0:
str1 = str1[:int2]
list1 = ['<br>', '<tr', '<td', '</p>', 'span>', 'li>', '</h', 'div>' ]
list2 = [chr(13), chr(13), chr(9), chr(13), chr(13), chr(13), chr(13), chr(13)]
bolFlag1 = True
bolFlag2 = True
strReturn = ""
for int1 in range(len(str1)):
str2 = str1[int1]
for int2 in range(len(list1)):
if str1[int1:int1+len(list1[int2])].lower() == list1[int2]:
strReturn = strReturn + list2[int2]
if str1[int1:int1+7].lower() == '<script' or str1[int1:int1+9].lower() == '<noscript':
bolFlag1 = False
if str1[int1:int1+6].lower() == '<style':
bolFlag1 = False
if str1[int1:int1+7].lower() == '</style':
bolFlag1 = True
if str1[int1:int1+9].lower() == '</script>' or str1[int1:int1+11].lower() == '</noscript>':
bolFlag1 = True
if str2 == '<':
bolFlag2 = False
if bolFlag1 and bolFlag2 and (ord(str2) != 10) :
strReturn = strReturn + str2
if str2 == '>':
bolFlag2 = True
if bolFlag1 and bolFlag2:
strReturn = strReturn.replace(chr(32)+chr(13), chr(13))
strReturn = strReturn.replace(chr(9)+chr(13), chr(13))
strReturn = strReturn.replace(chr(13)+chr(32), chr(13))
strReturn = strReturn.replace(chr(13)+chr(9), chr(13))
strReturn = strReturn.replace(chr(13)+chr(13), chr(13))
strReturn = strReturn.replace(chr(13), '\n')
return strReturn
url = "http://www.theguardian.com/world/2014/sep/25/us-air-strikes-islamic-state-oil-isis"
html = urllib.urlopen(url).read()
print html2text(html)
答案 6 :(得分:0)
可以使用BeautifulSoup删除不需要的脚本等,但您可能需要尝试一些不同的网站,以确保您已涵盖了要排除的不同类型的内容。试试这个:
from requests import get
from bs4 import BeautifulSoup as BS
response = get('http://news.bbc.co.uk/2/hi/health/2284783.stm')
soup = BS(response.content, "html.parser")
for child in soup.body.children:
if child.name == 'script':
child.decompose()
print(soup.body.get_text())
答案 7 :(得分:0)
可以使用python标准html.parser
:
from html.parser import HTMLParser
class HTMLFilter(HTMLParser):
text = ""
def handle_data(self, data):
self.text += data
f = HTMLFilter()
f.feed(data)
print(f.text)
答案 8 :(得分:0)
这里有一些不错的东西,我不妨提出自己的解决方案:
from html.parser import HTMLParser
def _handle_data(self, data):
self.text += data + '\n'
HTMLParser.handle_data = _handle_data
def get_html_text(html: str):
parser = HTMLParser()
parser.text = ''
parser.feed(html)
return parser.text.strip()
答案 9 :(得分:0)
gazpacho可能是一个不错的选择!
输入:
from gazpacho import Soup
html = """\
<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>
"""
输出:
text = Soup(html).strip(whitespace=False) # to keep "\n" characters intact
print(text)
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa