from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print line
When printing a line of an HTML file, I'm trying to find a way to show only the contents of each HTML element, not the markup itself. If it finds '<a href="whatever.com">some text</a>', it should print only 'some text'; '<b>hello</b>' prints 'hello'; and so on. How would one go about doing this?
Answer 0 (score: 386)

I always use this function to strip HTML tags, as it requires only the Python stdlib.

On Python 2:
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
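For example, on the question's own sample input (Python 2):

print strip_tags('<a href="whatever.com">some text</a>')  # prints: some text
print strip_tags('<b>hello</b>')                          # prints: hello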
For Python 3:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
Note: this only works on Python 3.1. For 3.2 or above, you need to call the parent class's __init__ function. See Using HTMLParser in Python 3.2.
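A minimal sketch of that fix (the same change Answer 17 below arrives at):

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()   # required on Python 3.2+
        self.reset()
        self.convert_charrefs = True
        self.fed = []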
Answer 1 (score: 138)

I haven't thought about the cases it would miss, but you can do a simple regex:

import re
re.sub('<[^<]+?>', '', text)
For those that don't understand regex, this searches for a string '<...>' whose inner content is made of one or more ('+') characters that are not a '<'. The '?' means that it matches the smallest string it can find. For example, given '<p>Hello</p>', it will match '<p>' and '</p>' separately with the '?'. Without it, it would match the entire string '<..Hello..>'.

If a non-tag '<' appears in the HTML (e.g. '2 < 3'), it should be written as an escape sequence ('&lt;') anyway, so the '?' may be unnecessary.
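A quick demonstration of the regex in action:

import re

text = '<b>hello</b> <a href="whatever.com">some text</a>'
print(re.sub('<[^<]+?>', '', text))  # -> hello some text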
Answer 2 (score: 41)

Why are all of you doing it the hard way? You can use BeautifulSoup's get_text() feature.
from bs4 import BeautifulSoup
html_str = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(html_str, 'html.parser')  # pass a parser explicitly to avoid bs4's "no parser specified" warning
print(soup.get_text())
#or via attribute of Soup Object: print(soup.text)
Answer 3 (score: 29)
import re, cgi
tag_re = re.compile(r'(<!--.*?-->|<[^>]*>)')
# Remove well-formed tags, fixing mistakes by legitimate users
no_tags = tag_re.sub('', user_input)
# Clean up anything else by escaping
ready_for_web = cgi.escape(no_tags)
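For instance, with a hypothetical user_input value (the variable name is assumed by the snippet above):

user_input = 'Hello <b>world<!-- --></b> 2 < 3'
# tag_re strips '<b>', '<!-- -->' and '</b>'; cgi.escape then escapes the bare '<':
# ready_for_web == 'Hello world 2 &lt; 3'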
Regex source: MarkupSafe. Their version handles HTML entities too, while this quick one doesn't.

It's one thing to keep people from <i>italicizing</i> things without leaving stray i's floating around. But it's another to take arbitrary input and make it completely harmless. Most of the techniques on this page will leave intact things like unclosed comments (<!--) and angle brackets that aren't part of tags (blah <<<><blah). The HTMLParser version can even leave complete tags in, if they're inside an unclosed comment.

What if your template is {{ firstname }} {{ lastname }}? firstname = '<a' and lastname = 'href="http://evil.com/">' will be let through by every tag stripper on this page (except @Medeiros!), because they're not complete tags on their own. Stripping out normal HTML tags is not enough.
Django's strip_tags, an improved (see the next heading) version of the top answer to this question, gives the following warning:

Absolutely NO guarantee is provided about the resulting string being HTML safe. So NEVER mark safe the result of a strip_tags call without escaping it first, for example with escape().

Follow their advice!
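As a minimal sketch of what that advice looks like in practice (assuming Django is available):

from django.utils.html import strip_tags, escape

def safe_excerpt(untrusted_html):
    # strip_tags output is NOT guaranteed to be HTML safe; escape it before output
    return escape(strip_tags(untrusted_html))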
It's easy to get around the top answer to this question. Look at this string (source and discussion):

<img<!-- --> src=x onerror=alert(1);//><!-- -->

The first time HTMLParser sees it, it can't tell that the <img...> is a tag. It looks broken, so HTMLParser doesn't get rid of it. It only takes out the <!-- comments -->, leaving you with

<img src=x onerror=alert(1);//>

This problem was disclosed to the Django project in March 2014. Their old strip_tags was essentially the same as the top answer to this question. Their new version basically runs it in a loop until running it again doesn't change the string:
# _strip_once runs HTMLParser once, pulling out just the text of all the nodes.

def strip_tags(value):
    """Returns the given HTML with all tags stripped."""
    # Note: in typical case this loop executes _strip_once once. Loop condition
    # is redundant, but helps to reduce number of executions of _strip_once.
    while '<' in value and '>' in value:
        new_value = _strip_once(value)
        if len(new_value) >= len(value):
            # _strip_once was not able to detect more tags
            break
        value = new_value
    return value
Of course, none of this is an issue if you always escape the result of strip_tags().

Update, 19 March 2015: there was a bug in Django versions before 1.4.20, 1.6.11, 1.7.7, and 1.8c1. These versions could enter an infinite loop in the strip_tags() function. The fixed version is reproduced above. More details here.

My example code doesn't handle HTML entities; the packaged Django and MarkupSafe versions do.

My example code is taken from the excellent MarkupSafe library for cross-site scripting prevention. It's convenient and fast (with C speedups to its native Python version). It's included in Google App Engine, and used by Jinja2 (2.7 and up), Mako, Pylons, and more. It works easily with Django templates from Django 1.7.

Django's strip_tags and other html utilities from a recent version are good, but I find them less convenient than MarkupSafe. They're pretty self-contained; you can copy what you need from this file.

If you need to strip almost all tags, the Bleach library is good. You can have it enforce rules like "my users can italicize things, but they can't make iframes."

Understand the properties of your tag stripper! Run fuzz tests on it! Here is the code I used to do the research for this answer.

Sheepish note: the question itself is about printing to the console, but this is the top Google result for "python strip html from string", which is why this answer is 99% about the web.
Answer 4 (score: 28)

I needed a way to strip tags and decode HTML entities to plain text. The following solution is based on Eloff's answer (which I couldn't use because it strips entities).
from HTMLParser import HTMLParser
import htmlentitydefs

class HTMLTextExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.result = [ ]

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in (u'x', u'X') else int(number)
        self.result.append(unichr(codepoint))

    def handle_entityref(self, name):
        codepoint = htmlentitydefs.name2codepoint[name]
        self.result.append(unichr(codepoint))

    def get_text(self):
        return u''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()
A quick test:

html = u'<a href="#">Demo <em>(&not; \u0394&#951;&#956;&#x3ce;)</em></a>'
print repr(html_to_text(html))

The result:

u'Demo (\xac \u0394\u03b7\u03bc\u03ce)'
Error handling: invalid named entities (such as '&apos;', which is valid in XML and XHTML but not in plain HTML) will cause a ValueError exception.

Security note: do not confuse HTML stripping (converting HTML into plain text) with HTML sanitizing (converting plain text into HTML). This answer removes HTML and decodes entities into plain text; that does not make the result safe to use in an HTML context.

Example: '&lt;script&gt;alert("Hello");&lt;/script&gt;' will be converted to '<script>alert("Hello");</script>', which is 100% correct behavior, but obviously not sufficient if the resulting plain text is inserted as-is into an HTML page.

The rule is not hard: any time you insert a plain-text string into HTML output, you should always HTML-escape it (using cgi.escape(s, True)), even if you "know" it doesn't contain HTML (e.g. because you stripped the HTML content).

(However, the OP asked about printing the result to the console, in which case no HTML escaping is needed.)
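For illustration, escaping a stripped string before re-inserting it into HTML (Python 2 here; on Python 3, use html.escape):

import cgi
print cgi.escape(u'2 < 3 & "quotes"', True)  # -> 2 &lt; 3 &amp; &quot;quotes&quot;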
Python 3.4+ version (with doctest!):
import html.parser

class HTMLTextExtractor(html.parser.HTMLParser):
    def __init__(self):
        super(HTMLTextExtractor, self).__init__()
        self.result = [ ]

    def handle_data(self, d):
        self.result.append(d)

    def get_text(self):
        return ''.join(self.result)

def html_to_text(html):
    """Converts HTML to plain text (stripping tags and converting entities).
    >>> html_to_text('<a href="#">Demo<!--...--> <em>(&not; \u0394&#951;&#956;&#x3ce;)</em></a>')
    'Demo (\xac \u0394\u03b7\u03bc\u03ce)'

    "Plain text" doesn't mean the result can safely be used as-is in HTML.
    >>> html_to_text('&lt;script&gt;alert("Hello");&lt;/script&gt;')
    '<script>alert("Hello");</script>'

    Always use html.escape to sanitize text before using in an HTML context!

    HTMLParser will do its best to make sense of invalid HTML.
    >>> html_to_text('x < y &lt z <!--b')
    'x < y < z '

    Unrecognized named entities are included as-is. '&apos;' is recognized,
    despite being XML only.
    >>> html_to_text('&nosuchentity; &apos; ')
    "&nosuchentity; ' "
    """
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()
Note that HTMLParser has been improved in Python 3 (meaning less code and better error handling).

Answer 5 (score: 18)

There's a simple way to do this:
def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
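For example, on the question's sample input:

print(remove_html_markup('<a href="whatever.com">some text</a>'))  # -> some text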
The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it in action here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with python), here's a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome! :)
Answer 6 (score: 16)

If you need to preserve HTML entities (i.e. '&amp;'), I added a 'handle_entityref' method to Eloff's answer.
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
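A quick check that entities survive (Python 2):

print html_to_text('up &amp; down <b>bold</b>')  # -> up &amp; down bold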
Answer 7 (score: 12)

If you want to strip all HTML tags, the easiest way I've found is using BeautifulSoup:
from bs4 import BeautifulSoup  # Or: from BeautifulSoup import BeautifulSoup

def stripHtmlTags(htmlTxt):
    if htmlTxt is None:
        return None
    else:
        return ''.join(BeautifulSoup(htmlTxt).findAll(text=True))
I tried the code of the accepted answer, but I was getting "RuntimeError: maximum recursion depth exceeded", which didn't happen with the code block above.

Answer 8 (score: 9)

Here's a solution based on lxml.html (lxml is a native library, and therefore much faster than any pure-python solution).
from lxml import html
from lxml.html.clean import clean_html

tree = html.fromstring("""<span class="item-summary">
    Detailed answers to any questions you might have
    </span>""")
print(clean_html(tree).text_content().strip())
# >>> Detailed answers to any questions you might have
See also http://lxml.de/lxmlhtml.html#cleaning-up-html for what exactly the lxml.cleaner does.

If you need more control over what is sanitized before converting to text, you may want to use the lxml Cleaner explicitly, passing the options you want in the constructor, e.g.:
from lxml.html.clean import Cleaner

cleaner = Cleaner(page_structure=True,
                  meta=True,
                  embedded=True,
                  links=True,
                  style=True,
                  processing_instructions=True,
                  inline_style=True,
                  scripts=True,
                  javascript=True,
                  comments=True,
                  frames=True,
                  forms=True,
                  annoying_tags=True,
                  remove_unknown_tags=True,
                  safe_attrs_only=True,
                  safe_attrs=frozenset(['src','color', 'href', 'title', 'class', 'name', 'id']),
                  remove_tags=('span', 'font', 'div')
                  )

sanitized_html = cleaner.clean_html(unsafe_html)
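To then reduce the sanitized markup to plain text, one option is lxml's text_content() (assuming unsafe_html is a string, so clean_html returns a string):

from lxml import html
text = html.fromstring(sanitized_html).text_content()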
Answer 9 (score: 7)

The Beautiful Soup package does this for you right away.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # pass a parser explicitly to avoid bs4's warning
text = soup.get_text()
print(text)
Answer 10 (score: 2)

You can use a different HTML parser (like lxml or Beautiful Soup), one that offers functions to extract just the text. Alternatively, you can run a regex on the line string that strips out the tags. See http://www.amk.ca/python/howto/regex/ for more.
Answer 11 (score: 1)

The HTML-Parser solutions are breakable if they only run once:

html_to_text('<<b>script>alert("hacked")<</b>/script>')

results in:

<script>alert("hacked")</script>

which is exactly what you intended to prevent. If you use an HTML-Parser, count the tags until zero of them are replaced:
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
        self.containstags = False

    def handle_starttag(self, tag, attrs):
        self.containstags = True

    def handle_data(self, d):
        self.fed.append(d)

    def has_tags(self):
        return self.containstags

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    must_filtered = True
    while must_filtered:
        s = MLStripper()
        s.feed(html)
        html = s.get_data()
        must_filtered = s.has_tags()
    return html
Answer 12 (score: 1)

Here's a solution similar to the currently accepted answer (https://stackoverflow.com/a/925630/95989), except that it uses the internal HTMLParser class directly (i.e. no subclassing), thereby making it significantly more terse:
from html.parser import HTMLParser

def strip_html(text):
    parts = []
    parser = HTMLParser()
    parser.handle_data = parts.append
    parser.feed(text)
    return ''.join(parts)
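For example:

print(strip_html('<a href="whatever.com">some <b>text</b></a>'))  # -> some text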
Answer 13 (score: 1)

For one project, I needed to strip HTML, but also css and js. Thus, I made this variation of Eloff's answer:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.fed = []
        self.css = False
    def handle_starttag(self, tag, attrs):
        if tag == "style" or tag == "script":
            self.css = True
    def handle_endtag(self, tag):
        if tag == "style" or tag == "script":
            self.css = False
    def handle_data(self, d):
        if not self.css:
            self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
Answer 14 (score: 1)

Here is a simple solution that strips HTML tags and decodes HTML entities, based on the amazingly fast lxml library:
from lxml import html

def strip_html(s):
    return html.fromstring(s).text_content()

strip_html('Ein <a href="">schöner</a> Text.')  # Output: Ein schöner Text.
Answer 15 (score: 1)

A Python 3 adaption of søren-løvborg's answer:
from html.parser import HTMLParser
from html.entities import html5

class HTMLTextExtractor(HTMLParser):
    """ Adaption of http://stackoverflow.com/a/7778368/196732 """
    def __init__(self):
        # convert_charrefs=False so the charref/entityref handlers below are actually called
        super().__init__(convert_charrefs=False)
        self.result = []

    def handle_data(self, d):
        self.result.append(d)

    def handle_charref(self, number):
        codepoint = int(number[1:], 16) if number[0] in (u'x', u'X') else int(number)
        self.result.append(chr(codepoint))  # Python 3: chr, not unichr

    def handle_entityref(self, name):
        # html5 maps names like 'amp;' (with trailing semicolon) to the character itself
        if name + ';' in html5:
            self.result.append(html5[name + ';'])

    def get_text(self):
        return u''.join(self.result)

def html_to_text(html):
    s = HTMLTextExtractor()
    s.feed(html)
    return s.get_text()
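A quick check of the entity handling (note that unknown entities are dropped):

print(html_to_text('&copy; 2024 &hearts; &nosuchentity;'))  # -> '© 2024 ♥ '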
Answer 16 (score: 1)

This is a quick fix and can be optimized even further, but it works fine. This code replaces all non-empty tags with "" and strips all HTML tags from the given input text. You can run it with ./file.py input output

#!/usr/bin/python
import sys

def replace(strng,replaceText):
    rpl = 0
    while rpl > -1:
        rpl = strng.find(replaceText)
        if rpl != -1:
            strng = strng[0:rpl] + strng[rpl + len(replaceText):]
    return strng

lessThanPos = -1
count = 0
listOf = []

try:
    #write File
    writeto = open(sys.argv[2],'w')
    #read file and store it in list
    f = open(sys.argv[1],'r')
    for readLine in f.readlines():
        listOf.append(readLine)
    f.close()

    #remove all tags
    for line in listOf:
        count = 0
        lessThanPos = -1
        lineTemp = line
        for char in lineTemp:
            if char == "<":
                lessThanPos = count
            if char == ">":
                if lessThanPos > -1:
                    if line[lessThanPos:count + 1] != '<>':
                        lineTemp = replace(lineTemp,line[lessThanPos:count + 1])
                        lessThanPos = -1
            count = count + 1
        lineTemp = lineTemp.replace("&lt;","<")
        lineTemp = lineTemp.replace("&gt;",">")
        writeto.write(lineTemp)
    writeto.close()
    print "Write To --- >" , sys.argv[2]
except:
    print "Help: invalid arguments or exception"
    print "Usage : ",sys.argv[0]," inputfile outputfile"
Answer 17 (score: 1)

I had used Eloff's answer successfully with Python 3.1 [many thanks!]. I upgraded to Python 3.2.3, and ran into errors.

The solution, provided here thanks to the responder Thomas K, is to insert super().__init__() into the following code:

def __init__(self):
    self.reset()
    self.fed = []

...in order to make it look like this:

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

...and it works for Python 3.2.3. Again, thanks to Thomas K for the fix, and to Eloff for the original code provided above!
Answer 18 (score: 0)
# This is a regex solution.
import re

def removeHtml(html):
    if not html: return html
    # Remove comments first
    innerText = re.compile('<!--[\s\S]*?-->').sub('',html)
    while innerText.find('>') >= 0:  # Loop through nested Tags
        text = re.compile('<[^<>]+?>').sub('',innerText)
        if text == innerText:
            break
        innerText = text
    return innerText.strip()
Answer 19 (score: 0)

This is how I do it, but I have no idea what I'm doing. I take data from an HTML table by stripping out the HTML tags.

This takes the string "name" and returns the string "name1" without HTML tags.
x = 0
anglebrackets = 0
name1 = ""
while x < len(name):
    if name[x] == "<":
        anglebrackets = anglebrackets + 1
    if name[x] == ">":
        anglebrackets = anglebrackets - 1
    if anglebrackets == 0:
        if name[x] != ">":
            name1 = name1 + name[x]
    x = x + 1
Answer 20 (score: 0)

2020 update

Use the Mozilla Bleach library. It really lets you customize which tags and which attributes to keep, and also filter out attributes based on their values.

Here are 2 cases to demonstrate:

1) Do not allow any HTML tags or attributes

Take this sample raw text:
raw_text = """
<p><img width="696" height="392" src="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg" class="attachment-medium_large size-medium_large wp-post-image" alt="Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC" style="float:left; margin:0 15px 15px 0;" srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w" sizes="(max-width: 696px) 100vw, 696px" />Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform’s users. Also as part […]</p>
<p>The post <a rel="nofollow" href="https://news.bitcoin.com/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc/">Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC</a> appeared first on <a rel="nofollow" href="https://news.bitcoin.com">Bitcoin News</a>.</p>
"""
2) Remove all HTML tags and attributes from the raw text
# DO NOT ALLOW any tags or any attributes
from bleach.sanitizer import Cleaner
cleaner = Cleaner(tags=[], attributes={}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))
Output:
Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform’s users. Also as part […]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News.
3) Only allow img tags with the srcset attribute
from bleach.sanitizer import Cleaner

# ALLOW ONLY img tags with the srcset attribute
cleaner = Cleaner(tags=['img'], attributes={'img': ['srcset']}, styles=[], protocols=[], strip=True, strip_comments=True, filters=None)
print(cleaner.clean(raw_text))
print(cleaner.clean(raw_text))
Output:
<img srcset="https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-768x432.jpg 768w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-300x169.jpg 300w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1024x576.jpg 1024w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-696x392.jpg 696w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-1068x601.jpg 1068w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-747x420.jpg 747w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-190x107.jpg 190w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-380x214.jpg 380w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc-760x428.jpg 760w, https://news.bitcoin.com/wp-content/uploads/2020/08/ethereum-classic-51-attack-okex-crypto-exchange-suffers-5-6-million-loss-contemplates-delisting-etc.jpg 1280w">Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform’s users. Also as part […]
The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News.
Answer 21 (score: 0)

hext is a package that can, among other things, strip HTML. It is an alternative to beautifulsoup. The following was tested with hext==0.2.3.

Save this in a utility module, e.g. util/hext.py:
import hext
_HTML_TEXT_RULE = hext.Rule('<html @text:text />')
def html_to_text(text: str) -> str:
    # Ref: https://stackoverflow.com/a/56894409/
    return _HTML_TEXT_RULE.extract(hext.Html(f'<html>{text}</html>'))[0]['text']
Usage examples, including with malformed HTML:
>>> from .util.hext import html_to_text
>>> html_to_text('<b>Hello world!</b>')
'Hello world!'
>>> html_to_text('<a href="google.com">some text</a>')
'some text'
>>> html_to_text('<span class="small-caps">l</span>-arginine minimizes immunosuppression and prothrombin time and enhances the genotoxicity of 5-fluorouracil in rats')
'l-arginine minimizes immunosuppression and prothrombin time and enhances the genotoxicity of 5-fluorouracil in rats'
>>> html_to_text('Attenuation of diabetic nephropathy by dietary fenugreek (<em>Trigonella foenum-graecum</em>) seeds and onion (<em>Allium cepa</em>) <em>via</em> suppression of glucose transporters and renin-angiotensin system')
'Attenuation of diabetic nephropathy by dietary fenugreek (Trigonella foenum-graecum) seeds and onion (Allium cepa) via suppression of glucose transporters and renin-angiotensin system'
Answer 22 (score: 0)

Simple code! This will remove all kinds of tags and the content inside them.
def rm(s):
    start = False
    end = False
    s = ' ' + s
    for i in range(len(s)-1):
        if i < len(s):
            if start != False:
                if s[i] == '>':
                    end = i
                    s = s[:start] + s[end+1:]
                    start = end = False
            else:
                if s[i] == '<':
                    start = i
    if s.count('<') > 0:
        s = rm(s)  # recurse on any remaining tags (was 'self.rm(s)', a bug outside a class)
    else:
        s = s.replace('&nbsp;', ' ')
    return s
However, this won't give complete results if the text itself contains <> symbols.

Answer 23 (score: 0)

Here is my solution for python 3.
import html
import re

def html_to_txt(html_text):
    ## unescape html
    txt = html.unescape(html_text)
    tags = re.findall("<[^>]+>", txt)
    print("found tags: ")
    print(tags)
    for tag in tags:
        txt = txt.replace(tag, '')
    return txt
Not sure if it is perfect, but it solved my use case and seems simple enough.

Answer 24 (score: 0)

With BeautifulSoup, html2text, or the code from @Eloff, most of the time some html elements or javascript code still remain.

So you can use a combination of these libraries and also delete markdown formatting (Python 3):
import re
import html2text
from bs4 import BeautifulSoup

def html2Text(html):
    def removeMarkdown(text):
        for current in ["^[ #*]{2,30}", "^[ ]{0,30}\d\\\.", "^[ ]{0,30}\d\."]:
            markdown = re.compile(current, flags=re.MULTILINE)
            text = markdown.sub(" ", text)
        return text

    def removeAngular(text):
        angular = re.compile("[{][|].{2,40}[|][}]|[{][*].{2,40}[*][}]|[{][{].{2,40}[}][}]|\[\[.{2,40}\]\]")
        text = angular.sub(" ", text)
        return text

    h = html2text.HTML2Text()
    h.images_to_alt = True
    h.ignore_links = True
    h.ignore_emphasis = False
    h.skip_internal_links = True
    text = h.handle(html)
    soup = BeautifulSoup(text, "html.parser")
    text = soup.text
    text = removeAngular(text)
    text = removeMarkdown(text)
    return text
It works well for me, but it can be enhanced, of course...

Answer 25 (score: 0)

I was parsing Github readmes and I found that the following really works well:
import re
import lxml.html

def strip_markdown(x):
    links_sub = re.sub(r'\[(.+)\]\([^\)]+\)', r'\1', x)
    bold_sub = re.sub(r'\*\*([^*]+)\*\*', r'\1', links_sub)
    emph_sub = re.sub(r'\*([^*]+)\*', r'\1', bold_sub)
    return emph_sub

def strip_html(x):
    return lxml.html.fromstring(x).text_content() if x else ''
And then:
readme = """<img src="https://raw.githubusercontent.com/kootenpv/sky/master/resources/skylogo.png" />
sky is a web scraping framework, implemented with the latest python versions in mind (3.4+).
It uses the asynchronous `asyncio` framework, as well as many popular modules
and extensions.
Most importantly, it aims for **next generation** web crawling where machine intelligence
is used to speed up the development/maintainance/reliability of crawling.
It mainly does this by considering the user to be interested in content
from *domains*, not just a collection of *single pages*
([templating approach](#templating-approach))."""
strip_markdown(strip_html(readme))
correctly removes all the markdown and HTML.

Answer 26 (score: 0)

You can write your own function:
def StripTags(text):
    finished = 0
    while not finished:
        finished = 1
        # check if an opening bracket is left
        start = text.find("<")
        if start >= 0:
            # check whether the tag is closed
            stop = text[start:].find(">")
            if stop >= 0:
                # cut out the tag and keep looping
                text = text[:start] + text[start+stop+1:]
                finished = 0
    return text
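For example:

print(StripTags('<td><b>1,000</b> widgets</td>'))  # -> 1,000 widgets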
Answer 27 (score: -2)

This approach works flawlessly for me and requires no additional installations:
import re
import htmlentitydefs

def convertentity(m):
    if m.group(1) == '#':
        try:
            return unichr(int(m.group(2)))
        except ValueError:
            return '&#%s;' % m.group(2)
    try:
        return htmlentitydefs.entitydefs[m.group(2)]
    except KeyError:
        return '&%s;' % m.group(2)

def converthtml(s):
    return re.sub(r'&(#?)(.+?);', convertentity, s)

html = converthtml(html)
html.replace("&nbsp;", " ")  ## Get rid of the remnants of certain formatting (subscript, superscript, etc.)