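I need to clean up an HTML file, for example by removing redundant 'span' tags. A 'span' is considered redundant if it has the same font-weight and font-style formatting as its parent node, according to the CSS file (which I have converted to a dictionary for faster lookup).
The HTML file looks like this: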
<p class="Title">blablabla <span id = "xxxxx">bla</span> prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss <span id = "bbbbbb"> aa </span> </p>
The CSS styles that I have already stored in a dictionary:
{'xxxxx':'font-weight: bold; font-size: 8.0pt; font-style: oblique',
'yyyyy':'font-weight: normal; font-size: 9.0pt; font-style: italic',
'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic',
'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal',
'Title': 'font-style: oblique; text-align: center; font-weight: bold',
'norm': 'font-style: normal; text-align: center; font-weight: normal'}
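So, given that <p class="Title"> and <span id="xxxxx">, as well as <p class="norm"> and <span id="bbbbbb">, have the same font-weight and font-style in the CSS dictionary, I want to get the following result: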
<p class= "Title">blablabla bla prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss aa </p>
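In addition, I can remove spans just by looking at their ids: if an id contains "af", I remove the span without consulting the dictionary.
So, in my script I have: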
from lxml import etree
from asteval import Interpreter

tree = etree.parse("filename.html")
aeval = Interpreter()
filedic = open('dic_file', 'rb')
fileread = filedic.read()
new_dic = aeval(fileread)

def no_af(tree):
    for badspan in tree.xpath("//span[contains(@id, 'af')]"):
        badspan.getparent().remove(badspan)
    return tree

def no_normal():
    no_af(tree)
    for span in tree.xpath('.//span'):
        span_id = span.xpath('@id')
        for x in span_id:
            if x in new_dic:
                get_style = x
        parent = span.getparent()
        par_span = parent.xpath('@class')
        if par_span:
            for ID in par_span:
                if ID in new_dic:
                    get_par_style = ID
                    if 'font-weight' in new_dic[get_par_style] and 'font-style' in new_dic[get_par_style]:
                        if 'font-weight' in new_dic[get_style] and 'font-style' in new_dic[get_style]:
                            if new_dic[get_par_style]['font-weight']==new_dic[get_style]['font-weight'] and new_dic[get_par_style]['font-style']==new_dic[get_style]['font-style']:
                                etree.strip_tags(parent, 'span')
    print etree.tostring(tree, pretty_print =True, method = "html", encoding = "utf-8")
This results in:
AttributeError: 'NoneType' object has no attribute 'xpath'
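I know that it is exactly the line etree.strip_tags(parent, 'span') that causes the error, because when I comment it out and put a print after any of the other lines, everything works.
Also, I am not sure whether etree.strip_tags(parent, 'span') does what I need. What if the parent contains several spans with different formatting? Will this command strip all of them anyway? I actually need to strip only one span: the current one from the loop at the start of the function, for span in tree.xpath('.//span'):
I have been staring at this bug all day and I think I am overlooking something... I desperately need your help!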
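Answer 0 (score: 2)
lxml is great, but it provides a very low-level "etree" data structure and does not have the most extensive set of editing operations built in. What you need is an "unwrap" operation that you can apply to a single element, one that keeps the element's text, any child elements, and its "tail" in the tree, but not the element itself. Here is such an operation (plus a needed helper function):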
def noneCat(*args):
    """
    Concatenate arguments. Treats None as the empty string, though it returns
    the None object if all the args are None. That might not seem sensible, but
    it works well for managing lxml text components.
    """
    for ritem in args:
        if ritem is not None:
            break
    else:
        # Executed only if loop terminates through normal exhaustion, not via break
        return None

    # Otherwise, grab their string representations (empty string for None)
    return ''.join((str(v) if v is not None else "") for v in args)

def unwrap(e):
    """
    Unwrap the element. The element is deleted and all of its children
    are pasted in its place.
    """
    parent = e.getparent()
    prev = e.getprevious()
    kids = list(e)
    siblings = list(parent)

    # parent inherits children, if any
    sibnum = siblings.index(e)
    if kids:
        parent[sibnum:sibnum+1] = kids
    else:
        parent.remove(e)

    # prev node or parent inherits text
    if prev is not None:
        prev.tail = noneCat(prev.tail, e.text)
    else:
        parent.text = noneCat(parent.text, e.text)

    # last child, prev node, or parent inherits tail
    if kids:
        last_child = kids[-1]
        last_child.tail = noneCat(last_child.tail, e.tail)
    elif prev is not None:
        prev.tail = noneCat(prev.tail, e.tail)
    else:
        parent.text = noneCat(parent.text, e.tail)
    return e
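The unwrap function above depends on how ElementTree-style trees store text: text before the first child lives in the parent's .text, and text following an element's closing tag lives in that element's .tail. Here is a minimal sketch of that model. It uses the standard library's xml.etree.ElementTree rather than lxml (my choice, so that it runs without extra packages); lxml.etree follows the same text/tail convention. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p>blablabla <span id="xxxxx">bla</span> prprpr </p>')
span = p[0]

# Three separate places hold the visible text of this paragraph:
#   p.text    -> text before the first child
#   span.text -> text inside the span
#   span.tail -> text after </span>, up to the next tag
parts = (p.text, span.text, span.tail)

# Unwrapping by hand for the simplest case (span is the first child and has
# no children of its own): fold its text and tail into the parent's text,
# then drop the element. Removing the element also removes its tail,
# which is why the tail has to be copied first.
p.text = (p.text or "") + (span.text or "") + (span.tail or "")
p.remove(span)

result = ET.tostring(p, encoding="unicode")
print(parts)
print(result)
```

The other branches in unwrap cover the remaining cases: when the span has a previous sibling, the text goes onto that sibling's tail instead of the parent's text, and when the span has children, the last child inherits the span's tail.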
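Now, you have already done part of the work of decomposing the CSS and deciding whether the specification for one CSS selector (span#id) should be considered redundant with respect to another selector (p.class). Let's extend that and wrap it in a function: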
cssdict = { 'xxxxx':'font-weight: bold; font-size: 8.0pt; font-style: oblique',
            'yyyyy':'font-weight: normal; font-size: 9.0pt; font-style: italic',
            'aaaa': 'font-weight: bold; font-size: 9.0pt; font-style: italic',
            'bbbbbb': 'font-weight: normal; font-size: 9.0pt; font-style: normal',
            'Title': 'font-style: oblique; text-align: center; font-weight: bold',
            'norm': 'font-style: normal; text-align: center; font-weight: normal'
          }

RELEVANT = ['font-weight', 'font-style']

def parse_css_spec(s):
    """
    Decompose CSS style spec into a dictionary of its components.
    """
    parts = [ p.strip() for p in s.split(';') ]
    attpairs = [ p.split(':') for p in parts ]
    attpairs = [ (k.strip(), v.strip()) for k,v in attpairs ]
    return dict(attpairs)

cssparts = { k: parse_css_spec(v) for k,v in cssdict.items() }
# pprint(cssparts)

def redundant_span(span_css_name, parent_css_name, consider=RELEVANT):
    """
    Determine if a given span is redundant with respect to its parent,
    considering specific attribute names. If the span's attribute
    values are the same as the parent's, consider it redundant.
    """
    span_spec = cssparts[span_css_name]
    parent_spec = cssparts[parent_css_name]
    for k in consider:
        # Any differences => not redundant
        if span_spec[k] != parent_spec[k]:
            return False
    # Everything matches => is redundant
    return True
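One caveat about parse_css_spec as written: it unpacks every piece into exactly one key and one value, so a trailing semicolon (which leaves an empty piece) or a value that itself contains a colon (such as a URL) raises a ValueError. If your real CSS has either, a more forgiving variant might look like the sketch below. The name parse_css_spec_tolerant is hypothetical, and lowercasing the property names is my own assumption (CSS property names are case-insensitive). It still does not handle semicolons inside values.

```python
def parse_css_spec_tolerant(s):
    """
    Hypothetical, more forgiving variant of parse_css_spec.
    Skips empty pieces and splits each declaration on the first colon only.
    """
    spec = {}
    for part in s.split(';'):
        if not part.strip():
            continue  # empty piece, e.g. after a trailing ';'
        name, sep, value = part.partition(':')
        if not sep:
            continue  # no colon at all: not a declaration, ignore it
        spec[name.strip().lower()] = value.strip()
    return spec

trailing = parse_css_spec_tolerant('font-weight: bold; font-style: italic;')
with_url = parse_css_spec_tolerant(
    'background: url(http://example.com/a.png); font-style: normal')
print(trailing)
print(with_url)
```

For the dictionary shown in the question both versions give the same result, so this only matters if the input is messier than the sample.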
OK, with the preliminaries done, it is time for the main show:
import lxml.html
from lxml.html import tostring

source = """
<p class="Title">blablabla <span id = "xxxxx">bla</span> prprpr <span id = "yyyyy"> jj </span> </p>
<p class = "norm">blalbla <span id = "aaaa">ttt</span> sskkss <span id = "bbbbbb"> aa </span> </p>
"""

h = lxml.html.document_fromstring(source)

print("<!-- before -->")
# encoding="unicode" makes tostring return text rather than bytes
print(tostring(h, pretty_print=True, encoding="unicode"))
print("")

for span in h.xpath('//span[@id]'):
    span_id = span.attrib.get('id', None)
    parent_class = span.getparent().attrib.get('class', None)
    if parent_class is None:
        continue
    if redundant_span(span_id, parent_class):
        unwrap(span)

print("<!-- after -->")
print(tostring(h, pretty_print=True, encoding="unicode"))
This yields:
<!-- before -->
<html><body>
<p class="Title">blablabla <span id="xxxxx">bla</span> prprpr <span id="yyyyy"> jj </span> </p>
<p class="norm">blalbla <span id="aaaa">ttt</span> sskkss <span id="bbbbbb"> aa </span> </p>
</body></html>
<!-- after -->
<html><body>
<p class="Title">blablabla bla prprpr <span id="yyyyy"> jj </span> </p>
<p class="norm">blalbla <span id="aaaa">ttt</span> sskkss aa </p>
</body></html>
Update
On second thought, you don't need unwrap. I used it because it was handy in my toolbox. You can do without it by using a mark-and-sweep approach with etree.strip_tags, like this:
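from lxml import etree

# Mark: rename each redundant span to a tag name that occurs nowhere else
for span in h.xpath('//span[@id]'):
    span_id = span.attrib.get('id', None)
    parent_class = span.getparent().attrib.get('class', None)
    if parent_class is None:
        continue
    if redundant_span(span_id, parent_class):
        span.tag = "JUNK"

# Sweep: strip all marked elements in one pass, keeping their text and tails
etree.strip_tags(h, "JUNK")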
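This also explains the AttributeError in the question, as far as I can tell. etree.strip_tags(parent, 'span') strips every span under that parent, not just the current one, and the span elements that the loop collected earlier are left without a parent. On the next iteration span.getparent() returns None, and calling .xpath on it fails. Here is a small check of that behaviour. It is a sketch based on my understanding of how lxml detaches stripped elements, and the variable names are only illustrative.

```python
from lxml import etree
import lxml.html

p = lxml.html.fragment_fromstring(
    '<p class="Title">blablabla <span id="xxxxx">bla</span> prprpr '
    '<span id="yyyyy"> jj </span> </p>'
)

# Collect both spans up front, the way the loop in the question does
spans = p.xpath('.//span')

# Intended to remove one span, but the tag name matches both of them
etree.strip_tags(p, 'span')

remaining = len(p.xpath('.//span'))                 # spans still in the tree
orphaned = [s.getparent() is None for s in spans]   # old references lost their parent
words = p.text.split()                              # the text itself survived

print(remaining, orphaned, words)
```

Renaming only the redundant spans to a throwaway tag first, as in the update above, gets around both problems: the sweep touches only the marked elements, and it happens after the loop has finished.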