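I have the following code:
from bs4 import BeautifulSoup
soup = BeautifulSoup(text)
for elem in soup.find_all('span', 'finereader'):
    elem.replace_with(elem.string or '')
I would like to use lxml instead, because I can't work with the indentation that BS produces. Is there equivalent code using lxml? Or how can I suppress BS's indentation?
Thanks a lot for your help :)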
Edit: BS produces output like this:
<html>
 <body>
  <table border="0" cellpadding="0" cellspacing="0" class="main" frame="box" rules="all" style="table-layout:fixed; width:324.72pt; height:518.64pt;">
   <tr class="row">
    <td class="cell" style=" width:0.00pt; height:0.00pt;" valign="top">
    </td>
    <td class="cell" style=" width:169.44pt; height:0.00pt;" valign="top">
    </td>
But I want output like this:
<html>
<body>
<table border="0" cellpadding="0" cellspacing="0" class="main" frame="box" rules="all" style="table-layout:fixed; width:324.72pt; height:518.64pt;">
<tr class="row">
<td class="cell" style=" width:0.00pt; height:0.00pt;" valign="top">
</td>
<td class="cell" style=" width:169.44pt; height:0.00pt;" valign="top">
</td>
Edit: My whole code now looks like this.
import codecs
import re
import lxml.html

output = codecs.open("test.html", "a", "utf-8")
def myfunct():
    for i in range(1, 11):
        root = lxml.html.parse('http://xyz.xy'+str(i)+'?action=source').getroot()
        for elem in root.xpath("//span[@class='finereader']"):
            text = (elem.text or "") + (elem.tail or "")
            if elem.getprevious() is not None: # If there's a previous node
                previous = elem.getprevious()
                previous.tail = (previous.tail or "") + text # append to its tail
            else:
                parent = elem.getparent() # Otherwise use the parent
                parent.text = (parent.text or "") + text # and append to its text
            elem.getparent().remove(elem)
        for empty in root.xpath('//*[self::b or self::i][not(node())]'):
            empty.getparent().remove(empty)
        tables = root.cssselect('table.main') #root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]') #
        tables = root.xpath('//table[@class="main" and not(ancestor::table[@class="main"])]')
        txt = []
        txt += ([lxml.html.tostring(t, method="html", encoding="utf-8") for t in tables])
        text = "\n".join(re.sub(r'\[:[\/]?T.*?:\]', '', el) for el in txt) #.splitlines())
        output.write(text.decode("utf-8"))
Answer 0 (score: 0)
To parse, create an lxml.etree.HTMLParser and pass it to lxml.etree.fromstring:
import lxml.etree
parser = lxml.etree.HTMLParser()
html = lxml.etree.fromstring(text, parser)
You can now select what you need with XPath:
for elem in html.xpath("//span[@class='finereader']"):
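Then, since lxml does not let you add text nodes but instead works with each node's text and tail content, we have to do a bit of magic to replace the node with a string: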
    text = (elem.text or "") + (elem.tail or "")
    if elem.getprevious() is not None: # If there's a previous node
        previous = elem.getprevious()
        previous.tail = (previous.tail or "") + text # append to its tail
    else:
        parent = elem.getparent() # Otherwise use the parent
        parent.text = (parent.text or "") + text # and append to its text
    elem.getparent().remove(elem)
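As background for the text/tail bookkeeping: in lxml, the text right after an element's opening tag is stored in that element's text, and the text that follows its closing tag is stored in its tail. The tail goes away together with the element when it is removed, which is why it has to be copied across first. A small sketch, using made-up sample markup that is not from the question:

```python
import lxml.etree

# Hypothetical sample markup, chosen only to illustrate text vs. tail.
p = lxml.etree.fromstring("<p>Hello <span>big</span> world</p>")
span = p[0]

print(repr(p.text))     # text before the first child belongs to the parent
print(repr(span.text))  # text inside the span
print(repr(span.tail))  # text after </span> is attached to the span itself
```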
Then you can use lxml.etree.tostring(html) to get the text back.
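Putting the steps together, a minimal end-to-end sketch could look like the following. The function name unwrap_finereader_spans and the sample string are made up for illustration, and the method="html" and encoding="unicode" arguments are one possible choice for getting a str back rather than bytes:

```python
import lxml.etree


def unwrap_finereader_spans(markup):
    """Replace each <span class="finereader"> with its text (illustrative helper)."""
    parser = lxml.etree.HTMLParser()
    root = lxml.etree.fromstring(markup, parser)

    # xpath() returns a plain list, so removing elements while looping is safe.
    for elem in root.xpath("//span[@class='finereader']"):
        text = (elem.text or "") + (elem.tail or "")
        previous = elem.getprevious()
        if previous is not None:
            # A preceding sibling exists: the text continues its tail.
            previous.tail = (previous.tail or "") + text
        else:
            # The span is the first child: the text continues the parent's text.
            parent = elem.getparent()
            parent.text = (parent.text or "") + text
        elem.getparent().remove(elem)

    # pretty_print defaults to False, so no indentation is added here.
    return lxml.etree.tostring(root, method="html", encoding="unicode")


# Hypothetical input, not taken from the question.
sample = '<p>a<span class="finereader">b</span>c<i>d</i><span class="finereader">e</span>f</p>'
print(unwrap_finereader_spans(sample))
```

Since tostring only re-indents when pretty_print=True is passed, the serialized markup keeps whatever whitespace the input had, which is what the question asks for.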