Running this:
hxs.select('//*[@id="column_one"]/h2/following-sibling::div[1]').extract()
gives this sample output:
<div class="OneLinkNoTx">
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div class="OneLinkNoTx">
<strong>Travel Percentage:</strong>
None
</div>
<div align="justify">
Salary: 100k
</div>
I would like the output to look like this:
<div>
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div>
<strong>Travel Percentage:</strong>
None
</div>
<div>
Salary: 100k
</div>
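I just want the HTML elements without any of their HTML attributes. Is this possible with Scrapy / XPath?
Answer 0 (score: 1)
You can use lxml's Cleaner, passing an empty safe_attrs set so that no attribute is kept.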
In [1]: import lxml.html
In [2]: import lxml.html.clean
In [3]: html = """<div class="OneLinkNoTx">
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div class="OneLinkNoTx">
<strong>Travel Percentage:</strong>
None
</div>
<div align="justify">
Salary: 100k
</div>"""
In [4]: doc = lxml.html.fromstring(html)
In [5]: clean = lxml.html.clean.Cleaner(safe_attrs=frozenset())
In [6]: clean(doc)
In [7]: print(lxml.html.tostring(doc, encoding='unicode'))
<div><div>
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div>
<strong>Travel Percentage:</strong>
None
</div>
<div>
Salary: 100k
</div></div>
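The drawback is that lxml.html.fromstring wraps the multiple top-level elements in an extra div. To avoid that, parse them as separate fragments: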
In [28]: elements = lxml.html.fragments_fromstring(html)
In [29]: for element in elements:
    ...:     clean(element)
    ...:
In [30]: print(''.join(lxml.html.tostring(el, encoding='unicode') for el in elements))
<div>
<strong>Location:</strong>
Abu Dhabi, United Arab Emirates
</div>
<div>
<strong>Travel Percentage:</strong>
None
</div>
<div>
Salary: 100k
</div>
Note that clean modifies the elements in place and returns None.
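If depending on lxml's clean module is not an option (as far as I know, recent lxml releases ship it as the separate lxml_html_clean package), the same idea can be sketched with only the standard library. This is not the approach from the answer above but a swapped-in technique: it re-emits the markup through html.parser.HTMLParser and drops the attributes of every tag. The names AttributeStripper and strip_attributes are made up for this illustration, and the sketch assumes you are fine with comments and doctype declarations being dropped and tag names being lowercased.

```python
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Hypothetical helper: re-emit HTML with all attributes removed."""

    def __init__(self):
        # convert_charrefs=False keeps references such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is deliberately ignored, which is what strips the attributes
        self.parts.append("<%s>" % tag)

    def handle_startendtag(self, tag, attrs):
        self.parts.append("<%s/>" % tag)

    def handle_endtag(self, tag):
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def strip_attributes(html):
    """Return html with every attribute removed from every tag."""
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = '<div align="justify">\nSalary: 100k\n</div>'
    print(strip_attributes(sample))
```

Because no tree is built, no wrapper div is added and the fragments come back as separate top-level elements. In a Scrapy spider, each string returned by extract() could be passed through strip_attributes.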