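I have a website that I'm trying to scrape (without really understanding HTML), but I've done a lot of reading and made some progress. It's a messy site, but the important part looks like this: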
<h1>
<b>DESCRIPTOR1: </b>
" important content "
<br>
<b>DESCRIPTOr2: </b>
" important content"
<hr>
</h1>
<b>Title1</b>
" A lot of important text"
<br>
<br>
<b>Title2</b>
"A lot of important text"
<br>
<br>
<b>Title3</b>
<br>
"1. List of text pertaining to Title3 "
<br>
"2. List of items for Title 3"
<br>
"3. the number of listed items is variable for every page"
<br>
"4. Sometimes no list at all"
<br>
<br>
<b> Next Title: </b>
....and so on
Now I can get very close to the end result I want, except when I reach Title3, where there is a <br> before Title3's content. This is how I'm approaching it:
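import lxml.html

htmltree = lxml.html.parse('sample.html')
# Select every <b> that is a direct child of the element with id="sampletext"
items = htmltree.xpath('//*[@id="sampletext"]/b')
for node in items:
    print(node.text.strip())  # the title inside <b>...</b>
    print(node.tail)          # the text immediately after </b>, if any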
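Now my two problems are: (1) I can't strip the whitespace from the .tail values, and (2) I get None back for Title3, because there is no .tail text before the next element, which is a <br>. Ideally, I could grab all the text between element tags until I reach the next identifier tag, in this case <b>. Hope that makes sense. Any pointers?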
Answer 0 (score: 2)
You can try the following XPath expression:
for item in items:
    # Non-blank text nodes after this <b> whose nearest preceding <b> is this one
    result = item.xpath('following-sibling::text()[normalize-space()][preceding-sibling::b[1] = $b]', b=item)
    print([r.strip() for r in result])
Output when tested against the HTML snippet in question:
['" A lot of important text"']
['"A lot of important text"']
['"1. List of text pertaining to Title3 "', '"2. List of items for Title 3"', '"3. the number of listed items is variable for every page"', '"4. Sometimes no list at all"']
[]
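A brief explanation of the XPath:
following-sibling::text()[normalize-space()]: finds the following-sibling text nodes that are not empty (not whitespace-only)...
[preceding-sibling::b[1] = $b]: ...and whose nearest preceding sibling b element equals $b. $b is an XPath parameter, which is replaced by the current item in the code above. This is done by passing a keyword argument (b=item) to the xpath() method.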
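For anyone who wants to try this without the original page, here is a minimal, self-contained sketch. The inline HTML is a made-up stand-in for the real site (the div with id="sampletext" is assumed from the XPath in the question), and the `sections` dictionary is just an illustrative way to collect the results, not part of the original code:

```python
import lxml.html

# Hypothetical stand-in for the real page; only the structure matters.
SAMPLE = """
<div id="sampletext">
<b>Title1</b> A lot of important text
<br><br>
<b>Title2</b> More important text
<br><br>
<b>Title3</b>
<br>
1. first item
<br>
2. second item
<br><br>
<b>Next Title: </b>
</div>
"""

root = lxml.html.fromstring(SAMPLE)

sections = {}
for b in root.xpath('//*[@id="sampletext"]/b'):
    # Non-blank text nodes after this <b> whose nearest preceding <b>
    # matches the current one (passed in as the XPath variable $b).
    texts = b.xpath(
        'following-sibling::text()[normalize-space()]'
        '[preceding-sibling::b[1] = $b]',
        b=b,
    )
    sections[b.text.strip()] = [t.strip() for t in texts]

for title, texts in sections.items():
    print(title, texts)
```

One caveat: in XPath 1.0 the `=` comparison between two node-sets compares their string values, not node identity. So if two <b> titles on the same page have exactly the same text, their content could get mixed together, and you would need a different way to tell the sections apart.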