I have this HTML:
<div id="General" class="detailOn">
<div class="tabconstraint"></div>
<div id="InstitutionMain" class="detailseparate">
<div id="InstitutionMain_divINFORight" style="float:right;width:40%"></div>
<div style="font-weight:bold;padding-top:6px">Special Learning Opportunities</div>
Distance learning opportunities<br>
<div style="font-weight:bold;padding-top:6px">Student Services</div>
Remedial services<br>
Academic/career counseling service<br>
<div style="font-weight:bold;padding-top:6px">Credit Accepted</div>
Dual credit<br>
Credit for life experiences<br>
</div>
</div>
I want to extract the text() nodes between [div/text() = "Special Learning Opportunities"] and [div/text() = "Student Services"] (here, "Distance learning opportunities"), and likewise for the other divs.
I tried this expression, but it gives me all the text after the matched div:
div[1]/div[contains(text(),"Special Learning Opportunities")]/following-sibling::text()
while this one gives me all the text before the div:
div[1]/div[contains(text(),"Student Services")]/preceding-sibling::text()
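Is there a way to get exactly the text between two given divs? Thanks in advance.
I am using Python 2.x and Scrapy for the scraping.
Note: my current approach uses these three XPath expressions: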
item['SLO']=site.select('div[1]/div[contains(text(),"Special Learning Opportunities")]/following-sibling::text()').extract()
item['SS']=site.select('div[1]/div[contains(text(),"Student Services")]/following-sibling::text()').extract()
item['CA']=site.select('div[1]/div[contains(text(),"Credit Accepted")]/following-sibling::text()').extract()
which gives me three items like this:
item['SLO']=['Distance learning opportunities','Remedial services',' Academic/career counseling service','Dual credit','Credit for life experiences']
item['SS']=['Remedial services',' Academic/career counseling service','Dual credit','Credit for life experiences']
item['CA']=['Dual credit','Credit for life experiences']
I then post-process the Python lists to get what I want,
but I think there should be a more direct way to do this in XPath.
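For reference, the list post-processing mentioned above might look like the following minimal sketch. It assumes each extracted list ends with exactly the items of the next section's list, as in the output shown; the helper name `trim_suffix` is hypothetical and not part of Scrapy.

```python
def trim_suffix(items, tail):
    """Return a copy of items without the trailing run equal to tail."""
    if tail and items[-len(tail):] == tail:
        return items[:-len(tail)]
    return list(items)

# The three cumulative lists, copied from the extraction output above.
slo_all = ['Distance learning opportunities', 'Remedial services',
           ' Academic/career counseling service', 'Dual credit',
           'Credit for life experiences']
ss_all = ['Remedial services', ' Academic/career counseling service',
          'Dual credit', 'Credit for life experiences']
ca_all = ['Dual credit', 'Credit for life experiences']

# Each section is its cumulative list minus the next section's list.
sections = {
    'SLO': trim_suffix(slo_all, ss_all),
    'SS': trim_suffix(ss_all, ca_all),
    'CA': list(ca_all),
}
```

This works in both Python 2 and Python 3, but it breaks down if an item is repeated in a way that makes the suffix match ambiguous, which is one reason to prefer doing the selection in XPath.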
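Answer 0 (score: 4)
You can translate "the text between a and b" directly into XPath as "text()[preceding-sibling = a and following-sibling = b]", where a and b are the nearest div siblings on either side of the text node.
That is: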
//text()[(preceding-sibling::div[1]/text() = "Special Learning Opportunities") and (following-sibling::div[1]/text() = "Student Services")]
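should work.
(It failed when I tested it, but that appears to be a bug in my XPath interpreter.)
Answer 1 (score: 2)
Here you go. Not as elegant as the previous answer, but hey, at least it works! :-)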
div[1]//div[contains(text(),"Special Learning Opportunities")]/following-sibling::node()[position() <= count( div[1]//div[contains(text(),"Student Services")]/following-sibling::node()) + 1]
Answer 2 (score: 1)
You can try this:
//div[contains(text(),"Special Learning Opportunities")]//following-sibling::text()[./following-sibling::div[contains(text(),'Student Services')]]
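
If none of the expressions above behaves as expected in a given XPath engine, the same grouping can be done without XPath. The sketch below uses only the Python 3 standard library (`html.parser`; in Python 2 the module is named `HTMLParser`) and assigns each run of text to the nearest preceding heading div. It assumes headings are exactly the divs whose `style` contains `font-weight:bold`, as in the question's HTML; the class name `SectionParser` is hypothetical.

```python
from html.parser import HTMLParser

HTML = """<div id="General" class="detailOn">
<div class="tabconstraint"></div>
<div id="InstitutionMain" class="detailseparate">
<div id="InstitutionMain_divINFORight" style="float:right;width:40%"></div>
<div style="font-weight:bold;padding-top:6px">Special Learning Opportunities</div>
Distance learning opportunities<br>
<div style="font-weight:bold;padding-top:6px">Student Services</div>
Remedial services<br>
Academic/career counseling service<br>
<div style="font-weight:bold;padding-top:6px">Credit Accepted</div>
Dual credit<br>
Credit for life experiences<br>
</div>
</div>"""


class SectionParser(HTMLParser):
    """Group text nodes under the nearest preceding bold heading div."""

    def __init__(self):
        super().__init__()
        self.sections = {}       # heading text -> list of item strings
        self.current = None      # heading the parser is currently under
        self.in_heading = False  # True while inside a bold heading div

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get('style') or ''
        if tag == 'div' and 'font-weight:bold' in style:
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace-only text nodes
        if self.in_heading:
            self.current = text
            self.sections[text] = []
        elif self.current is not None:
            self.sections[self.current].append(text)


parser = SectionParser()
parser.feed(HTML)
sections = parser.sections
```

Here `sections['Student Services']` holds only the two items between that heading and "Credit Accepted", with surrounding whitespace stripped.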