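I have been using Selenium and Python to scrape a web page, and I am having trouble collecting the data I want from a div with the following structure: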
<div class="col span_6" style="margin-left: 12px;width: 47% !important;">
<div class="MainGridRow">
<span class="MainGridcolumn1">Heading1</span>
<span class="MainGridcolumn2">Text that I want</span>
</div>
<div class="MainGridRow">
<span class="MainGridcolumn1">Another heading</span>
<span class="MainGridcolumn2">More text that I want</span>
</div>
<div class="MainGridRow">
<span class="MainGridcolumn1">Next heading</span>
<span class="MainGridcolumn2">Even more text</span>
</div>
<div class="MainGridRow">
<span class="MainGridcolumn1">Yet another heading</span>
<span class="MainGridcolumn2">Piece of text</span>
</div>
</div>
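The div has many rows, each with 2 columns, and the data/text sits inside span tags. There are no CSS IDs.
I am only interested in collecting the text contained in the spans with class 'MainGridcolumn2'.
I tried the following to navigate to the first heading, intending to then use 'following_sibling' to move to the next span tag containing the text, but I cannot even get that far: when I try to print the result to the console, it returns no text: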
driver.find_element_by_xpath("//span['@class=MainGridcolumn1'][contains(text(), 'Heading1')]").text
and
driver.find_element_by_xpath("//span[contains(text(), 'Heading1')]").text
Answer 0 (score: 0)
One way is to get the enclosing div, i.e. the grandparent of the heading span, and pull the spans out of it:
h = """<div class="col span_6" style="margin-left: 12px;width: 47% !important;">
<div class="MainGridRow">
<span class="MainGridcolumn1">Heading1</span>
<span class="MainGridcolumn2">Text that I want</span>
</div>
<div class="MainGridRow">
<span class="MainGridcolumn1">Another heading</span>
<span class="MainGridcolumn2">More text that I want</span>
</div>
<div class="MainGridRow">
<span class="MainGridcolumn1">Next heading</span>
<span class="MainGridcolumn2">Even more text</span>
</div>
<div class="MainGridRow">
<span class="MainGridcolumn1">Yet another heading</span>
<span class="MainGridcolumn2">Piece of text</span>
</div>
</div>
<div class="MainGridRow">
<span class="MainGridcolumn1">Yet another heading</span>
<span class="MainGridcolumn2">Piece of text I don't want</span>
</div>"""
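from lxml import html
# Parse the markup held in h into an element tree.
xm = html.fromstring(h)
# Find the 'Heading1' span, then step up twice (span -> row div -> enclosing div).
div = xm.xpath("//span[@class='MainGridcolumn1'][contains(text(), 'Heading1')]/../..")[0]
# Collect the text of every MainGridcolumn2 span inside that enclosing div only.
print(div.xpath(".//span[@class='MainGridcolumn2']/text()"))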
Which will give you:
['Text that I want', 'More text that I want', 'Even more text', 'Piece of text']
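The question mentions wanting to move from a heading to the span next to it. A minimal sketch of that idea with lxml, assuming the same class names as in the question, is shown here. The names `snippet`, `tree`, `value`, and `pairs` are only illustrative, and the markup is shortened to two rows.

```python
from lxml import html

# Same structure as in the question, shortened to two rows for brevity.
snippet = """<div class="col span_6">
<div class="MainGridRow">
<span class="MainGridcolumn1">Heading1</span>
<span class="MainGridcolumn2">Text that I want</span>
</div>
<div class="MainGridRow">
<span class="MainGridcolumn1">Another heading</span>
<span class="MainGridcolumn2">More text that I want</span>
</div>
</div>"""

tree = html.fromstring(snippet)

# From the heading span, step sideways to the value span in the same row.
value = tree.xpath(
    "//span[@class='MainGridcolumn1'][contains(text(), 'Heading1')]"
    "/following-sibling::span[@class='MainGridcolumn2']/text()"
)
print(value)  # ['Text that I want']

# Pair every heading with its value, one row at a time.
pairs = {}
for row in tree.xpath("//div[@class='MainGridRow']"):
    heading = row.xpath("string(span[@class='MainGridcolumn1'])")
    text = row.xpath("string(span[@class='MainGridcolumn2'])")
    pairs[heading] = text
print(pairs)
```

Keying the values by heading can be handy if only some rows are needed later, at the cost of losing rows whose headings repeat.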
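You can also select the parent and get the parent's siblings:
from lxml import html
xm = html.fromstring(h)
# Step up once from the 'Heading1' span to its row div.
div = xm.xpath("//span[@class='MainGridcolumn1'][contains(text(), 'Heading1')]/..")[0]
# Union of the value span in this row and the value spans in the sibling rows after it.
print(div.xpath(".//span[@class='MainGridcolumn2']/text() | following-sibling::div/span[@class='MainGridcolumn2']/text()"))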
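The lxml snippets work on a string of markup, while the question uses Selenium. One way to connect the two, sketched here under assumptions, is to hand the page's HTML to lxml. The helper name `column2_texts` and the `sample_source` string are hypothetical. The sample stands in for Selenium's `driver.page_source` so that the sketch runs without a browser.

```python
from lxml import html


def column2_texts(page_source, first_heading):
    """Return the MainGridcolumn2 texts from the div that encloses first_heading."""
    tree = html.fromstring(page_source)
    # Locate the heading span and step up to its grandparent, as in the first approach.
    # $heading is an XPath variable that lxml fills from the keyword argument.
    containers = tree.xpath(
        "//span[@class='MainGridcolumn1'][contains(text(), $heading)]/../..",
        heading=first_heading,
    )
    if not containers:
        return []
    texts = containers[0].xpath(".//span[@class='MainGridcolumn2']/text()")
    return [t.strip() for t in texts]


# Stand-in for driver.page_source (assumption: the real page has this structure).
sample_source = """<html><body>
<div class="col span_6">
<div class="MainGridRow">
<span class="MainGridcolumn1">Heading1</span>
<span class="MainGridcolumn2">Text that I want</span>
</div>
<div class="MainGridRow">
<span class="MainGridcolumn1">Another heading</span>
<span class="MainGridcolumn2">More text that I want</span>
</div>
</div>
<div class="MainGridRow">
<span class="MainGridcolumn1">Yet another heading</span>
<span class="MainGridcolumn2">Piece of text I don't want</span>
</div>
</body></html>"""

print(column2_texts(sample_source, "Heading1"))
```

With a live session the call would presumably look like `column2_texts(driver.page_source, "Heading1")`. If you stay inside Selenium instead, keep in mind that a WebElement's `.text` only returns text that is rendered as visible, which is one possible reason for an empty result; `get_attribute("textContent")` returns the text regardless of visibility.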