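Selecting elements by XPath

Time: 2020-01-15 13:13:30

Tags: python xpath web-scraping

I have this page: https://www.punters.com.au/form-guide/2020-01-14/

It lists meeting names such as Spendthrift Australia Park, Dalby, and so on. I want a way to extract the races for a specific country. For example, my script should get the races in Australia, or in any other country. But I don't know how to write a correct XPath for these races, because the number of races is different every time. I only need the correct XPath.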

from selenium import webdriver

country = input('Enter country name (ex Australia, New Zealand..): ')
driver = webdriver.Chrome()
driver.get("https://www.punters.com.au/form-guide/2020-01-14/")
for i in driver.find_elements_by_xpath("//tr[./td/img[@title='{}']]//following-sibling::tr/td[@class='upcoming-race__td upcoming-race__meeting-name upcoming-races__show-pdfs']//following-sibling::td[1]/a".format(country)):
    print(i.text)

driver.close()

2 answers:

Answer 0: (score: 1)

If you want to select the target nodes with a single "magic XPath", here it is:

from selenium import webdriver

country = 'South Africa'
driver = webdriver.Chrome()
driver.get("https://www.punters.com.au/form-guide/2020-01-14/")

xpath = f"//tr[preceding-sibling::tr[contains(@class, 'upcoming-race__row--country')]/td/img[@title='{country}']][position()<=(count(//tr[preceding-sibling::tr[contains(@class, 'upcoming-race__row--country')]/td/img[@title='{country}']])-count(//tr[preceding-sibling::tr[preceding-sibling::tr[contains(@class, 'upcoming-race__row--country')]/td/img[@title='{country}'] and contains(@class, 'upcoming-race__row--country')][1]])-count(//tr[preceding-sibling::tr[contains(@class, 'upcoming-race__row--country')]/td/img[@title='{country}'] and contains(@class, 'upcoming-race__row--country')][1]))]/td[1]"
found_nodes = driver.find_elements_by_xpath(xpath)

driver.close()

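Let's describe what this XPath does, using New Zealand as the example.

I will give aliases to blocks of the XPath to make the overall idea easier to read.

1. The first part finds the starting point: the rows that come after the New Zealand header row (aliased as TARGET_XPATH):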

`//tr[preceding-sibling::tr[contains(@class, 'upcoming-race__row--country')]/td/img[@title='New Zealand']]`

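2. Now we need to limit the result to a single country. The best option I know of for this case is the position() function. We have to supply the position of the last useful element in the result (the one just before the first "trash" row). Let's compute it: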

`(count(//tr[preceding-sibling::tr[contains(@class, 'upcoming-race__row--country')]/td/img[@title='New Zealand']])-count(//tr[preceding-sibling::tr[preceding-sibling::tr[contains(@class, 'upcoming-race__row--country')]/td/img[@title='New Zealand'] and contains(@class, 'upcoming-race__row--country')][1]])-count(//tr[preceding-sibling::tr[contains(@class, 'upcoming-race__row--country')]/td/img[@title='New Zealand'] and contains(@class, 'upcoming-race__row--country')][1]))`

We count three kinds of elements here:

a. The number of nodes after the country header (named COUNT_TOTALS):

`count(//tr[preceding-sibling::tr[contains(@class, 'upcoming-race__row--country')]/td/img[@title='New Zealand']])`

b. The number of nodes after the first "trash" header node, that is, the next country header (named COUNT_AFTER_TRASHY_HEADER):

`count(//tr[preceding-sibling::tr[preceding-sibling::tr[contains(@class, 'upcoming-race__row--country')]/td/img[@title='New Zealand'] and contains(@class, 'upcoming-race__row--country')][1]])`

c. We also have to check whether a "trash" header node exists at all, because when we search for the last country in the table there is no following "trash" node (named COUNT_TRASHY_HEADER):

`count(//tr[preceding-sibling::tr[contains(@class, 'upcoming-race__row--country')]/td/img[@title='New Zealand'] and contains(@class, 'upcoming-race__row--country')][1])`

3. Use our counts as the filter:

`TARGET_XPATH[position()<=(COUNT_TOTALS - COUNT_AFTER_TRASHY_HEADER - COUNT_TRASHY_HEADER)]`
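The aliases above can also be written out in Python, which makes the long expression easier to check. The following is a minimal sketch, not part of the original answer: the function name `build_country_xpath` and the helper variable names are hypothetical, and it assumes the country name contains no single quote. It only assembles the string; it does not contact the site.

```python
# Class test for a country header row, copied from the XPath in the answer.
IS_COUNTRY_ROW = "contains(@class, 'upcoming-race__row--country')"


def build_country_xpath(country: str) -> str:
    """Assemble the answer's XPath from its named parts (hypothetical helper).

    Assumes `country` contains no single quote, since it is placed
    inside a single-quoted XPath string literal.
    """
    # "A preceding sibling is the header row of the requested country."
    after_header = f"preceding-sibling::tr[{IS_COUNTRY_ROW}]/td/img[@title='{country}']"
    # "This row follows the requested header AND is itself a country header."
    next_header = f"{after_header} and {IS_COUNTRY_ROW}"

    target_xpath = f"//tr[{after_header}]"
    count_totals = f"count({target_xpath})"
    count_after_trashy_header = f"count(//tr[preceding-sibling::tr[{next_header}][1]])"
    count_trashy_header = f"count(//tr[{next_header}][1])"

    last_useful = f"{count_totals}-{count_after_trashy_header}-{count_trashy_header}"
    # Keep only the first cell (the meeting name) of each remaining row.
    return f"{target_xpath}[position()<=({last_useful})]/td[1]"


if __name__ == "__main__":
    print(build_country_xpath("New Zealand"))
```

The returned string can be passed to `driver.find_elements_by_xpath(...)` exactly as in the snippet at the top of this answer.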

Answer 1: (score: 0)

Let's try it this way (for Australia only):

from selenium import webdriver
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://www.punters.com.au/form-guide/2020-01-14/")

tabs = driver.find_elements_by_xpath('//table')
rows = []
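# NOTE: position()<5 is hard-coded, so only the first four rows after the Australia header are kept
for i in tabs[0].find_elements_by_xpath("//tr[./td/img[@title='Australia']]/following-sibling::tr[position()<5]"):
    row = []
    # collect the text of every cell in this row
    for dat in i.find_elements_by_xpath('.//td'):
        row.append(dat.text)
    rows.append(row)
pd.DataFrame(rows)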

Output (excuse the formatting):

   0                           1       2        3      4      5       6        7       8      9       10
0  Spendthrift Australia Park  ABD     ABD      ABD    ABD    ABD     ABD      ABD     ABD
1  Dalby                       6,2     3,2      8,9,4  8,4,7  10,5,1  ABD      8,9,6   3,6,4  11,9,1  6,1,5
2  Corowa                      3,1,4   6,4,3    2,4    2,1,5  2,7,9   12,2,6   3,1,6
3  Scone                       14,9,6  10,1,18  5,3,1  7,2,6  12,6,8  12,2,10  12,7,2
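Because the number of rows per country changes from day to day, the fixed `position()<5` only fits this particular table. One possible way around that is to walk the rows in order and file each one under the most recent country header. The sketch below does this with the standard-library `html.parser`. It assumes the row layout implied by the XPath expressions in this thread: header rows carry the class `upcoming-race__row--country` and an `img` whose `title` is the country, and the meeting name is in the first cell of each following row. The class `MeetingsByCountry`, the function `meetings_by_country`, and the sample markup (including the meeting name "Example Park") are hypothetical.

```python
from html.parser import HTMLParser

# Class name taken from the XPath expressions in this thread (assumed to mark header rows).
COUNTRY_ROW_CLASS = "upcoming-race__row--country"


class MeetingsByCountry(HTMLParser):
    """Group first-cell text of each row under the nearest preceding country header."""

    def __init__(self):
        super().__init__()
        self.meetings = {}           # country -> list of meeting names
        self._country = None         # country of the most recent header row
        self._in_header = False      # True while inside a country header row
        self._cell_index = -1        # index of the current <td> within its row
        self._in_first_cell = False  # True while inside the first <td> of a race row
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self._in_header = COUNTRY_ROW_CLASS in (attrs.get("class") or "")
            self._cell_index = -1
        elif tag == "td":
            self._cell_index += 1
            self._in_first_cell = self._cell_index == 0 and not self._in_header
            self._text = []
        elif tag == "img" and self._in_header and attrs.get("title"):
            self._country = attrs["title"]
            self.meetings.setdefault(self._country, [])

    def handle_data(self, data):
        if self._in_first_cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_first_cell:
            name = "".join(self._text).strip()
            if name and self._country is not None:
                self.meetings[self._country].append(name)
            self._in_first_cell = False


def meetings_by_country(html):
    """Return {country: [meeting names]} for the given HTML text."""
    parser = MeetingsByCountry()
    parser.feed(html)
    parser.close()
    return parser.meetings


# Hypothetical sample markup that mimics the assumed structure of the page.
SAMPLE = """
<table>
  <tr class="upcoming-race__row upcoming-race__row--country"><td><img title="Australia"></td></tr>
  <tr><td>Dalby</td><td>6,2</td></tr>
  <tr><td>Corowa</td><td>3,1,4</td></tr>
  <tr class="upcoming-race__row upcoming-race__row--country"><td><img title="New Zealand"></td></tr>
  <tr><td>Example Park</td><td>1,2,3</td></tr>
</table>
"""

if __name__ == "__main__":
    print(meetings_by_country(SAMPLE))
```

With Selenium, the same function could be fed `driver.page_source`, for example `meetings_by_country(driver.page_source).get("Australia", [])`, provided the live page really follows the assumed layout.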