Question

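Setup

I need to get the population data for all NUTS3 regions on this Wikipedia page.

I already have the URLs for every NUTS3 region and will have Selenium loop over them to get each region's population as shown on its page.

That is, for each region I need the population shown in its infobox geography vcard element. For example, for this region, the population would be 591680.

Code

Before writing the loop, I am trying to get the population of a single region: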

url = 'https://en.wikipedia.org/wiki/Arcadia'

browser.get(url)

vcard_element = browser.find_element_by_css_selector('#mw-content-text > div > table.infobox.geography.vcard').find_element_by_xpath('tbody')

for row in vcard_element.find_elements_by_xpath('tr'):

    try:
        if 'Population' in row.find_element_by_xpath('th').text:
            print(row.find_element_by_xpath('th').text)
    except Exception:
        pass

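Problem

The code works. That is, it prints the row that contains the word "Population".

Question: how do I tell Selenium to get the next row, the one that contains the actual population number?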

Answer 1

Use ./following::tr[1] or ./following-sibling::tr[1]:

from selenium import webdriver

url = 'https://en.wikipedia.org/wiki/Arcadia'
browser = webdriver.Chrome()
browser.get(url)

vcard_element = browser.find_element_by_css_selector('#mw-content-text > div > table.infobox.geography.vcard').find_element_by_xpath('tbody')

for row in vcard_element.find_elements_by_xpath('tr'):

    try:
        if 'Population' in row.find_element_by_xpath('th').text:
            print(row.find_element_by_xpath('th').text)
            print(row.find_element_by_xpath('./following::tr[1]').text)  # text of the whole next row
            print(row.find_element_by_xpath('./following::tr[1]/td').text)  # only the number
    except Exception:
        pass

Output on the console:

Population (2011)
 • Total 86,685
86,685
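Since the value comes back as text with a thousands separator, it may be convenient to convert it to an integer before collecting it across regions. The following is only a minimal sketch: parse_population is a hypothetical helper name, not part of Selenium, and it assumes the first run of digits and commas in the string is the population. It is meant for the td text, because passing the header row "Population (2011)" would return the year instead.

```python
import re


def parse_population(text):
    """Return the first number found in `text` as an int, or None if there is none."""
    # Match a digit followed by any mix of digits and thousands separators.
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    # Drop the separators before converting.
    return int(match.group().replace(',', ''))


print(parse_population('86,685'))
print(parse_population(' • Total 86,685'))
```

Both calls print 86685.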

Answer 2

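While you can certainly do this with Selenium, I would personally recommend using requests and lxml, since they are much lighter than Selenium and do the job well. I found that the following works for the few regions I tested: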

import requests
from lxml import html

try:
    # Fetch the page and collect every row of the infobox table.
    response = requests.get(url)
    infocard_rows = html.fromstring(response.content).xpath("//table[@class='infobox geography vcard']/tbody/tr")
except Exception:
    print('Error retrieving information from ' + url)

try:
    population_row = 0
    for i in range(len(infocard_rows)):
        # The number sits in the row right after the 'Population' header row.
        if infocard_rows[i].findtext('th') == 'Population':
            population_row = i+1
            break
    population = infocard_rows[population_row].findtext('td')
except Exception:
    print('Unable to find population')

Essentially, html.fromstring().xpath() fetches all the rows from the infobox geography vcard table at that path. The next try/except then looks for the th whose inner text is Population and extracts the text from the td in the following row (that is, the population number).

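Hope this helps, even though it is not the Selenium solution you asked for! You would usually use Selenium if you want to recreate browser behaviour or inspect JavaScript elements. Of course, you can also use it here.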
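To illustrate the "find the header row, then read the next row" idea without a network call, here is a small sketch that uses only the standard library's xml.etree.ElementTree on a made-up, well-formed snippet. SAMPLE and population_text are hypothetical names for illustration, and the snippet is an assumption about the infobox layout, not a copy of the real page. Unlike the exact comparison above, it matches headers that merely start with "Population", such as "Population (2011)".

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for an infobox; real Wikipedia markup is larger and messier.
SAMPLE = """
<div>
  <table class="infobox geography vcard">
    <tbody>
      <tr><th>Country</th><td>Greece</td></tr>
      <tr><th>Population (2011)</th></tr>
      <tr><th> • Total</th><td>86,685</td></tr>
    </tbody>
  </table>
</div>
"""


def population_text(markup):
    """Return the td text of the row following the 'Population' header row, or None."""
    root = ET.fromstring(markup)
    rows = root.findall(".//table[@class='infobox geography vcard']/tbody/tr")
    # Skip the last row: it has no following row to read from.
    for i, row in enumerate(rows[:-1]):
        header = (row.findtext('th') or '').strip()
        if header.startswith('Population'):
            return rows[i + 1].findtext('td')
    return None


print(population_text(SAMPLE))
```

Returning None when no matching row exists lets the caller tell "not found" apart from an error, which avoids silently reading the wrong row.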
