How to extract text from HTML span tags using Selenium and Python

Time: 2018-08-17 17:04:28

Tags: python selenium selenium-webdriver xpath web-scraping

I am trying to get the text between the span/div tags in the following three snippets.

<span class="engagementInfo-valueNumber js-countValue">496.26K</span>

<div class="websiteRanks-valueContainer js-websiteRanksValue">
        <span class="websiteRanks-valueChange websiteRanks-valueChange--isSingleMode websiteRanks-valueChange--up"></span>
    180
</div>

<span class="websitePage-relativeChangeNumber">16.35%</span>

When I copy the XPath, I get:

/html/body/div[1]/main/div/div/div[2]/div[2]/div[1]/div[3]/div/div/div/div[2]/div/span[2]/span[2]/span

And copying the selector gives:

body > div.wrapper-body.wrapperBody--websiteAnalysis.js-wrapperBody > main > div > div > div.analysisPage-section.analysisPage-section--withFeedback.websitePage-overview.js-section.js-showInCompare.is-active.js-triggered > div.analysisPage-sectionContent.analysisPage-sectionVisits.js-sectionContent.js-print-pageFooter.is-triggered > div.u-clearfix.analysisPage-sectionOverview > div.websitePage-mobileFramed.websitePage-mobileFramed--overview > div > div > div > div:nth-child(2) > div > span.engagementInfo-value.engagementInfo-value--large.u-text-ellipsis > span.engagementInfo-valueRelative.websitePage-relativeChange.websitePage-relativeChange--delay.websitePage-relativeChange--up.js-showOnCount.is-shown > span

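In the end, I would like to have 496.26K, 180 and 16.35%, either as separate values or as elements of a list.

I tried the following without success, although the same approach has worked for me on other websites: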

url = 'https://www.similarweb.com/website/' + domain
driver.get(url) #get response
driver.implicitly_wait(2) #wait to load content
total_vists = driver.find_element_by_xpath(xpath='/html/body/div[1]/main/div/div/section[2]/div/ul/li[1]/div[2]').text

2 Answers:

Answer 0: (score: 1)

You can try a CSS selector for the first span.

To extract 496.26K:

first_span = driver.find_element_by_css_selector("span.engagementInfo-valueNumber.js-countValue").text  

print(first_span)  

To extract 180:

# 180 is a text node of the div; the inner span is empty, so select the div
second_span = driver.find_element_by_css_selector("div.websiteRanks-valueContainer.js-websiteRanksValue")

print(second_span.text)  

To extract 16.35%:

third_span= driver.find_element_by_css_selector("span.websitePage-relativeChangeNumber")  

print(third_span.text) 
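
For the 180 value it helps to see where the text sits in the markup. The following is a minimal offline sketch that uses Python's standard xml.etree.ElementTree rather than Selenium; the names SNIPPET, span_text, tail_text and div_text are illustrative only. It parses the div from the question and shows that the number is a text node of the div that follows the empty span, so a lookup has to target the div rather than the span.

```python
import xml.etree.ElementTree as ET

# The <div> markup from the question, copied verbatim.
SNIPPET = """<div class="websiteRanks-valueContainer js-websiteRanksValue">
        <span class="websiteRanks-valueChange websiteRanks-valueChange--isSingleMode websiteRanks-valueChange--up"></span>
    180
</div>"""

div = ET.fromstring(SNIPPET)
span = div.find("span")

# Text inside the <span> itself: there is none.
span_text = span.text or ""
# Text that follows </span> but still belongs to the <div>.
tail_text = (span.tail or "").strip()
# All text found anywhere under the <div>.
div_text = "".join(div.itertext()).strip()

print(repr(span_text), repr(tail_text), repr(div_text))
```

In Selenium, an element's .text covers the visible text of its descendants as well, so reading .text on the div should give 180, while the span on its own has no text to return.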

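Answer 1: (score: 0)

Judging by the HTML you shared, the elements are rendered by JavaScript, so you need to induce WebDriverWait for the elements to become visible. You can use the following solution: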

  • 496.26K

    engagementInfo = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//span[@class='engagementInfo-valueNumber js-countValue']"))).get_attribute("innerHTML")
    
  • 180

    websiteRanks = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='websiteRanks-valueContainer js-websiteRanksValue']")))
    websiteRanksText = driver.execute_script('return arguments[0].lastChild.textContent;', websiteRanks).strip()
    

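  • 16.35%

    Use the same wait as for 496.26K, with the class websitePage-relativeChangeNumber in the XPath.

Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC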
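
The question asks for the three values as a list. Here is a minimal sketch of one way to gather them, assuming the CSS selectors below (derived from the HTML in the question) still match the page. The names SELECTORS and collect_texts are hypothetical. The helper relies only on driver.find_element(by, value) and passes the string "css selector", which is the value of Selenium's By.CSS_SELECTOR.

```python
# Locators assumed from the HTML shown in the question.
SELECTORS = [
    "span.engagementInfo-valueNumber.js-countValue",         # 496.26K
    "div.websiteRanks-valueContainer.js-websiteRanksValue",  # 180
    "span.websitePage-relativeChangeNumber",                 # 16.35%
]


def collect_texts(driver, selectors=SELECTORS, by="css selector"):
    """Return the stripped text of the first element matching each selector.

    `by` defaults to "css selector", the string value of Selenium's
    By.CSS_SELECTOR, so this sketch needs no Selenium import of its own.
    """
    return [driver.find_element(by, sel).text.strip() for sel in selectors]
```

With a real driver, calling values = collect_texts(driver) once the page has rendered (for example after an explicit wait) would be expected to return the three strings in order. If an element is missing, find_element raises an exception, which this sketch does not handle.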