Scraping a specific table with Selenium

Date: 2016-09-19 08:05:07

Tags: python selenium xpath web-scraping

I am trying to find a table inside a div on a page.

Basically, what I have tried so far is locating that table with Selenium through an XPath.

If I run the script with a parameter like "stackoverflow", I should be able to scrape this site: https://www.google.us/trends/explore?date=today%203-m&geo=US&q=stackoverflow

Apparently the XPath I used there isn't working; the program doesn't print anything, the output is just blank.

What I basically need are the values of the chart shown on that site. The values (and the dates) are in a table; here is a screenshot:

[screenshot: table of dates and values]

Could you help me find the correct XPath for the table so I can retrieve those values with Selenium in Python?

Thanks in advance!

1 Answer:

Answer 0 (score: 2):

You can use the following XPath:

//div[@class="line-chart"]/div/div[1]/div/div/table/tbody/tr

Here I have refined my answer and made a few changes to the code, and now it works:

# NOTE: Download the chromedriver executable
# and move it to C:\Python27\Scripts
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import sys
from lxml.html import fromstring,tostring

driver = webdriver.Chrome()
driver.implicitly_wait(20)
'''
URL_start = "http://www.google.us/trends/explore?"
date = '&date=today%203-m' # Last 90 days
location = "&geo=US"
symbol = sys.argv[1]
query = 'q='+symbol
URL = URL_start+query+date+location
'''
driver.get("https://www.google.us/trends/explore?date=today%203-m&geo=US&q=stackoverflow")

# rows of the data table rendered inside the line-chart div
table_trs = driver.find_elements_by_xpath('//div[@class="line-chart"]/div/div[1]/div/div/table/tbody/tr')

for tr in table_trs:
    #print tr.get_attribute("innerHTML").encode("UTF-8")

    # each data row has exactly two cells: the date and the value
    td = tr.find_elements_by_xpath(".//td")
    if len(td)==2:
        print td[0].get_attribute("innerHTML").encode("UTF-8") +"\t"+td[1].get_attribute("innerHTML").encode("UTF-8")
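
For reference, the find_elements_by_xpath helpers used above were removed in Selenium 4, so on a current Selenium / Python 3 setup the same approach would look roughly like the sketch below (it keeps the XPath from this answer and assumes the page still renders the data table with the same structure and class name):

# Minimal Selenium 4 / Python 3 sketch of the same approach.
# Assumption: the line-chart div and its data table are still present on the page.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

ROW_XPATH = '//div[@class="line-chart"]/div/div[1]/div/div/table/tbody/tr'

driver = webdriver.Chrome()  # Selenium Manager (4.6+) fetches chromedriver automatically
try:
    driver.get("https://www.google.us/trends/explore?date=today%203-m&geo=US&q=stackoverflow")

    # wait explicitly for the table rows instead of relying on an implicit wait
    rows = WebDriverWait(driver, 20).until(
        EC.presence_of_all_elements_located((By.XPATH, ROW_XPATH))
    )

    for tr in rows:
        cells = tr.find_elements(By.XPATH, ".//td")
        if len(cells) == 2:  # keep only date / value pairs
            print(cells[0].get_attribute("innerHTML") + "\t" + cells[1].get_attribute("innerHTML"))
finally:
    driver.quit()

The explicit WebDriverWait makes the script wait until the rows are actually rendered before reading them, which tends to be more reliable than an implicit wait for charts that are loaded dynamically.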