Scraping data from interactive charts

Time: 2016-09-19 10:41:52

Tags: python json selenium

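There is a website with several interactive charts that I would like to extract data from. I have written a couple of web scrapers in Python using Selenium WebDriver before, but this seems to be a different problem. I looked at a few similar questions on Stack Overflow. From those, it seems the solution could be to download the data directly from a JSON file. I looked at the website's source code and identified several JSON files, but on inspection they do not seem to contain the data.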

Does anyone know how to download the data from these charts? In particular, I am interested in this bar chart: .//*[@id='network_download']

Thanks

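Edit: I should add that when I inspected the website with Firebug, I found that the data can be obtained in the following format. But this is obviously not helpful, because it does not include any labels.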

<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;">
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;">
<circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5" style="opacity: 0.983087;">
<circle fill="#B0B359" cx="613.88416" cy="21.42857142857143" r="4.5" style="opacity: 0.983087;">
<circle fill="#D1D49E" cx="602.917665" cy="26.785714285714285" r="4.5" style="opacity: 0.983087;">
<circle fill="#A5E0B5" cx="581.5437366666666" cy="32.142857142857146" r="4.5" style="opacity: 0.983087;">

1 answer:

Answer 0 (score: 3)

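SVG charts like this tend to be a bit tough to scrape. The numbers you want are not displayed until you actually hover over the individual elements with your mouse.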

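To get the data you need:

  1. Find a list of all the dots
  2. For each dot in dots_list, click or hover over (action chains) the dot
  3. Scrape the values from the tooltip that pops up

This worked for me: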

    from __future__ import print_function
    
    from pprint import pprint as pp
    
    from selenium import webdriver
    from selenium.webdriver.common.action_chains import ActionChains
    
    
    def main():
        driver = webdriver.Chrome()
        ac = ActionChains(driver)
    
        try:
            driver.get("https://opensignal.com/reports/2016/02/state-of-lte-q4-2015/")
    
            dots_css = "div#network_download g g.dots_container circle"
            dots_list = driver.find_elements_by_css_selector(dots_css)
    
            print("Found {0} data points".format(len(dots_list)))
    
            download_speeds = list()
            for index, _ in enumerate(dots_list, 1):
                # Because this is an SVG chart, and because we need to hover it,
                # it is very likely that the elements will go stale as we do this. For
                # that reason we need to re-query each dot element right before we click it
                single_dot_css = dots_css + ":nth-child({0})".format(index)
                dot = driver.find_element_by_css_selector(single_dot_css)
                dot.click()
    
                # Scrape the text from the popup
                popup_css = "div#network_download div.tooltip"
                popup_text = driver.find_element_by_css_selector(popup_css).text
                pp(popup_text)
                rank, comp_and_country, speed = popup_text.split("\n")
                company, country = comp_and_country.split(" in ")
                speed_dict = {
                    "rank": rank.split(" Globally")[0].strip("#"),
                    "company": company,
                    "country": country,
                    "speed": speed.split("Download speed: ")[1]
                }
                download_speeds.append(speed_dict)
    
                # Hover away from the tool tip so it clears
                hover_elem = driver.find_element_by_id("network_download")
                ac.move_to_element(hover_elem).perform()
    
            pp(download_speeds)
    
        finally:
            driver.quit()
    
    if __name__ == "__main__":
        main()
    

Sample output:

    (.venv35) ➜  stackoverflow python svg_charts.py
    Found 182 data points
    '#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps'
    '#2 Globally\nStarHub in Singapore\nDownload speed: 39 Mbps'
    '#3 Globally\nSaskTel in Canada\nDownload speed: 35 Mbps'
    '#4 Globally\nOrange in Israel\nDownload speed: 35 Mbps'
    '#5 Globally\nolleh in South Korea\nDownload speed: 34 Mbps'
    '#6 Globally\nVodafone in Romania\nDownload speed: 33 Mbps'
    '#7 Globally\nVodafone in New Zealand\nDownload speed: 32 Mbps'
    '#8 Globally\nTDC in Denmark\nDownload speed: 31 Mbps'
    '#9 Globally\nT-Mobile in Hungary\nDownload speed: 30 Mbps'
    '#10 Globally\nT-Mobile in Netherlands\nDownload speed: 30 Mbps'
    '#11 Globally\nM1 in Singapore\nDownload speed: 29 Mbps'
    '#12 Globally\nTelstra in Australia\nDownload speed: 29 Mbps'
    '#13 Globally\nTelenor in Hungary\nDownload speed: 29 Mbps'
    <...>
    [{'company': 'SingTel',
      'country': 'Singapore',
      'rank': '1',
      'speed': '40 Mbps'},
     {'company': 'StarHub',
      'country': 'Singapore',
      'rank': '2',
      'speed': '39 Mbps'},
     {'company': 'SaskTel', 'country': 'Canada', 'rank': '3', 'speed': '35 Mbps'}
    ...
    ]
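
The tooltip strings in the output above follow a fixed three-line layout, so the parsing step can also be pulled out into a small standalone function. The following is only a sketch, assuming every tooltip looks like the samples shown; the names `TOOLTIP_RE` and `parse_tooltip` and the `speed_mbps` key are made up for illustration and are not part of the script above.

```python
import re

# Pattern for the three-line tooltip text seen in the sample output, e.g.
# "#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps".
# Assumption: all tooltips on the chart follow this exact layout.
TOOLTIP_RE = re.compile(
    r"#(?P<rank>\d+) Globally\n"
    r"(?P<company>.+?) in (?P<country>.+)\n"
    r"Download speed: (?P<speed>\d+(?:\.\d+)?) Mbps"
)


def parse_tooltip(text):
    """Return a dict for one tooltip, or None if the text does not match."""
    match = TOOLTIP_RE.fullmatch(text.strip())
    if match is None:
        return None
    return {
        "rank": int(match.group("rank")),
        "company": match.group("company"),
        "country": match.group("country"),
        "speed_mbps": float(match.group("speed")),
    }


if __name__ == "__main__":
    sample = "#5 Globally\nolleh in South Korea\nDownload speed: 34 Mbps"
    print(parse_tooltip(sample))
```

Returning None for text that does not match means an empty or partially rendered tooltip can be skipped or retried, rather than raising an error from a failed tuple unpacking.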
    

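It should be noted that the values in the circle elements you quoted in your question are not particularly useful, because they only specify how to draw the dots in the SVG chart.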
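
To see this concretely, the attributes of the quoted circle elements can be listed directly. This is a rough sketch using two of the elements from the question; `SVG_SNIPPET`, `ATTR_RE`, and `circle_attributes` are hypothetical names chosen for this example. A regular expression is used instead of an XML parser because the quoted tags are not self-closed.

```python
import re

# Two of the <circle> elements quoted in the question, copied verbatim.
SVG_SNIPPET = '''
<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;">
<circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5" style="opacity: 0.983087;">
'''

# Matches name="value" pairs inside a tag.
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def circle_attributes(svg_text):
    """Return one dict of attribute name -> value per <circle> tag."""
    tags = re.findall(r"<circle\b[^>]*>", svg_text)
    return [dict(ATTR_RE.findall(tag)) for tag in tags]


for attrs in circle_attributes(SVG_SNIPPET):
    # Only drawing information is present: colour, centre position, radius
    # and opacity. There is no operator name, country, or speed value.
    print(sorted(attrs))
```

Each element carries only `cx`, `cy`, `fill`, `r`, and `style`, which is why the labels have to come from the tooltip instead.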