Question

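There is a website with several interactive charts from which I would like to extract data. I have written a couple of web scrapers in Python using Selenium WebDriver before, but this seems to be a different problem. I looked at a couple of similar questions on Stack Overflow. From those, it seems the solution could be to download the data directly from a JSON file. I looked at the source code of the website and identified several JSON files, but on inspection they do not seem to contain the data.

Does anyone know how to download the data from these charts? In particular, I am interested in this bar chart: .//*[@id='network_download']

Thanks

Edit: I should add that when I inspected the website using Firebug, I saw that it is possible to get data in the following format. But this is obviously not helpful, as it does not include any labels.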

<circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5" style="opacity: 0.983087;">
<circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5" style="opacity: 0.983087;">
<circle fill="#CEA379" cx="626.4726493333333" cy="16.071428571428573" r="4.5" style="opacity: 0.983087;">
<circle fill="#B0B359" cx="613.88416" cy="21.42857142857143" r="4.5" style="opacity: 0.983087;">
<circle fill="#D1D49E" cx="602.917665" cy="26.785714285714285" r="4.5" style="opacity: 0.983087;">
<circle fill="#A5E0B5" cx="581.5437366666666" cy="32.142857142857146" r="4.5" style="opacity: 0.983087;">

Answer 1

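SVG charts like this tend to be a bit tough to scrape. The numbers you want are not displayed until you actually hover over the individual elements with your mouse.

To get the data you need to:

1. Find a list of all the dots
2. For each dot in dots_list, click or hover (action chains) the dot
3. Scrape the values from the tooltip that pops up

This works for me: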

from __future__ import print_function

from pprint import pprint as pp

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By


def main():
    driver = webdriver.Chrome()
    ac = ActionChains(driver)

    try:
        driver.get("https://opensignal.com/reports/2016/02/state-of-lte-q4-2015/")

        dots_css = "div#network_download g g.dots_container circle"
        dots_list = driver.find_elements(By.CSS_SELECTOR, dots_css)

        print("Found {0} data points".format(len(dots_list)))

        download_speeds = list()
        for index, _ in enumerate(dots_list, 1):
            # Because this is an SVG chart, and because we need to hover it,
            # it is very likely that the elements will go stale as we do this. For
            # that reason we need to re-query each dot element right before we click it
            single_dot_css = dots_css + ":nth-child({0})".format(index)
            dot = driver.find_element(By.CSS_SELECTOR, single_dot_css)
            dot.click()

            # Scrape the text from the popup
            popup_css = "div#network_download div.tooltip"
            popup_text = driver.find_element(By.CSS_SELECTOR, popup_css).text
            pp(popup_text)
            # The tooltip has three lines: rank, "<company> in <country>", speed
            rank, comp_and_country, speed = popup_text.split("\n")
            company, country = comp_and_country.split(" in ")
            speed_dict = {
                "rank": rank.split(" Globally")[0].strip("#"),
                "company": company,
                "country": country,
                "speed": speed.split("Download speed: ")[1]
            }
            download_speeds.append(speed_dict)

            # Hover away from the tool tip so it clears
            hover_elem = driver.find_element(By.ID, "network_download")
            ac.move_to_element(hover_elem).perform()

        pp(download_speeds)

    finally:
        driver.quit()

if __name__ == "__main__":
    main()

Sample output:

(.venv35) ➜  stackoverflow python svg_charts.py
Found 182 data points
'#1 Globally\nSingTel in Singapore\nDownload speed: 40 Mbps'
'#2 Globally\nStarHub in Singapore\nDownload speed: 39 Mbps'
'#3 Globally\nSaskTel in Canada\nDownload speed: 35 Mbps'
'#4 Globally\nOrange in Israel\nDownload speed: 35 Mbps'
'#5 Globally\nolleh in South Korea\nDownload speed: 34 Mbps'
'#6 Globally\nVodafone in Romania\nDownload speed: 33 Mbps'
'#7 Globally\nVodafone in New Zealand\nDownload speed: 32 Mbps'
'#8 Globally\nTDC in Denmark\nDownload speed: 31 Mbps'
'#9 Globally\nT-Mobile in Hungary\nDownload speed: 30 Mbps'
'#10 Globally\nT-Mobile in Netherlands\nDownload speed: 30 Mbps'
'#11 Globally\nM1 in Singapore\nDownload speed: 29 Mbps'
'#12 Globally\nTelstra in Australia\nDownload speed: 29 Mbps'
'#13 Globally\nTelenor in Hungary\nDownload speed: 29 Mbps'
<...>
[{'company': 'SingTel',
  'country': 'Singapore',
  'rank': '1',
  'speed': '40 Mbps'},
 {'company': 'StarHub',
  'country': 'Singapore',
  'rank': '2',
  'speed': '39 Mbps'},
 {'company': 'SaskTel', 'country': 'Canada', 'rank': '3', 'speed': '35 Mbps'}
...
]
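The split-based parsing in the script relies on every tooltip having exactly the three-line shape shown in the sample output, and on the company name never containing " in ". As a minimal sketch, assuming that same tooltip format, the parsing step could be done with a regular expression that returns None for anything unexpected instead of raising. The names `TOOLTIP_RE` and `parse_tooltip` are hypothetical and not part of the original script.

```python
import re

# Pattern inferred from the sample tooltips above (an assumption; other
# charts on the page may use different wording).
TOOLTIP_RE = re.compile(
    r"#(?P<rank>\d+) Globally\n"
    r"(?P<company>.+) in (?P<country>.+)\n"
    r"Download speed: (?P<speed>\d+(?:\.\d+)?) Mbps"
)


def parse_tooltip(text):
    """Return a dict for one tooltip, or None if the text does not match."""
    match = TOOLTIP_RE.fullmatch(text.strip())
    if match is None:
        return None
    return {
        "rank": int(match["rank"]),
        # The greedy company group means the split happens on the last " in "
        "company": match["company"],
        "country": match["country"],
        "speed_mbps": float(match["speed"]),
    }


if __name__ == "__main__":
    sample = "#5 Globally\nolleh in South Korea\nDownload speed: 34 Mbps"
    print(parse_tooltip(sample))
```

Storing the rank and speed as numbers rather than strings makes the results easier to sort or aggregate later.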

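It should be noted that the values in the circle elements you quoted in your question are not particularly useful, since they only specify how to draw the dots in the SVG chart.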
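To illustrate, here is a small sketch that parses two of the circle elements from the question with the standard library. The wrapping `<g>` element and the self-closing tags are assumptions added so the fragment parses as XML, and `circle_geometry` is a hypothetical helper name. All that comes out is pixel coordinates and a fill colour, with no operator, country, or speed.

```python
import xml.etree.ElementTree as ET

# Two circles copied from the question, wrapped in a <g> and self-closed
# (both changes are assumptions made only so the fragment is valid XML).
SVG_FRAGMENT = """
<g>
  <circle fill="#8CB1AA" cx="713.4318516666667" cy="5.357142857142858" r="4.5"/>
  <circle fill="#8CB1AA" cx="694.1212663333334" cy="10.714285714285715" r="4.5"/>
</g>
"""


def circle_geometry(svg_text):
    """Return (cx, cy, fill) for every circle element in the fragment."""
    root = ET.fromstring(svg_text)
    return [
        (float(c.get("cx")), float(c.get("cy")), c.get("fill"))
        for c in root.iter("circle")
    ]


if __name__ == "__main__":
    for cx, cy, fill in circle_geometry(SVG_FRAGMENT):
        # Only drawing positions and colours are available here
        print(cx, cy, fill)
```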
