How do I scrape chart data from a website with Python?

Time: 2016-10-05 03:06:13

Tags: python graph screen-scraping

Edit:

I saved the script code shown below to a text file, but using re to extract the data still returns nothing. My code is:

file_object = open('source_test_script.txt', mode="r")
soup = BeautifulSoup(file_object, "html.parser")
pattern = re.compile(r"^var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);$", re.MULTILINE | re.DOTALL)
scripts = soup.find("script", text=pattern)
profile_text = pattern.search(scripts.text).group(1)
profile = json.loads(profile_text)

print profile["data"], profile["categories"]

I want to extract a chart's data from a website. Here is the source code of the chart.

  <script type="text/javascript">
    jQuery(function() {

    var chart1 = new Highcharts.Chart({

          chart: {
             renderTo: 'chart1',
              defaultSeriesType: 'column',
            borderWidth: 2
          },
          title: {
             text: 'Productions'
          },
          legend: {
            enabled: false
          },
          xAxis: [{
             categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016],

          }],
          yAxis: {
             min: 0,
             title: {
             text: 'Productions'
          }
          },

          series: [{
               name: 'Productions',
               data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36]
               }]
       });
    });

    </script>

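The website has several similar charts, named "chart1", "chart2", and so on. For each chart I want to extract the following data, namely the categories line and the data line: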

categories: [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016]

data: [1,1,0,1,6,4,9,15,15,19,24,18,53,42,54,53,61,36]

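2 answers:

Answer 0: (score: 4)

Another approach is to use the Highcharts JavaScript library the way you would in the browser console, and pull the result out with Selenium.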

import time
from selenium import webdriver

website = ""

driver = webdriver.Firefox()
driver.get(website)
time.sleep(5)

temp = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')
data = [item[1] for item in temp]
print(data)

Depending on which chart and series you are trying to extract, your case may differ slightly.
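One way the case can differ is the shape of each point. The `item[1]` indexing above assumes `[x, y]` pairs, but Highcharts also accepts plain numbers (as in the question's chart) and point objects with a `y` key. A minimal sketch of a normalizer, assuming Python 3; the helper name `y_values` is made up for illustration and is not part of Selenium or Highcharts:

```python
from numbers import Number


def y_values(points):
    """Return the y value of each Highcharts point, whatever its shape."""
    values = []
    for point in points:
        if point is None or isinstance(point, Number):
            # Plain number (or a null gap): the value itself is y.
            values.append(point)
        elif isinstance(point, dict):
            # Point object such as {"name": "a", "y": 3}.
            values.append(point.get("y"))
        else:
            # [x, y] pair: the second element is y.
            values.append(point[1])
    return values


print(y_values([1, 1, 0, 6]))             # [1, 1, 0, 6]
print(y_values([[1999, 1], [2000, 4]]))   # [1, 4]
print(y_values([{"name": "a", "y": 3}]))  # [3]
```

With the Selenium snippet above, this would be called as `y_values(temp)` in place of the list comprehension.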

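Answer 1: (score: 0)

I would use a combination of a regular expression and a YAML parser. This is quick and dirty, and you may need to adjust the regex, but it works with the example.

It requires the yaml library (pip install PyYAML), and you should use BeautifulSoup to extract the correct <script> tag before passing it to the regex.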
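The reason for reaching for a YAML parser is that the Highcharts config is a JavaScript object literal rather than JSON: its keys are unquoted and its strings use single quotes. A quick check with the standard library, assuming Python 3, shows the built-in `json` module rejecting such a literal while still accepting a bare array like the chart's `data` line (`try_json` is an illustrative helper name):

```python
import json

js_object = "{name: 'Productions', data: [1, 1, 0, 1]}"
js_array = "[1, 1, 0, 1]"


def try_json(text):
    """Return the parsed value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


print(try_json(js_object))  # None: unquoted keys are not valid JSON
print(try_json(js_array))   # [1, 1, 0, 1]
```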

Edit - full example

Sorry, I didn't make myself clear. You use BeautifulSoup to parse the HTML and extract the <script> elements, and then use PyYAML to parse the JavaScript object declaration. You can't use the built-in json library because the declaration is not valid JSON, but plain JavaScript object declarations (i.e. with no functions) are a subset of YAML.

from bs4 import BeautifulSoup
import yaml
import re

file_object = open('source_test_script.txt', mode="r")
soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

# find every <script> tag in the source using beautifulsoup
for tag in soup.find_all('script'):

    # tabs are special in yaml so remove them first
    script = tag.text.replace('\t', '')

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):
        try:
            # parse the javascript declaration
            charts[name] = yaml.safe_load(obj_declaration)
        except Exception as e:
            print("Failed to parse {0}: {1}".format(name, e))

# extract the data you want
for name in charts:
    print("## {0} ##".format(name))
    print("categories:", charts[name]['xAxis'][0]['categories'])
    print("data:", charts[name]['series'][0]['data'])
    print()

Output:

## chart1 ##
categories: [1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
data: [1, 1, 0, 1, 6, 4, 9, 15, 15, 19, 24, 18, 53, 42, 54, 53, 61, 36]

Note that I had to tweak the regex so that it handles the unicode output and whitespace coming from BeautifulSoup. In my original example I simply passed the source straight to the regex.

Edit 2 - no yaml

Given that the JavaScript appears to be partially generated, the best you can hope for is to grab the individual lines. It's not elegant, but it will probably work for you.

from bs4 import BeautifulSoup
import json
import re

file_object = open('citec.repec.org_p_c_pcl20.html', mode="r")
soup = BeautifulSoup(file_object, "html.parser")

pattern = re.compile(r"var (chart[0-9]+) = new Highcharts.Chart\(({.*?})\);", re.MULTILINE | re.DOTALL | re.UNICODE)

charts = {}

for tag in soup.find_all('script'):

    # no yaml parsing here, so the script text is used as-is
    script = tag.text

    # find each object declaration
    for name, obj_declaration in pattern.findall(script):

        # start a fresh dict per chart so charts never share values
        values = {}

        for line in obj_declaration.split('\n'):
            line = line.strip('\t\n ,;')
            for field in ('data', 'categories'):
                if line.startswith(field + ":"):
                    data = line[len(field)+1:]
                    try:
                        # each of these lines is expected to hold a plain array
                        values[field] = json.loads(data)
                    except ValueError:
                        print("Failed to parse %r for %s" % (data, name))

        charts[name] = values

print(charts)

Note that it fails for chart 7, because that chart references another variable.
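To see the line-based idea in isolation, here is a minimal sketch in Python 3 that uses only the standard library and runs on an inline string instead of a saved page. `SAMPLE` is a trimmed copy of the question's script plus a made-up `chart7` whose `data` refers to a variable (`someVar` is hypothetical), which illustrates why such a chart cannot be parsed this way. The names `CHART_RE` and `extract_charts` are illustrative as well.

```python
import json
import re

SAMPLE = """
var chart1 = new Highcharts.Chart({
    xAxis: [{
        categories: [1999,2000,2001],
    }],
    series: [{
        name: 'Productions',
        data: [1,1,0]
    }]
});
var chart7 = new Highcharts.Chart({
    series: [{
        data: someVar
    }]
});
"""

CHART_RE = re.compile(
    r"var (chart[0-9]+) = new Highcharts\.Chart\((\{.*?\})\);", re.DOTALL
)


def extract_charts(script):
    """Map each chart name to the data/categories lines that parse as JSON."""
    charts = {}
    for name, declaration in CHART_RE.findall(script):
        values = {}
        for line in declaration.split("\n"):
            line = line.strip("\t\n ,;")
            for field in ("data", "categories"):
                if line.startswith(field + ":"):
                    try:
                        values[field] = json.loads(line[len(field) + 1:])
                    except ValueError:
                        # Not a literal array (e.g. a variable reference).
                        pass
        charts[name] = values
    return charts


print(extract_charts(SAMPLE))
```

Here `chart1` yields both arrays, while `chart7` ends up with an empty dict because `someVar` is not something `json.loads` can read.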