Extracting data from a URL result with special formatting

Date: 2010-10-29 22:31:04

Tags: python parsing url

I have a URL:
http://somewhere.com/relatedqueries?limit=2&query=seedterm

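Modifying the two inputs, limit and query, generates the wanted data. Limit is the maximum number of terms to return, and query is the seed term.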

The URL returns a text result formatted this way:
oo.visualization.Query.setResponse({version:'0.5',reqId:'0',status:'ok',sig:'1303596067112929220',table:{cols:[{id:'score',label:'Score',type:'number',pattern:'#,##0.###'},{id:'query',label:'Query',type:'string',pattern:''}],rows:[{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm1'}]},{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm2'}]}],p:{'totalResultsCount':'7727'}}});

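I would like to write a Python script that takes two arguments (the limit number and the query seed), fetches the data online, parses the result, and returns a list of the new terms, which is ['newterm1', 'newterm2'] in this case.

I would love some help, especially with the URL fetching, since I have never done this before.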

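2 answers:

Answer 0: (score: 12)

It sounds like you can break this problem up into several subproblems.

Subproblems

There are a handful of problems that need to be solved before composing the completed script: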

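  1. Forming the request URL: creating a configured request URL from a template
  2. Retrieving data: actually making the request
  3. Unwrapping JSONP: the returned data appears to be JSON wrapped in a JavaScript function call
  4. Traversing the object graph: navigating through the result to find the desired bits of information

    Forming the request URL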

    This is just simple string formatting.

    url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
    url = url_template.format(limit=2, seedterm='seedterm')
    
      

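    Python 2 note

    str.format is only available from Python 2.6 onward. On older versions, use the string formatting operator (%) instead.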

    url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
    url = url_template % dict(limit=2, seedterm='seedterm')
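Plain string formatting leaves the seed term unescaped, so a term containing spaces or `&` would produce a malformed URL. A minimal sketch of one way around that, assuming Python 3 and using `urllib.parse.urlencode` from the standard library; the helper name `build_url` and the constant `BASE_URL` are made up for illustration:

```python
from urllib.parse import urlencode

# Endpoint taken from the question; the constant name is illustrative
BASE_URL = 'http://somewhere.com/relatedqueries'


def build_url(limit, seedterm):
    """Return the request URL with both parameters percent-encoded."""
    # urlencode escapes spaces, '&', '=' and non-ASCII characters
    query_string = urlencode({'limit': limit, 'query': seedterm})
    return BASE_URL + '?' + query_string


print(build_url(2, 'seed term & more'))
```

For a simple term such as 'seedterm' this yields the same URL as the template above.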
    

    Retrieving data

    You can use the built-in urllib.request module for this.

    import urllib.request
    data = urllib.request.urlopen(url) # url from previous section
    

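    This returns a file-like object called data. You can also use a with statement here, which closes the connection when the block ends:

    with urllib.request.urlopen(url) as data:
        # do processing here, for example read the whole response body
        result = data.read()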
    
      

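    Python 2 note

    Import urllib2 instead of urllib.request. The object returned by urllib2.urlopen does not support the with statement, so call its close() method yourself or wrap it in contextlib.closing.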
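The retrieval step could also be wrapped in a small helper that returns text rather than bytes. This is a sketch for Python 3 only; the name `fetch_text`, the 10-second timeout, and the UTF-8 fallback are assumptions, not something the server in the question is known to require:

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Fetch url and return the response body decoded to a str."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Prefer the charset declared by the server; fall back to UTF-8
        # (an assumption) when none is declared
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)
```

Calling `fetch_text(url)` with the URL built in the previous section would then give a string that is ready for the unwrapping step.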

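    Unwrapping JSONP

    The result you pasted looks like JSONP. Given that the wrapping function that is called (oo.visualization.Query.setResponse) does not change, we can simply strip this method call out.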

    # read() returns bytes in Python 3; decode to str (assumes a UTF-8 response)
    result = data.read().decode('utf-8')
    
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'
    
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]
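If the callback name ever changed, or the response carried trailing whitespace, the fixed prefix and suffix check would silently leave the wrapper in place. A hedged alternative is a regular expression that accepts any dotted callback name; `unwrap_jsonp` and `JSONP_PATTERN` are hypothetical names, and the pattern is a sketch rather than a full JavaScript parser:

```python
import re

# Optional whitespace, a dotted callback name, '(', the payload, ')',
# then an optional ';' and trailing whitespace
JSONP_PATTERN = re.compile(r'^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(text):
    """Return the payload inside a JSONP call, or text unchanged if unwrapped."""
    match = JSONP_PATTERN.match(text)
    return match.group(1) if match else text


print(unwrap_jsonp('oo.visualization.Query.setResponse({"a": 1});\n'))
```

Text that is not wrapped in a function call is returned as-is, so the helper can be applied unconditionally.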
    

    Parsing JSON

    The resulting result string is just JSON data. Parse it with the built-in json module.

    import json
    
    result_object = json.loads(result)
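One thing worth checking against the live response: `json.loads` accepts only strict JSON, with double-quoted keys and strings. The sample as pasted in the question uses unquoted keys and single quotes, which `json.loads` rejects, so this step assumes the server actually sends strict JSON inside the wrapper. The two strings below are made-up stand-ins that show the difference:

```python
import json

# Strict JSON: double-quoted keys and strings
strict = '{"table": {"rows": [{"c": [{"v": 0.99}, {"v": "newterm1"}]}]}}'
parsed = json.loads(strict)
print(parsed['table']['rows'][0]['c'][1]['v'])

# JavaScript-style object literal with unquoted keys
loose = "{table: {rows: []}}"
try:
    json.loads(loose)
    loose_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    loose_ok = False
print(loose_ok)
```

If the real payload turns out to be a JavaScript literal rather than strict JSON, a more lenient parser would be needed in place of the json module.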
    

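    Traversing the object graph

    Now you have a result_object that represents the JSON response. The object itself is a dict with keys such as version and reqId. Based on your question, here is what you would need to do to create your list.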

    # Get the rows in the table, then get the second column's value for
    # each row
    terms = [row['c'][1]['v'] for row in result_object['table']['rows']]
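Hard-coding the column position works as long as the server never reorders its columns. A slightly more defensive sketch looks the column up by its id first. It assumes the table carries a `cols` list whose entries have an `id` key, as the sample in the question suggests; `column_values` and the `sample` dict are illustrative only:

```python
def column_values(result_object, column_id):
    """Return the 'v' value of the column named column_id for every row."""
    table = result_object['table']
    # Find the position of the wanted column from the 'cols' metadata
    ids = [col['id'] for col in table['cols']]
    index = ids.index(column_id)  # raises ValueError if the id is absent
    return [row['c'][index]['v'] for row in table['rows']]


# Hand-written stand-in for the parsed response
sample = {
    'table': {
        'cols': [{'id': 'score'}, {'id': 'query'}],
        'rows': [
            {'c': [{'v': 0.98, 'f': '0.99'}, {'v': 'newterm1'}]},
            {'c': [{'v': 0.98, 'f': '0.99'}, {'v': 'newterm2'}]},
        ],
    }
}
print(column_values(sample, 'query'))
```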
    

    Putting it all together

    #!/usr/bin/env python3
    
    """A script for retrieving and parsing results from requests to
    somewhere.com.
    
    This script works as either a standalone script or as a library. To use
    it as a standalone script, run it as `python3 scriptname.py`. To use it
    as a library, use the `retrieve_terms` function."""
    
    import urllib.request
    import json
    import sys
    
    E_OPERATION_ERROR = 1
    E_INVALID_PARAMS = 2
    
    def parse_result(result):
        """Parse a JSONP result string and return a list of terms"""
        prefix = 'oo.visualization.Query.setResponse('
        suffix = ');'
    
        # Strip JSONP function wrapper
        if result.startswith(prefix) and result.endswith(suffix):
            result = result[len(prefix):-len(suffix)]
    
        # Deserialize JSON to Python objects
        result_object = json.loads(result)
    
        # Get the rows in the table, then get the second column's value
        # for each row
        return [row['c'][1]['v'] for row in result_object['table']['rows']]
    
    def retrieve_terms(limit, seedterm):
        """Retrieves and parses data and returns a list of terms"""
        url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
        url = url_template.format(limit=limit, seedterm=seedterm)
    
        try:
            with urllib.request.urlopen(url) as data:
                # read() returns bytes; decode to str (assumes UTF-8)
                result = data.read().decode('utf-8')
        except OSError:
            # urllib.error.URLError and socket errors derive from OSError
            print('Could not request data from server', file=sys.stderr)
            sys.exit(E_OPERATION_ERROR)
    
        return parse_result(result)
    
    def main(limit, seedterm):
        """Retrieves and parses data and prints each term to standard output"""
        terms = retrieve_terms(limit, seedterm)
        for term in terms:
            print(term)
    
    if __name__ == '__main__':
        try:
            limit = int(sys.argv[1])
            seedterm = sys.argv[2]
        except (IndexError, ValueError):
            error_message = '''{} limit seedterm
    
    limit must be an integer'''.format(sys.argv[0])
            print(error_message, file=sys.stderr)
            sys.exit(E_INVALID_PARAMS)
    
        sys.exit(main(limit, seedterm))
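The manual sys.argv handling above could also be delegated to the standard argparse module, which generates the usage message and exits with status 2 on invalid input. This is only a sketch of that alternative, not part of the script above; the function name `parse_args` and the help texts are made up:

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments: limit (int) and seedterm (str)."""
    parser = argparse.ArgumentParser(
        description='Retrieve related terms for a seed term.')
    parser.add_argument('limit', type=int, help='maximum number of terms')
    parser.add_argument('seedterm', help='seed term to query for')
    # With argv=None, argparse reads sys.argv[1:]
    return parser.parse_args(argv)


args = parse_args(['2', 'seedterm'])
print(args.limit, args.seedterm)
```

In the script, `args.limit` and `args.seedterm` would then be passed to `main`.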
    

    Python 2.7 version

    #!/usr/bin/env python2.7
    
    """A script for retrieving and parsing results from requests to
    somewhere.com.
    
    This script works as either a standalone script or as a library. To use
    it as a standalone script, run it as `python2.7 scriptname.py`. To use it
    as a library, use the `retrieve_terms` function."""
    
    import urllib2
    import json
    import sys
    
    E_OPERATION_ERROR = 1
    E_INVALID_PARAMS = 2
    
    def parse_result(result):
        """Parse a JSONP result string and return a list of terms"""
        prefix = 'oo.visualization.Query.setResponse('
        suffix = ');'
    
        # Strip JSONP function wrapper
        if result.startswith(prefix) and result.endswith(suffix):
            result = result[len(prefix):-len(suffix)]
    
        # Deserialize JSON to Python objects
        result_object = json.loads(result)
    
        # Get the rows in the table, then get the second column's value
        # for each row
        return [row['c'][1]['v'] for row in result_object['table']['rows']]
    
    def retrieve_terms(limit, seedterm):
        """Retrieves and parses data and returns a list of terms"""
        url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
        url = url_template % dict(limit=limit, seedterm=seedterm)
    
        try:
            # urllib2 responses are not context managers, so close explicitly
            data = urllib2.urlopen(url)
            try:
                result = data.read()
            finally:
                data.close()
        except IOError:
            # urllib2.URLError is a subclass of IOError
            sys.stderr.write('%s\n' % 'Could not request data from server')
            sys.exit(E_OPERATION_ERROR)
    
        return parse_result(result)
    
    def main(limit, seedterm):
        """Retrieves and parses data and prints each term to standard output"""
        terms = retrieve_terms(limit, seedterm)
        for term in terms:
            print term
    
    if __name__ == '__main__':
        try:
            limit = int(sys.argv[1])
            seedterm = sys.argv[2]
        except (IndexError, ValueError):
            error_message = '''{} limit seedterm
    
    limit must be an integer'''.format(sys.argv[0])
            sys.stderr.write('%s\n' % error_message)
            sys.exit(E_INVALID_PARAMS)
    
        sys.exit(main(limit, seedterm))
    

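Answer 1: (score: 1)

I did not understand your problem well, because from your code it seems to me that you are using the Visualization API (this is the first time I have heard of it).

But if you are just looking for a way to get data from a web page, you can use urllib2. That only fetches the data; if you want to parse the retrieved data, you will have to use a more appropriate library such as BeautifulSoup.

If you are dealing with another kind of web service (RSS, Atom, RPC) rather than web pages, you can find a bunch of Python libraries that handle each of these services well.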

import urllib2

from BeautifulSoup import BeautifulSoup

result =  urllib2.urlopen('http://somewhere.com/relatedqueries?limit=%s&query=%s' % (2, 'seedterm'))

htmltext = result.read()

result.close()

soup = BeautifulSoup(htmltext, convertEntities="html" )

# You can parse your data now; check the BeautifulSoup API.