
Question

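I have a URL:
http://somewhere.com/relatedqueries?limit=2&query=seedterm

Changing the two inputs, limit and query, produces the data I want. limit is the maximum number of terms to return, and query is the seed term.

The URL returns a text result formatted like this:
oo.visualization.Query.setResponse({version:'0.5',reqId:'0',status:'ok',sig:'1303596067112929220',table:{cols:[{id:'score',label:'Score',type:'number',pattern:'#,##0.###'},{id:'query',label:'Query',type:'string',pattern:''}],rows:[{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm1'}]},{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm2'}]}],p:{'totalResultsCount':'7727'}}});

I want to write a Python script that takes two arguments (the limit and the query seed), fetches the data online, parses the result, and returns a list of the new terms, ['newterm1', 'newterm2'] in this case.

I would appreciate some help, especially with fetching the URL, since I have never done that before.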

Answer 1

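It sounds like you can break this problem down into a few subproblems.

Subproblems

A handful of problems need to be solved before the finished script can be written:

Forming the request URL: creating a configured request URL from a template
Retrieving data: actually making the request
Unwrapping JSONP: the returned data appears to be JSON wrapped in a JavaScript function call
Traversing the object graph: navigating through the result to find the desired bits of information

Forming the request URL

This is just simple string formatting.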

url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
url = url_template.format(limit=2, seedterm='seedterm')

Python 2 note

str.format is available from Python 2.6 onward. On older versions, use the string formatting operator (%) instead.

url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
url = url_template % dict(limit=2, seedterm='seedterm')
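
One caveat with plain string formatting is that a seed term containing spaces or characters such as & is not escaped. Here is a minimal Python 3 sketch that lets urllib.parse.urlencode build the query string instead. It assumes the same placeholder somewhere.com endpoint from the question, and build_url is only an illustrative helper name.

```python
from urllib.parse import urlencode

# Placeholder endpoint taken from the question
BASE_URL = 'http://somewhere.com/relatedqueries'

def build_url(limit, seedterm):
    """Return the request URL with both parameters percent-encoded."""
    query = urlencode({'limit': limit, 'query': seedterm})
    return '{}?{}'.format(BASE_URL, query)

print(build_url(2, 'seed term & more'))
# http://somewhere.com/relatedqueries?limit=2&query=seed+term+%26+more
```

For a simple seed term such as 'seedterm' this produces exactly the same URL as the template above.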

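Retrieving data

You can use the built-in urllib.request module.

import urllib.request
data = urllib.request.urlopen(url) # url from previous section

This returns a file-like object, here named data. You can also use a with statement:

with urllib.request.urlopen(url) as data:
    result = data.read()  # do processing here

Python 2 note

Import urllib2 instead of urllib.request. The object returned by urllib2.urlopen does not support the with statement, so call its close() method when you are done (or wrap it in contextlib.closing).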
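
Since fetching the URL is the part you asked about, here is a slightly fuller Python 3 sketch of that step. The helper name fetch_text, the 10-second timeout, and the UTF-8 fallback are all assumptions, so adjust them to whatever the real server needs.

```python
import urllib.request

def fetch_text(url, timeout=10):
    """Fetch a URL and return the response body decoded to str."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares; UTF-8 is an assumed fallback.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)
```

Connection failures surface as urllib.error.URLError, so a caller will probably want to wrap the call in try/except.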

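Unwrapping JSONP

The result you pasted looks like JSONP. Given that the wrapping function being called (oo.visualization.Query.setResponse) doesn't change, we can simply strip this function call off.

# read() returns bytes in Python 3; decoding as UTF-8 is an assumption about the server
result = data.read().decode('utf-8')

prefix = 'oo.visualization.Query.setResponse('
suffix = ');'

if result.startswith(prefix) and result.endswith(suffix):
    result = result[len(prefix):-len(suffix)]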
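
If the callback name might vary between requests, a regular expression can strip whatever function call wraps the payload. This is only a sketch: unwrap_jsonp is a hypothetical helper, and the pattern assumes the whole response is a single call of the form name(...); with nothing else around it.

```python
import re

# A dotted identifier, an opening parenthesis, the payload, a closing
# parenthesis and an optional trailing semicolon.
_JSONP_RE = re.compile(r'^\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)

def unwrap_jsonp(text):
    """Return the payload inside a JSONP call, or text unchanged if it is not wrapped."""
    match = _JSONP_RE.match(text)
    return match.group(1) if match else text

print(unwrap_jsonp('oo.visualization.Query.setResponse({"status": "ok"});'))
# {"status": "ok"}
```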

Parsing JSON

The resulting result string should now be plain JSON data. Parse it using the built-in json module.

import json

result_object = json.loads(result)
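
One thing to watch for: json.loads only accepts strict JSON, with double-quoted keys and strings. The sample pasted in the question uses unquoted keys and single quotes, so if the server really sends it in that form, parsing will fail and the text would need converting first. A small sketch to check which case you are in (parse_json_or_none is just an illustrative name, and both sample strings are made up):

```python
import json

def parse_json_or_none(text):
    """Return the parsed object, or None if text is not strict JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

strict = '{"version": "0.5", "status": "ok"}'
loose = "{version: '0.5', status: 'ok'}"  # JavaScript-style literal, not JSON

print(parse_json_or_none(strict))  # {'version': '0.5', 'status': 'ok'}
print(parse_json_or_none(loose))   # None
```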

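Traversing the object graph

Now you have result_object, which represents the JSON response. The object itself is a dict with keys such as version, reqId, and so on. Based on your question, here is what you would need to do to create your list.

# Get the rows in the table, then get the second column's value for
# each row (index 1, since list indices start at 0)
terms = [row['c'][1]['v'] for row in result_object['table']['rows']]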
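
Rather than hard-coding the column position, you could look the column up by its id in table.cols. The sketch below uses a hand-written, strict-JSON sample that mirrors the structure shown in the question; column_values is a hypothetical helper name.

```python
import json

# Hand-written sample mirroring the structure in the question (strict JSON)
SAMPLE = '''
{"version": "0.5", "reqId": "0", "status": "ok",
 "table": {
   "cols": [{"id": "score", "label": "Score", "type": "number"},
            {"id": "query", "label": "Query", "type": "string"}],
   "rows": [{"c": [{"v": 0.9894380670262618, "f": "0.99"}, {"v": "newterm1"}]},
            {"c": [{"v": 0.9894380670262618, "f": "0.99"}, {"v": "newterm2"}]}]}}
'''

def column_values(result_object, column_id):
    """Return the 'v' value of the column with the given id, for every row."""
    table = result_object['table']
    # Find the position of the requested column from the column definitions
    index = [col['id'] for col in table['cols']].index(column_id)
    return [row['c'][index]['v'] for row in table['rows']]

print(column_values(json.loads(SAMPLE), 'query'))  # ['newterm1', 'newterm2']
```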

Putting it all together

#!/usr/bin/env python3

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python3 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib.request
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then get the second column's value
    # (index 1) for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
    url = url_template.format(limit=limit, seedterm=seedterm)

    try:
        with urllib.request.urlopen(url) as data:
            # read() returns bytes; decoding as UTF-8 is an assumption
            result = data.read().decode('utf-8')
    except OSError:
        # urllib.error.URLError is a subclass of OSError
        print('Could not request data from server', file=sys.stderr)
        sys.exit(E_OPERATION_ERROR)

    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print(term)

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        # Missing argument or non-integer limit: show usage and bail out
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        print(error_message, file=sys.stderr)
        sys.exit(E_INVALID_PARAMS)

    sys.exit(main(limit, seedterm))
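
As an alternative to reading sys.argv by hand, the standard argparse module can validate the two arguments and print the usage message for you. A minimal sketch, where the description and help strings are my own wording:

```python
import argparse

def parse_args(argv=None):
    """Parse the limit and seedterm command-line arguments."""
    parser = argparse.ArgumentParser(description='Retrieve related query terms.')
    parser.add_argument('limit', type=int, help='maximum number of terms to return')
    parser.add_argument('seedterm', help='seed term to query for')
    return parser.parse_args(argv)

# Passing a list makes the parser easy to try without a real command line
args = parse_args(['2', 'seedterm'])
print(args.limit, args.seedterm)  # 2 seedterm
```

With invalid input (for example a non-integer limit), argparse prints a usage message to standard error and exits with status 2.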

Python 2.7 version

#!/usr/bin/env python2.7

"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python2.7 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib2
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'

    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]

    # Deserialize JSON to Python objects
    result_object = json.loads(result)

    # Get the rows in the table, then get the second column's value
    # (index 1) for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]

def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
    url = url_template % dict(limit=limit, seedterm=seedterm)

    try:
        # urllib2 responses do not support the with statement
        data = urllib2.urlopen(url)
        try:
            result = data.read()
        finally:
            data.close()
    except urllib2.URLError:
        sys.stderr.write('%s\n' % 'Could not request data from server')
        sys.exit(E_OPERATION_ERROR)

    return parse_result(result)

def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print term

if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        # Missing argument or non-integer limit: show usage and bail out
        error_message = '''{} limit seedterm

limit must be an integer'''.format(sys.argv[0])
        sys.stderr.write('%s\n' % error_message)
        sys.exit(E_INVALID_PARAMS)

    sys.exit(main(limit, seedterm))

Answer 2

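I don't fully understand your question, because from your code it looks to me like you are using the Visualization API (this is the first time I have heard of it).

However, if you are just looking for a way to fetch data from a web page, you can use urllib2. That only retrieves the data; if you want to parse what you retrieved, you will need a more suitable library such as BeautifulSoup.

If you are dealing with another kind of web service (RSS, Atom, RPC) rather than a web page, you can find a bunch of Python libraries that handle each of these services well.

import urllib2

from BeautifulSoup import BeautifulSoup

# Fetch the page (Python 2, BeautifulSoup 3)
result = urllib2.urlopen('http://somewhere.com/relatedqueries?limit=%s&query=%s' % (2, 'seedterm'))
htmltext = result.read()
result.close()

soup = BeautifulSoup(htmltext, convertEntities="html")

# You can parse your data now; check the BeautifulSoup API.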

Extracting data from a URL result with a special format
