Attempt to read XML from the Goodreads API using requests and lxml fails

Asked: 2018-08-28 13:52:38

Tags: python xml api python-requests lxml

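Goodreads states that I should get XML whose root is named <GoodreadsResponse>, with <book> as its first child and image_url as that element's eighth child. The trouble is that I can't even get the correct root recognized (it prints root rather than GoodreadsResponse), and the root appears to have no children at all, even though the response code is 200. I would rather work with JSON, and supposedly the response can be converted to JSON, but I have had zero luck with that.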

This is the function I have so far. Where am I going wrong?

def main(url, payload):
    """Retrieves image from Goodreads API endpoint returning XML response"""
    res = requests.get(url, payload)
    status = res.status_code
    print(status)
    parser = etree.XMLParser(recover=True)
    tree = etree.fromstring(res.content, parser=parser)
    root = etree.Element("root")
    print(root.text)

if __name__ == '__main__':
    main("https://www.goodreads.com/book/isbn/", '{"isbns": "0441172717", "key": "my_key"}')

The relevant Goodreads documentation is here:

**Get the reviews for a book given an ISBN**
Get an xml or json response that contains embed code for the iframe reviews widget that shows excerpts (first 300 characters) of the most popular reviews of a book for a given ISBN. The reviews are from all known editions of the book. 
URL: https://www.goodreads.com/book/isbn/ISBN?format=FORMAT    (sample url) 
HTTP method: GET 

2 Answers:

Answer 0 (score: 1)

At the moment, the response you are getting contains HTML rather than XML. You need to specify the desired response format: https://www.goodreads.com/book/isbn/ISBN?format=FORMAT

You need to pass params rather than a payload; see "Constructing requests with URL Query String in Python".
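
As a rough sketch of what passing parameters means here: the query string is built from a dict of key/value pairs, not from a JSON string, and the ISBN goes into the path. The helper `build_book_url` below is hypothetical and uses only the standard library to show the resulting URL. The `format=xml` value and the placeholder key `my_key` are assumptions based on the URL template quoted in the question.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.goodreads.com/book/isbn/"


def build_book_url(isbn, key, fmt="xml"):
    """Hypothetical helper: ISBN goes in the path, everything else in the query string."""
    params = {"format": fmt, "key": key}  # placeholder values, not real credentials
    return "{}{}?{}".format(BASE_URL, isbn, urlencode(params))


url = build_book_url("0441172717", "my_key")
print(url)

# With requests, the same dict would be passed as the params argument:
# requests.get(BASE_URL + "0441172717", params={"format": "xml", "key": "my_key"})
```

Passing the dict through `params=` lets requests do the URL encoding, so the key and format never have to be concatenated by hand.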

P.S. For the request you are making, you can also use JSON: https://www.goodreads.com/api/index#book.show_by_isbn

Answer 1 (score: 0)

This is the solution that worked best for me:

import requests
from bs4 import BeautifulSoup

def main():
    key = 'myKey'
    isbn = '0441172717'
    url = 'https://www.goodreads.com/book/isbn/{}?key={}'.format(isbn, key)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "lxml-xml")
    print(soup.find('image_url').text)

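The problem was that the XML content is wrapped in CDATA tags. Using Beautiful Soup's "lxml-xml" parser rather than "lxml" preserved the content enclosed in the CDATA tags and allowed it to be parsed correctly.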
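
To illustrate the CDATA point without calling the API, here is a minimal sketch that parses a made-up response with the standard library's XML parser. The sample document, its title, and its URL are invented for illustration and only mimic the structure described in the question. The `tag`, `find`, and `text` calls used here are also available on elements returned by lxml's `etree.fromstring`.

```python
import xml.etree.ElementTree as ET

# Invented sample that mimics the shape described in the question.
SAMPLE = (
    "<GoodreadsResponse>"
    "<book>"
    "<title><![CDATA[Example Title]]></title>"
    "<image_url><![CDATA[https://example.com/cover.jpg]]></image_url>"
    "</book>"
    "</GoodreadsResponse>"
)

# Read from the element returned by the parser; do not create a new Element.
root = ET.fromstring(SAMPLE)
image_url = root.find("book/image_url").text  # CDATA content comes back as plain text

print(root.tag)
print(image_url)
```

In the question's code, `etree.Element("root")` builds a brand-new empty element instead of using the parsed tree, which is likely why the root was reported as `root` with no children.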