Question

I am using the following to pull data from an API:

import requests

url = "http://sitename"
response = requests.get(url)
data = response.text
print(data)

The output is raw XML; this is how it looks in the browser:

<projects count="8" href="/httpAuth/app/rest/projects/">
<project id="_Root" name="" description="" href="" webUrl=""/>
<project id="_Root1" name="" description="" href="" webUrl=""/>
<project id="_Root2" name="" description="" href="" webUrl=""/>
<project id="_Root3" name="" description="" href="" webUrl=""/>
<project id="_Root4" name="" description="" href="" webUrl=""/>
<project id="_Root5" name="" description="" href="" webUrl=""/>
<project id="_Root6" name="" description="" href="" webUrl=""/>
<project id="_Root7" name="" description="" href="" webUrl=""/>
</projects>

How can I turn each line into a usable form, for example loop over every project in the list, pull out its id / name / description / href, and store them?

I tried asking for JSON through the Accept header in requests.get(), but it still spits out XML, so I don't think I can use that content structure.

Answer 1

I use lxml.

import requests
from lxml import etree

url = "http://sitename"
response = requests.get(url)
data = response.text
tree = etree.fromstring(data)
for leaf in tree:
    print(leaf.tag, leaf.attrib['id'], leaf.attrib['name'],
          leaf.attrib['description'], leaf.attrib['href'],
          leaf.attrib['webUrl'])

This gives you:

project _Root
project _Root1
project _Root2
project _Root3
project _Root4
project _Root5
project _Root6
project _Root7
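
To store the values instead of printing them, as the question asks, each element's attributes can be copied into a dict. This is only a minimal sketch: it assumes the standard-library xml.etree.ElementTree in place of lxml (the loop looks the same with either), and the hard-coded `xml_text` is a made-up stand-in for `response.text`.

```python
import xml.etree.ElementTree as ET

# Sample standing in for response.text (an assumption for illustration).
xml_text = """<projects count="2" href="/httpAuth/app/rest/projects/">
<project id="_Root" name="" description="" href="" webUrl=""/>
<project id="_Root1" name="" description="" href="" webUrl=""/>
</projects>"""

root = ET.fromstring(xml_text)

# Attribute names taken from the XML shown in the question.
FIELDS = ("id", "name", "description", "href", "webUrl")

# One dict per <project> element; .get() returns None if an attribute is absent.
projects = [
    {field: element.get(field) for field in FIELDS}
    for element in root.iter("project")
]

for project in projects:
    print(project["id"], project["name"], project["href"])
```

If the real response starts with an XML declaration that names an encoding, passing `response.content` (bytes) to lxml rather than `response.text` is probably the safer choice.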

Answer 2

For well-formed XML you can use lxml (as @Adam Smith suggests), a very well-known library for parsing XML data.

For example, parsing the data you mention takes the following snippet:

>>> from lxml import etree
>>> root = etree.fromstring(s)  # s is the XML string from the question
>>> for element in root: print(element.items())  # list of (attribute, value) pairs
...
[('id', '_Root'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root1'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root2'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root3'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root4'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root5'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root6'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]
[('id', '_Root7'), ('name', ''), ('description', ''), ('href', ''), ('webUrl', '')]

Now, one known issue (?): if your XML is broken in some way, for example a character is missing from a closing tag, lxml will not parse it. Well, it shouldn't.
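
A short sketch of that failure, assuming the standard-library parser for the demonstration (lxml raises its own `etree.XMLSyntaxError` in the same situation). The broken string is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The closing tag is missing its final "s", so the document is not well-formed.
broken = '<projects count="1"><project id="_Root" name=""/></project>'

try:
    ET.fromstring(broken)
    parsed = True
except ET.ParseError as exc:
    # The parser refuses the whole document rather than guessing at a repair.
    parsed = False
    print("parse failed:", exc)
```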

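In that case you have to rely on regular expressions, i.e. Python's re module. Broken data will force you to write and compile your own regex. For the data you have, for example, you could use the following one: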

(?:\<project id="(\w*?)" name="(\w*?)" description="(\w*?)" href="(\w*?)" webUrl="(\w*?)"\/\>)

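This extracts five groups per match. Each match covers one line, and each group corresponds to one attribute, which may be an empty string. See the Python docs for details. Also, this site is a good tool for prototyping and testing regexes.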
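
A minimal sketch of applying that pattern with `re.finditer`. The sample `data` string, with its deliberately truncated closing tag, and the names `PROJECT_RE` and `rows` are assumptions for illustration. The pattern is the one above with the unnecessary backslashes before `<`, `/` and `>` dropped, which does not change what it matches.

```python
import re

# Five capture groups, one per attribute, compiled once for reuse.
PROJECT_RE = re.compile(
    r'<project id="(\w*?)" name="(\w*?)" description="(\w*?)" '
    r'href="(\w*?)" webUrl="(\w*?)"/>'
)

# Made-up input whose closing tag is truncated, so an XML parser would reject it.
data = '''<projects count="2" href="/httpAuth/app/rest/projects/">
<project id="_Root" name="" description="" href="" webUrl=""/>
<project id="_Root1" name="" description="" href="" webUrl=""/>
</project'''

# Each match yields a 5-tuple: (id, name, description, href, webUrl).
rows = [match.groups() for match in PROJECT_RE.finditer(data)]

for project_id, name, description, href, web_url in rows:
    print(project_id, repr(name), repr(href))
```

Note that `\w` does not match characters such as `/` or `:`, so once href or webUrl hold real URLs those groups would likely need something like `[^"]*` instead.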

Python requests.get(): looping over the response?
