我想访问子节点中存在的信息。这是因为文件的结构吗?
试图分别提取文件中的作者子节点信息并运行python代码。很好
import urllib
import xml.etree.ElementTree as ET
url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
print 'Retrieving', url
document = urllib.urlopen (url).read()
print 'Retrieved', len(document), 'characters.'
print document[:50]
tree = ET.fromstring(document)
lst = tree.findall('title')
print lst[:100]
答案 0 :(得分:1)
您可以使用xmltodict来从请求的XML数据生成python字典。
这是一个基本示例:
import urllib2
import xmltodict
def foobar(request):
file = urllib2.urlopen('https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml')
data = file.read()
file.close()
data = xmltodict.parse(data)
return {'xmldata': data}
答案 1 :(得分:0)
我通常更喜欢使用beautifulsoup 和{l1, l3}
解析器来解析xml。
下面的示例代码
lxml
输出
import requests
from bs4 import BeautifulSoup
url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
document = requests.get(url)
soup= BeautifulSoup(document.content,"lxml-xml")
print (soup.find("title"))
然后您可以使用beautifulsoup提供的方法,例如<title>These highlights do not include all the information needed to use ZOLOFT safely and effectively. See full prescribing information for ZOLOFT. <br/>
<br/>ZOLOFT (sertraline hydrochloride) tablets, for oral use <br/>ZOLOFT (sertraline hydrochloride) oral solution <br/>Initial U.S. Approval: 1991</title>
和find
来找到相应的节点或子节点
答案 2 :(得分:0)
由于名称空间,您找不到标题元素。
在示例代码下面找到:
import xml.etree.ElementTree as ET
import urllib.request
url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
response = urllib.request.urlopen(url).read()
tree = ET.fromstring(response)
for docTitle in tree.findall('{urn:hl7-org:v3}title'):
print(docTitle.text)
for compTitle in tree.findall('.//{urn:hl7-org:v3}title'):
print(compTitle.text)
更新
如果您需要搜索XML节点,则应使用xPath Expressions
示例:
NS = '{urn:hl7-org:v3}'
ID = '829076996' # ID TO BE FOUND
# XPATH TO FIND AUTHORS BY ID (search ID and return related author node)
xPathAuthorById = ''.join([
".//",
NS, "author/",
NS, "assignedEntity/",
NS, "representedOrganization/",
NS, "id[@extension='", ID,
"']/../../.."
])
# XPATH TO FIND AUTHOR NAME ELEMENT
xPathAuthorName = ''.join([
"./",
NS, "assignedEntity/",
NS, "representedOrganization/",
NS, "name"
])
# FOR EACH AUTHOR FOUND, SEARCH ATTRIBUTES (example name)
for author in tree.findall(xPathAuthorById):
name = author.find(xPathAuthorName)
print(name.text)
此示例显示ID 829076996的作者姓名
更新2
您可以使用findall方法轻松处理所有AssignedEntity标签。 对于它们每个,您都可以拥有多个产品,因此需要另一种findall方法(请参见下面的示例)。
xPathAssignedEntities = ''.join([
".//",
NS, "author/",
NS, "assignedEntity/",
NS, "representedOrganization/",
NS, "assignedEntity/",
NS, "assignedOrganization/",
NS, "assignedEntity"
])
xPathProdCode = ''.join([
NS, "actDefinition/",
NS, "product/",
NS, "manufacturedProduct/",
NS, "manufacturedMaterialKind/",
NS, "code"
])
# GET ALL assignedEntity TAGS
for assignedEntity in tree.findall(xPathAssignedEntities):
# GET ID AND NAME OF assignedEntity
id = assignedEntity.find(NS + 'assignedOrganization/'+ NS + 'id').get('extension')
name = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'name').text
# FOR EACH assignedEntity WE CAN HAVE MULTIPLE <performance> TAGS
for performance in assignedEntity.findall(NS + 'performance'):
actCode = performance.find(NS + 'actDefinition/'+ NS + 'code').get('displayName')
prodCode = performance.find(xPathProdCode).get('code')
print(id, '\t', name, '\t', actCode, '\t', prodCode)
这是结果:
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-0050
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4900
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4910
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4940
829084545 Pfizer Pharmaceuticals LLC ANALYSIS 0049-4960
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-0050
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4900
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4910
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4940
829084545 Pfizer Pharmaceuticals LLC API MANUFACTURE 0049-4960
829084545 Pfizer Pharmaceuticals LLC MANUFACTURE 0049-4900
829084545 Pfizer Pharmaceuticals LLC MANUFACTURE 0049-4910
829084545 Pfizer Pharmaceuticals LLC MANUFACTURE 0049-4960
829084545 Pfizer Pharmaceuticals LLC PACK 0049-4900
829084545 Pfizer Pharmaceuticals LLC PACK 0049-4910
829084545 Pfizer Pharmaceuticals LLC PACK 0049-4960
618054084 Pharmacia and Upjohn Company LLC ANALYSIS 0049-0050
618054084 Pharmacia and Upjohn Company LLC ANALYSIS 0049-4940
829084552 Pfizer Pharmaceuticals LLC PACK 0049-4900
829084552 Pfizer Pharmaceuticals LLC PACK 0049-4910
829084552 Pfizer Pharmaceuticals LLC PACK 0049-4960