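I have an XML page containing an href link that takes me to the next page; the last XML page has no href element. I need to download all of the XML pages recursively, and I am looking for Python code that would help me do this quickly.
Any help?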
Answer 0 (score: 0)
You can use the following code to collect the hrefs from successive pages, then visit them or do whatever else you need with them:
import xml.etree.ElementTree as ET
import requests
from requests.auth import HTTPBasicAuth

def iterate_xml_automate(link):
    # Placeholder credentials; replace with your own
    auth = HTTPBasicAuth('login', 'Password')
    # Start the list of visited URLs with the parent page
    all_href = [link]
    # Parse the parent page
    tree = ET.fromstring(requests.get(link, auth=auth).text.encode('utf-8'))
    # Collect the href attribute of every <link> element in the XML tree
    href = [el.attrib['href'] for el in tree.iter('link')]
    # Keep following the first href until a page has no <link> element
    while len(href) != 0:
        all_href.append(href[0])
        tree = ET.fromstring(requests.get(str(href[0]), auth=auth).text.encode('utf-8'))
        # Update href to point at the next XML page
        href = [el.attrib['href'] for el in tree.iter('link')]
    # Return the URLs of all pages visited, in order
    return all_href
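
Since the question asks for the XML pages themselves and not only their URLs, here is a minimal sketch of a variant that also keeps each page's body. The names `crawl_xml_pages`, `fetch`, `max_pages`, and `fetch_with_requests` are hypothetical and not part of the answer above. It assumes each page has at most one relevant `link` element, resolves relative hrefs against the current page URL, and stops if a page links back to one already visited.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def crawl_xml_pages(start_url, fetch, max_pages=1000):
    """Follow href links page by page and return {url: raw XML bytes}.

    `fetch` is any callable that takes a URL and returns the page body,
    which keeps the crawling logic independent of the HTTP library.
    """
    pages = {}
    url = start_url
    # Stop on the last page (no href), on a repeated URL, or at the page limit
    while url and url not in pages and len(pages) < max_pages:
        body = fetch(url)
        pages[url] = body
        root = ET.fromstring(body)
        # First <link> element that actually carries an href attribute
        link = next((el for el in root.iter('link') if 'href' in el.attrib), None)
        # Resolve relative hrefs against the URL of the current page
        url = urljoin(url, link.attrib['href']) if link is not None else None
    return pages


def fetch_with_requests(url):
    """Example fetcher using requests with placeholder credentials."""
    # Imported here so crawl_xml_pages can be used without requests installed
    import requests
    from requests.auth import HTTPBasicAuth

    response = requests.get(url, auth=HTTPBasicAuth('login', 'Password'))
    response.raise_for_status()
    return response.content
```

Calling `crawl_xml_pages(start_url, fetch_with_requests)` would return the pages in the order they were visited, and each body could then be written to a file. Note that `root.iter('link')` only matches un-namespaced `link` elements; if the XML declares a namespace, the tag name would need to include it.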