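The function below fetches the XML from this URL: https://www.sec.gov/Archives/edgar/monthly/xbrlrss-2018-12.xml
Note that the XML contains many tags and attributes with the 'edgar:' prefix.
What is the simplest way to find every "edgar:" in the whole XML file and replace it with "edgar_"?
Thanks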
import requests
import urllib.request as urllib2
import xml.etree.ElementTree as ET
from lxml import etree

def quarter_filing_urls(year, month):
    # Build the monthly feed URL, e.g. .../xbrlrss-2018-12.xml
    url = "https://www.sec.gov/Archives/edgar/monthly/xbrlrss-" + str(year) + "-" + str(month) + ".xml"
    # Download the feed and parse it into an element tree
    tree = ET.parse(urllib2.urlopen(url))
    root = tree.getroot()
    return root
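One way to do the replacement is on the raw text, before it is parsed. The sketch below is only an illustration: `SAMPLE_XML` is a made-up miniature of the feed, the namespace URI in it is an assumption, and `strip_edgar_prefix` and `parse_without_prefix` are hypothetical helper names. A plain `str.replace` would also rewrite any "edgar:" that happens to appear in text content, so it is worth checking against the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the feed; the real file is much larger and the
# namespace URI here is an assumption.
SAMPLE_XML = """<rss xmlns:edgar="https://www.sec.gov/Archives/edgar" version="2.0">
  <channel>
    <item>
      <edgar:xbrlFiling>
        <edgar:companyName>EXAMPLE CO</edgar:companyName>
        <edgar:formType>10-K</edgar:formType>
      </edgar:xbrlFiling>
    </item>
  </channel>
</rss>"""


def strip_edgar_prefix(xml_text):
    """Rewrite every 'edgar:' prefix as 'edgar_' in the raw XML text."""
    return xml_text.replace("edgar:", "edgar_")


def parse_without_prefix(xml_text):
    """Parse the XML after the rewrite, so tags can be found without a namespace map."""
    return ET.fromstring(strip_edgar_prefix(xml_text))


root = parse_without_prefix(SAMPLE_XML)
names = [el.text for el in root.iter("edgar_companyName")]
print(names)  # ['EXAMPLE CO']
```

With the real feed, the same helper could be applied to the decoded response body, for example `urllib2.urlopen(url).read().decode("utf-8")`, assuming the feed is UTF-8 encoded.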
Update
One option is to use namespaces, as shown below. But when I try it, I get: AttributeError: 'set' object has no attribute 'items'
def quarter_filing_urls(year, month):
    url = "https://www.sec.gov/Archives/edgar/monthly/xbrlrss-" + str(year) + "-" + str(month) + ".xml"
    tree = ET.parse(urllib2.urlopen(url))
    root = tree.getroot()
    filings = []
    namespaces = {"edgar:xbrlFiling", 'rss'}
    for item in root.findall("./channel/item/edgar:xbrlFiling/", namespaces):
        filing = dict(item.attrib)
        filings.append(filing)
    return filings
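The error most likely comes from `namespaces` being a set literal. `findall` expects a dict that maps each prefix to its namespace URI and calls `.items()` on it. A minimal sketch of the dict form follows, assuming the URI shown matches the `xmlns:edgar` attribute on the real `<rss>` root; `SAMPLE_XML` and `extract_filings` are hypothetical names used only for illustration. The trailing `/` in the original path selects the children of each `edgar:xbrlFiling`, so the sketch drops it and reads the child elements explicitly.

```python
import xml.etree.ElementTree as ET

# Prefix -> URI mapping. The URI is an assumption; copy the actual value from
# the xmlns:edgar attribute of the feed's root element.
NAMESPACES = {"edgar": "https://www.sec.gov/Archives/edgar"}

# Hypothetical miniature of the feed, used here instead of a network call.
SAMPLE_XML = """<rss xmlns:edgar="https://www.sec.gov/Archives/edgar" version="2.0">
  <channel>
    <item>
      <edgar:xbrlFiling>
        <edgar:companyName>EXAMPLE CO</edgar:companyName>
        <edgar:formType>10-K</edgar:formType>
      </edgar:xbrlFiling>
    </item>
  </channel>
</rss>"""


def extract_filings(root):
    """Return one dict per edgar:xbrlFiling, keyed by the child's local tag name."""
    filings = []
    for filing in root.findall("./channel/item/edgar:xbrlFiling", NAMESPACES):
        # Parsed tags look like '{uri}companyName'; keep the part after '}'.
        record = {child.tag.split("}", 1)[-1]: child.text for child in filing}
        filings.append(record)
    return filings


root = ET.fromstring(SAMPLE_XML)
filings = extract_filings(root)
print(filings)
```

Because the lookup goes through the namespace map, this route leaves the XML text untouched and needs no "edgar:" to "edgar_" rewrite.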