Here is my code:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.sec.gov/Archives/edgar/data/1288776/000119312512312575/goog-20120630.xml"
req = urllib2.Request(url, "r")
response = urllib2.urlopen(req)
xml = response.read()
soup = BeautifulSoup(xml, features="xml")
print soup.prettify()
The output shows only the first few lines of the target XML:
>>>
<?xml version="1.0" encoding="utf-8"?>
<!-- EDGAR Online I-Metrix Xcelerate Instance Document, based on XBRL 2.1 http://www.edgar-online.com/ -->
<!-- Version: 6.17.6 -->
<!-- Round: 8321e8af-cc4a-498e-a38d-da694ed77a41 -->
<!-- Creation date: 2012-07-24T16:17:46Z -->
<xbrl xmlns="http://www.xbrl.org/2003/instance" xmlns:country="http://xbr" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns:xbrll="http://www.xbrl.org/2003/linkbase" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>
How can I extract all of the XML?
Answer 0 (score: 0)
Have you tried using an opener?
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.sec.gov/Archives/edgar/data/1288776/000119312512312575/goog-20120630.xml"
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
resource = opener.open(url)
data = resource.read()
resource.close()
soup = BeautifulSoup(data)
print soup.prettify()
The code above works for me.
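For readers on Python 3, where urllib2 no longer exists, the opener approach above could be sketched roughly as follows. This is an illustration rather than the answerer's code: the helper names build_request and fetch are hypothetical, and the thread does not confirm whether the server still accepts this User-Agent value. The sketch also shows one detail that may matter for the question's code: the second positional argument of Request is the request body, so passing "r" there produces a POST instead of a GET.

```python
import urllib.request

URL = "http://www.sec.gov/Archives/edgar/data/1288776/000119312512312575/goog-20120630.xml"


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a GET request that carries an explicit User-Agent header.

    Hypothetical helper, not part of the original answer.
    """
    # No data argument is passed: urllib switches to POST only when a body is given.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the resource and return its raw bytes (needs network access)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


if __name__ == "__main__":
    req = build_request(URL)
    # Inspect the request without touching the network.
    print(req.get_method(), req.get_header("User-agent"))
    # For comparison: a body in the second position turns the request into a POST.
    print(urllib.request.Request(URL, b"r").get_method())
```

The bytes returned by fetch(URL) can then be passed to BeautifulSoup in the same way as data in the answer above.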
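Answer 1 (score: 0)
I actually just ran into this myself, although in my case I was reading the document from disk after fetching the complete SGML document from the SEC site over FTP. I had: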
soup = bs4.BeautifulSoup(xbrl, ["lxml", "xml"])
I changed it to:
soup = bs4.BeautifulSoup(xbrl, "lxml")
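...and then I was able to get all of the XML. I suspect your problem may be the extra features="xml" argument in the BeautifulSoup call. That would be consistent with Inbar Rose's answer, which passes no extra arguments to BeautifulSoup().
Good luck!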
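One way to tell whether the download or the parser is dropping content is to count the elements independently of BeautifulSoup. The sketch below uses the standard library's xml.etree.ElementTree on a small made-up sample. SAMPLE and the helper summarize are hypothetical and are not taken from the SEC filing. A self-closed root such as the <xbrl .../> in the question's output would show zero children.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for an XBRL instance document (bytes, as returned by response.read()).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<xbrl xmlns="http://www.xbrl.org/2003/instance">
  <context id="c1"><entity>Example</entity></context>
  <unit id="usd"/>
</xbrl>"""


def summarize(xml_bytes):
    """Return (root tag, number of direct children, total number of elements)."""
    root = ET.fromstring(xml_bytes)
    total = sum(1 for _ in root.iter())  # iter() includes the root itself
    return root.tag, len(root), total


if __name__ == "__main__":
    tag, children, total = summarize(SAMPLE)
    print(tag, children, total)
```

If the raw bytes contain many elements but the soup shows only the root, the parser choice is the likely culprit. With BeautifulSoup, a comparable count is len(soup.find_all(True)).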