I'm trying to scrape the holdings table at the bottom of this page to get the information in each column: https://www.sec.gov/Archives/edgar/data/1412093/000114036111027807/0001140361-11-027807.txt
What I have so far is:
from bs4 import BeautifulSoup
import urllib2
import datetime
import sys

def scrape(url):
    htmlfile = urllib2.urlopen(url)
    htmltext = htmlfile.read()
    bs = BeautifulSoup(htmltext)
    tables = bs.find_all('table')
    for table in tables:
        print table

if __name__ == '__main__':
    url = 'https://www.sec.gov/Archives/edgar/data/1412093/000114036111027807/0001140361-11-027807.txt'
    scrape(url)
However, this only gets me the table as one big blob, and I can't seem to parse it further, row by row. Any help with this would be much appreciated, thanks!
Answer 0 (score: 0)
The problem is that this isn't an HTML table with rows and cells, but a block of whitespace-separated columns, so you have to parse it differently. Here is a very naive but working solution that uses splitlines() to split the table text into rows and split() to split each row into columns:
import urllib2
from bs4 import BeautifulSoup

def scrape(url):
    htmlfile = urllib2.urlopen(url)
    htmltext = htmlfile.read()
    bs = BeautifulSoup(htmltext, "html.parser")
    data = bs.find('table').get_text().splitlines()[10:]
    for line in data:
        print([item for item in line.split()])

if __name__ == '__main__':
    url = 'https://www.sec.gov/Archives/edgar/data/1412093/000114036111027807/0001140361-11-027807.txt'
    scrape(url)
This prints:
['ADVENTRX', 'PHARMAMACEUTICALS', 'INC', 'COM', 'NEW', '00764X202', '289', '138,377', 'SH', 'SOLE', 'N/A', '138,377']
['AMGEN', 'INC', 'COM', '31162100', '54,519', '1,020,000', 'SH', 'SOLE', 'N/A', '1,020,000']
...
['SOUTHERN', 'UN', 'CO', 'NEW', 'COM', '844030106', '5,328', '186,154', 'SH', 'SOLE', 'N/A', '186,154']
['TAKE-TWO', 'INTERACTIVE', 'SOFTWAR', 'COM', '874054109', '151,310', '9,844,502', 'SH', 'SOLE', 'N/A', '9,844,502']
The least reliable part is the [10:] slice, which just skips the header lines. I'll leave improving that to you.
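One way to avoid the hard-coded slice is to filter lines instead of counting them. This is a rough sketch, not part of the original answer, and it assumes that every holdings row in this particular filing contains the "SH" share-type marker and ends with a numeric total, as the sample output above suggests:

import re
import urllib2
from bs4 import BeautifulSoup

# Assumption: a data row ends with a comma-separated number
# (the final shares column in the sample output above).
ROW_END = re.compile(r'\d[\d,]*\s*$')

def scrape(url):
    htmltext = urllib2.urlopen(url).read()
    bs = BeautifulSoup(htmltext, "html.parser")
    for line in bs.find('table').get_text().splitlines():
        fields = line.split()
        # Keep only lines that look like holdings rows: they contain
        # the "SH" token and end in a number; header and blank lines don't.
        if 'SH' in fields and ROW_END.search(line):
            print(fields)

if __name__ == '__main__':
    url = 'https://www.sec.gov/Archives/edgar/data/1412093/000114036111027807/0001140361-11-027807.txt'
    scrape(url)

Note that split() also breaks multi-word issuer names into separate tokens, so the rows won't all have the same number of columns; if you need the columns to line up, you would have to slice each line by character position instead of splitting on whitespace.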