I want to scrape SEC EDGAR 13F filings (txt format) and parse them into a pandas.DataFrame.
Raw data link: https://www.sec.gov/Archives/edgar/data/1067983/000119312512060928/0001193125-12-060928.txt
I tried to extract the table with bs4, like this:
from urllib.request import urlopen

from bs4 import BeautifulSoup

def get_page(url):
    # download the raw filing text from EDGAR
    url_client = urlopen(url)
    page = url_client.read()
    url_client.close()
    return page

history_url = 'https://www.sec.gov/Archives/edgar/data/1067983/000119312513060317/0001193125-13-060317.txt'
txt_soup = BeautifulSoup(get_page(history_url), 'xml')
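One practical note on the download step: SEC EDGAR may reject automated requests that do not declare a User-Agent identifying the caller, so if the fetch itself fails, a variant of get_page along these lines might help (the contact string is a placeholder):

from urllib.request import Request, urlopen

def get_page(url):
    # EDGAR asks automated clients to identify themselves in the User-Agent header
    req = Request(url, headers={'User-Agent': 'your-name your-email@example.com'})
    with urlopen(req) as url_client:
        return url_client.read()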
Then I extract the table from the soup:
table = txt_soup.find_all('TABLE')[0]
table_header = table.contents[1].contents[0]
table_data = table.contents[1].contents[1]
table_data
which looks like this:
<S> <C> <C> <C> <C> <C> <C> <C> <C> <C>
AMERICAN
EXPRESS CO COM 025816109 112,209 1,952,142 Shared-Defined 4 1,952,142 - -
AMERICAN
EXPRESS CO COM 025816109 990,116 17,225,400 Shared-Defined 4, 5 17,225,400 - -
AMERICAN
EXPRESS CO COM 025816109 48,274 839,832 Shared-Defined 4, 7 839,832 - -
AMERICAN
EXPRESS CO COM 025816109 111,689 1,943,100 Shared-Defined 4, 8, 11 1,943,100 - -
AMERICAN
EXPRESS CO COM 025816109 459,532 7,994,634 Shared-Defined 4, 10 7,994,634 - -
AMERICAN
EXPRESS CO COM 025816109 6,912,308 120,255,879 Shared-Defined 4, 11 120,255,879 - -
AMERICAN
EXPRESS CO COM 025816109 80,456 1,399,713 Shared-Defined 4, 13 1,399,713 - -
ARCHER DANIELS
MIDLAND CO COM 039483102 163,151 5,956,600 Shared-Defined 4, 5 5,956,600 - -
Now I want to convert this string into a pandas.DataFrame, which I tried with:
import pandas as pd
from io import StringIO

pd.read_csv(StringIO(table_data.text), header=None)
The code above fails with the error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 6
How can I parse this kind of txt table correctly? Is there a better way to do it?
Answer 0 (score: 0)
I don't know much about pandas DataFrames, but looking at the code I believe I can see where the problem lies.
Line 3:
EXPRESS CO COM 025816109 112,209 1,952,142 Shared-Defined 4 1,952,142 - -
appears to be split on commas (since csv files typically use a comma delimiter), so instead of one field it produces six:
EXPRESS CO COM 025816109 112
209 1
952
142 Shared-Defined 4 1
952
142 - -
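A quick way to confirm that count in Python (copying the line from the question):

line = 'EXPRESS CO COM 025816109 112,209 1,952,142 Shared-Defined 4 1,952,142 - -'
print(len(line.split(',')))   # 6, which matches "Expected 1 fields in line 3, saw 6"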
The solution I would suggest is to strip all the commas out of the text before parsing it, applying the replace to the same string that was passed to read_csv:
table_text = table_data.text.replace(',', '')
Then try again with table_text. Please let me know how it goes!
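If it helps, here is a minimal sketch of that idea (assuming table_data is the element extracted in the question; the names table_text, rows and df are mine). Rather than retrying read_csv with the default comma separator, it splits each line on whitespace, since that is what actually separates the columns once the commas are gone:

import pandas as pd

# table_data is the element extracted earlier with BeautifulSoup
table_text = table_data.text.replace(',', '')   # drop the thousands separators
# split each non-empty line on whitespace, since that is what separates the columns
rows = [line.split() for line in table_text.splitlines() if line.strip()]
df = pd.DataFrame(rows)   # rows of different lengths are padded with None

Company names that wrap onto their own line and the multi-number footnote references still produce rows of different lengths, so the frame would need further cleanup, but this at least avoids the ParserError.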