Pandas: load a DataFrame from a string

Time: 2018-10-24 04:16:27

Tags: python pandas csv dataframe data-cleaning

I want to scrape SEC EDGAR 13F forms (txt format) and parse them into a pandas.DataFrame.

Raw data link: https://www.sec.gov/Archives/edgar/data/1067983/000119312512060928/0001193125-12-060928.txt

I tried to extract the table with bs4, as follows:

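from urllib.request import urlopen

from bs4 import BeautifulSoup

def get_page(url):
    # Download the raw filing and return its bytes
    url_client = urlopen(url)
    page = url_client.read()
    url_client.close()
    return page

history_url = 'https://www.sec.gov/Archives/edgar/data/1067983/000119312513060317/0001193125-13-060317.txt'

# The call must use the same name as the function defined above
txt_soup = BeautifulSoup(get_page(history_url), 'xml')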

Then I extract the table from the soup:

table = txt_soup.find_all('TABLE')[0]
table_header = table.contents[1].contents[0]
table_data = table.contents[1].contents[1]

table_data looks like this:

<S> <C> <C> <C> <C> <C> <C> <C> <C> <C>
AMERICAN
  EXPRESS CO        COM           025816109      112,209     1,952,142 Shared-Defined 4           1,952,142       -        -
AMERICAN
  EXPRESS CO        COM           025816109      990,116    17,225,400 Shared-Defined 4, 5       17,225,400       -        -
AMERICAN
  EXPRESS CO        COM           025816109       48,274       839,832 Shared-Defined 4, 7          839,832       -        -
AMERICAN
  EXPRESS CO        COM           025816109      111,689     1,943,100 Shared-Defined 4, 8, 11    1,943,100       -        -
AMERICAN
  EXPRESS CO        COM           025816109      459,532     7,994,634 Shared-Defined 4, 10       7,994,634       -        -
AMERICAN
  EXPRESS CO        COM           025816109    6,912,308   120,255,879 Shared-Defined 4, 11     120,255,879       -        -
AMERICAN
  EXPRESS CO        COM           025816109       80,456     1,399,713 Shared-Defined 4, 13       1,399,713       -        -
ARCHER DANIELS
  MIDLAND CO        COM           039483102      163,151     5,956,600 Shared-Defined 4, 5        5,956,600       -        -

Now I want to convert this str into a pandas.DataFrame. I tried:

from io import StringIO

import pandas as pd

pd.read_csv(StringIO(table_data.text), header=None)

The code above fails with this error:

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 6

How can I parse this kind of txt table correctly? Is there a better way?

1 answer:

Answer 0: (score: 0)

I don't know much about Pandas DataFrames, but from looking at the code I believe I know where the problem is.

Line 3:

EXPRESS CO        COM           025816109      112,209     1,952,142 Shared-Defined 4           1,952,142       -        -

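The data appears to be split on commas (CSV files normally use a comma delimiter). The first data line contains no comma, so the parser expects one field per line, but on this line it sees six: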

EXPRESS CO        COM           025816109      112
209     1
952
142 Shared-Defined 4           1
952
142       -        -
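
The field count can be confirmed by splitting the sample line on commas in plain Python. This is only a small sketch; `line` and `fields` are illustrative names and are not part of the original code:

```python
# The third line of table_data, copied from the question
line = ("  EXPRESS CO        COM           025816109      112,209"
        "     1,952,142 Shared-Defined 4           1,952,142       -        -")

# Splitting on the default CSV delimiter: five commas give six pieces
fields = line.split(",")
print(len(fields))

# A line without commas, such as "AMERICAN", stays a single field
print(len("AMERICAN".split(",")))
```

That matches the message "Expected 1 fields in line 3, saw 6": the earlier lines set the expected width to one field, and the thousands separators in line 3 produce six.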

My suggested solution is to remove all commas from table_data:

# table_data is a bs4 node, so take its text before stripping the commas
table_text = table_data.text.replace(',', '')

Then try again with the comma-free text. Please let me know how it goes!
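
Note that removing the commas only avoids the tokenizing error: read_csv would still treat each line as a single comma-delimited field, and the issuer names wrap across two lines. One possible alternative is to split each data line with a regular expression. The sketch below assumes every data line has the layout shown in the question. The names `ROW`, `to_int`, `parse_table`, and all column keys are my own guesses at the 13F columns, since the table header is not shown, so treat them as hypothetical:

```python
import re

# Assumed layout of one data line; column names are guesses, not from the filing
ROW = re.compile(r"""
    ^\s*(?P<name>\S.*?)\s{2,}          # issuer name (or its wrapped tail)
    (?P<title>\S.*?)\s{2,}             # title of class
    (?P<cusip>[0-9A-Z]{9})\s+          # 9-character CUSIP
    (?P<value>[\d,]+)\s+               # market value
    (?P<shares>[\d,]+)\s+              # number of shares
    (?P<discretion>\S+)\s+             # investment discretion
    (?P<managers>\d+(?:,\s*\d+)*)\s+   # other managers list
    (?P<sole>[\d,]+|-)\s+              # voting authority: sole
    (?P<shared>[\d,]+|-)\s+            # voting authority: shared
    (?P<none>[\d,]+|-)\s*$             # voting authority: none
""", re.VERBOSE)


def to_int(text):
    """Turn '1,952,142' into 1952142; a lone '-' becomes None."""
    return None if text == "-" else int(text.replace(",", ""))


def parse_table(text):
    rows, pending = [], []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("<"):
            continue  # skip blank lines and the <S> <C> ruler line
        match = ROW.match(line)
        if match is None:
            pending.append(line.strip())  # first part of a wrapped name
            continue
        row = match.groupdict()
        row["name"] = " ".join(pending + [row["name"]])
        pending = []
        for key in ("value", "shares", "sole", "shared", "none"):
            row[key] = to_int(row[key])
        rows.append(row)
    return rows


sample = """<S> <C> <C> <C> <C> <C> <C> <C> <C> <C>
AMERICAN
  EXPRESS CO        COM           025816109      112,209     1,952,142 Shared-Defined 4           1,952,142       -        -
ARCHER DANIELS
  MIDLAND CO        COM           039483102      163,151     5,956,600 Shared-Defined 4, 5        5,956,600       -        -
"""
rows = parse_table(sample)
print(rows[0]["name"], rows[0]["shares"])
```

If pandas is available, `pd.DataFrame(parse_table(table_data.text))` should turn these dictionaries into a DataFrame with one column per key. Lines that deviate from the assumed layout would be treated as name fragments, so the result is worth checking against the original filing.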