Reading text into a DataFrame

Time: 2016-02-21 05:59:49

Tags: python pandas dataframe

How can I read text like this into a pandas DataFrame? It is a plain text file.

<TABLE>
<CAPTION>
                                                  FORM 13F INFORMATION TABLE

          COLUMN 1               COLUMN 2     COLUMN 3   COLUMN 4        COLUMN 5        COLUMN 6  COLUMN 7        COLUMN 8
---------------------------- ---------------- --------- ----------- ------------------- ---------- -------- ----------------------
                                                           VALUE     SHRS OR   SH/ PUT/ INVESTMENT  OTHER      VOTING AUTHORITY
       NAME OF ISSUER         TITLE OF CLASS    CUSIP    (x$1000)    PRN AMT   PRN CALL DISCRETION MANAGERS    SOLE    SHARED NONE
---------------------------- ---------------- --------- ----------- ---------- --- ---- ---------- -------- ---------- ------ ----
<S>                          <C>              <C>       <C>         <C>        <C> <C>  <C>        <C>      <C>        <C>    <C>
7 DAYS GROUP HLDGS LTD       ADR              81783J101   19,317       999,322 SH       SOLE                   999,322      0    0
ACCENTURE PLC IRELAND        SHS CLASS A      G1151C101  200,952     3,325,917 SH       SOLE                 3,325,917      0    0
ACCRETIVE HEALTH INC         COM              00438V103   85,394     2,966,088 SH       SOLE                 2,966,088      0    0

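I have tried read_csv and read_table, but I am not sure how to separate the columns. Using " " as the separator does not work.

2 answers:

Answer 0: (score: 1)

I created a text file named mytext.txt on my computer and read it with read_fwf, which parses fixed-width formatted lines, instead of read_csv.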

pd.read_fwf('mytext.txt', skiprows=4)

It produces something that looks like this:

                       COLUMN 1          COLUMN 2  \
0  ----------------------------  ----------------   
1                           NaN               NaN   
2                NAME OF ISSUER    TITLE OF CLASS   
3  ----------------------------  ----------------   
4                           <S>               <C>   
5        7 DAYS GROUP HLDGS LTD               ADR   
6         ACCENTURE PLC IRELAND       SHS CLASS A   
7          ACCRETIVE HEALTH INC               COM   

         COLUMN 3   COLUMN 4        COLUMN 5    COLUMN 6  COLUMN 7  \
0  --------- ----------- -------------------  ----------  --------   
1               VALUE     SHRS OR   SH/ PUT/  INVESTMENT     OTHER   
2    CUSIP    (x$1000)    PRN AMT   PRN CALL  DISCRETION  MANAGERS   
3  --------- ----------- ---------- --- ----  ----------  --------   
4   <C>       <C>         <C>        <C> <C>         <C>       <C>   
5        81783J101   19,317       999,322 SH        SOLE       NaN   
6        G1151C101  200,952     3,325,917 SH        SOLE       NaN   
7        00438V103   85,394     2,966,088 SH        SOLE       NaN   

                 COLUMN 8  
0  ----------------------  
1        VOTING AUTHORITY  
2     SOLE    SHARED NONE  
3  ---------- ------ ----  
4   <C>        <C>    <C>  
5     999,322      0    0  
6   3,325,917      0    0  
7   2,966,088      0    0 

I am not sure whether this is the format you want, but you can experiment with skiprows to try to get the data into the right columns.
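One way to avoid the merged columns is to give read_fwf explicit column boundaries instead of letting it infer them. The following is only a sketch under some assumptions: it takes the last dashed ruler line as the definition of the 12 real columns, treats everything after the `<S> <C> ...` marker line as data, and uses column names that are my own labels read off the two-row header, not anything the file defines.

```python
import io
import re

import pandas as pd

# Ruler, marker line and data rows copied from the question;
# in practice this text would be read from the file.
text = """---------------------------- ---------------- --------- ----------- ---------- --- ---- ---------- -------- ---------- ------ ----
<S>                          <C>              <C>       <C>         <C>        <C> <C>  <C>        <C>      <C>        <C>    <C>
7 DAYS GROUP HLDGS LTD       ADR              81783J101   19,317       999,322 SH       SOLE                   999,322      0    0
ACCENTURE PLC IRELAND        SHS CLASS A      G1151C101  200,952     3,325,917 SH       SOLE                 3,325,917      0    0
ACCRETIVE HEALTH INC         COM              00438V103   85,394     2,966,088 SH       SOLE                 2,966,088      0    0"""

lines = text.splitlines()

# Assumption: the last dashed line marks the true column boundaries.
ruler = [ln for ln in lines if ln.startswith('-')][-1]
colspecs = [m.span() for m in re.finditer(r'-+', ruler)]

# Assumption: data rows start right after the <S> <C> ... marker line.
start = next(i for i, ln in enumerate(lines) if ln.startswith('<S>')) + 1
body = '\n'.join(lines[start:])

# Hypothetical labels, written by hand from the two header rows.
names = ['NAME OF ISSUER', 'TITLE OF CLASS', 'CUSIP', 'VALUE (x$1000)',
         'SHRS OR PRN AMT', 'SH/PRN', 'PUT/CALL', 'INVESTMENT DISCRETION',
         'OTHER MANAGERS', 'VOTING SOLE', 'VOTING SHARED', 'VOTING NONE']

# Read every field as text first so that CUSIP keeps its leading zeros.
df = pd.read_fwf(io.StringIO(body), colspecs=colspecs,
                 header=None, names=names, dtype=str)

# Strip the thousands separators and convert the numeric columns.
numeric = ['VALUE (x$1000)', 'SHRS OR PRN AMT',
           'VOTING SOLE', 'VOTING SHARED', 'VOTING NONE']
for col in numeric:
    df[col] = pd.to_numeric(df[col].str.replace(',', '', regex=False))

print(df.dtypes)
```

Deriving colspecs from the ruler keeps the boundaries in step with the file, but it only works if every data row is aligned to that ruler; rows that overflow their column would be cut off.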

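Answer 1: (score: 1)

I think it is more complicated, because read_fwf parses some columns together. Columns 3 - 5 need some post-processing into a DataFrame cols1, and column 8 into a DataFrame cols2, using str.split, shift, iloc and drop. Then use concat to merge everything together: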

import pandas as pd
import io

temp=u"""<TABLE>
<CAPTION>
                                                  FORM 13F INFORMATION TABLE

          COLUMN 1               COLUMN 2     COLUMN 3   COLUMN 4        COLUMN 5        COLUMN 6  COLUMN 7        COLUMN 8
---------------------------- ---------------- --------- ----------- ------------------- ---------- -------- ----------------------
                                                           VALUE     SHRS OR   SH/ PUT/ INVESTMENT  OTHER      VOTING AUTHORITY
       NAME OF ISSUER         TITLE OF CLASS    CUSIP    (x$1000)    PRN AMT   PRN CALL DISCRETION MANAGERS    SOLE    SHARED NONE
---------------------------- ---------------- --------- ----------- ---------- --- ---- ---------- -------- ---------- ------ ----
<S>                          <C>              <C>       <C>         <C>        <C> <C>  <C>        <C>      <C>        <C>    <C>
7 DAYS GROUP HLDGS LTD       ADR              81783J101   19,317       999,322 SH       SOLE                   999,322      0    0
ACCENTURE PLC IRELAND        SHS CLASS A      G1151C101  200,952     3,325,917 SH       SOLE                 3,325,917      0    0
ACCRETIVE HEALTH INC         COM              00438V103   85,394     2,966,088 SH       SOLE                 2,966,088      0    0"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_fwf(io.StringIO(temp), skiprows=[0,1,2,3,5,8,9])
print(df)
                 COLUMN 1        COLUMN 2  \
0                     NaN             NaN   
1          NAME OF ISSUER  TITLE OF CLASS   
2  7 DAYS GROUP HLDGS LTD             ADR   
3   ACCENTURE PLC IRELAND     SHS CLASS A   
4    ACCRETIVE HEALTH INC             COM   

       COLUMN 3   COLUMN 4        COLUMN 5    COLUMN 6  COLUMN 7  \
0             VALUE     SHRS OR   SH/ PUT/  INVESTMENT     OTHER   
1  CUSIP    (x$1000)    PRN AMT   PRN CALL  DISCRETION  MANAGERS   
2      81783J101   19,317       999,322 SH        SOLE       NaN   
3      G1151C101  200,952     3,325,917 SH        SOLE       NaN   
4      00438V103   85,394     2,966,088 SH        SOLE       NaN   

                COLUMN 8  
0       VOTING AUTHORITY  
1    SOLE    SHARED NONE  
2    999,322      0    0  
3  3,325,917      0    0  
4  2,966,088      0    0  
# split the merged COLUMN 3 - COLUMN 5 text on whitespace into a new df
cols1 = df.iloc[:, 2].str.split(expand=True)

# shift the first header row right by one (nothing sits above CUSIP)
cols1.iloc[0,:] = cols1.iloc[0,:].shift()

# concatenate the header words that were split apart ("SHRS OR", "PRN AMT")
cols1.iloc[[0,1], 2] = cols1.iloc[[0,1], 2] + ' ' + cols1.iloc[[0,1], 3]

# move the "SH/" and "PRN" header words into column 3
cols1.iloc[[0,1], 3] = cols1.iloc[[0,1], 4]

# remove column 4
cols1 = cols1.drop(4, axis=1)
# remove the thousands separators in columns 1 and 2
cols1.iloc[2:,1] = cols1.iloc[2:,1].str.replace(',', '')
cols1.iloc[2:,2] = cols1.iloc[2:,2].str.replace(',', '')
print(cols1)
           0         1        2    3     5
0        NaN     VALUE  SHRS OR  SH/  PUT/
1      CUSIP  (x$1000)  PRN AMT  PRN  CALL
2  81783J101     19317   999322   SH  None
3  G1151C101    200952  3325917   SH  None
4  00438V103     85394  2966088   SH  None    
# split the merged COLUMN 8 text on whitespace into a new df
cols2 = df.iloc[:, 5].str.split(expand=True)
# remove the thousands separators
cols2.iloc[2:,0] = cols2.iloc[2:,0].str.replace(',', '')
print(cols2)
         0          1     2
0   VOTING  AUTHORITY  None
1     SOLE     SHARED  NONE
2   999322          0     0
3  3325917          0     0
4  2966088          0     0
df = pd.concat([df.iloc[:,[0,1]], cols1, df.iloc[:,[3,4]], cols2], axis=1)
df.columns = range(12)
print(df)
                       0               1          2         3        4    5   \
0                     NaN             NaN        NaN     VALUE  SHRS OR  SH/   
1          NAME OF ISSUER  TITLE OF CLASS      CUSIP  (x$1000)  PRN AMT  PRN   
2  7 DAYS GROUP HLDGS LTD             ADR  81783J101     19317   999322   SH   
3   ACCENTURE PLC IRELAND     SHS CLASS A  G1151C101    200952  3325917   SH   
4    ACCRETIVE HEALTH INC             COM  00438V103     85394  2966088   SH   

     6           7         8        9          10    11  
0  PUT/  INVESTMENT     OTHER   VOTING  AUTHORITY  None  
1  CALL  DISCRETION  MANAGERS     SOLE     SHARED  NONE  
2  None        SOLE       NaN   999322          0     0  
3  None        SOLE       NaN  3325917          0     0  
4  None        SOLE       NaN  2966088          0     0 

If you need the column names from rows 1 and 2, use reset_index, and then convert the string columns with to_numeric:

# combine the two header rows into one
df.iloc[1, 3:11] = df.iloc[0, 3:11] + ' ' + df.iloc[1, 3:11]

df.columns = df.iloc[1,:]

# the data start at row 2 (the first two rows are the header)
df1 = df.iloc[2:,:].reset_index(drop=True)
df1.columns.name = None

# assign by column label so that the dtype actually changes
# (in recent pandas, df1.iloc[:, i] = ... writes in place and can keep the object dtype)
for col in df1.columns[[3, 4, 9, 10]]:
    df1[col] = pd.to_numeric(df1[col])
print(df1)
           NAME OF ISSUER TITLE OF CLASS      CUSIP  VALUE (x$1000)  \
0  7 DAYS GROUP HLDGS LTD            ADR  81783J101           19317   
1   ACCENTURE PLC IRELAND    SHS CLASS A  G1151C101          200952   
2    ACCRETIVE HEALTH INC            COM  00438V103           85394   

   SHRS OR PRN AMT SH/ PRN PUT/ CALL INVESTMENT DISCRETION OTHER MANAGERS  \
0           999322      SH      None                  SOLE            NaN   
1          3325917      SH      None                  SOLE            NaN   
2          2966088      SH      None                  SOLE            NaN   

   VOTING SOLE  AUTHORITY SHARED NONE  
0       999322                 0    0  
1      3325917                 0    0  
2      2966088                 0    0  

print(df1.dtypes)
NAME OF ISSUER           object
TITLE OF CLASS           object
CUSIP                    object
VALUE (x$1000)            int64
SHRS OR PRN AMT           int64
SH/ PRN                  object
PUT/ CALL                object
INVESTMENT DISCRETION    object
OTHER MANAGERS           object
VOTING SOLE               int64
AUTHORITY SHARED          int64
NONE                     object
dtype: object
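
In the dtypes above, the NONE column is still object although it only holds zeros, because it was not among the converted columns. As a rough sketch (the helper name convert_numeric_columns, its skip parameter and the small stand-in frame are my own, not part of the answer), the remaining columns could be converted by trying to_numeric on each one and keeping the result only when no value is lost:

```python
import pandas as pd

# Small stand-in for df1: strings as they come out of str.split.
df1 = pd.DataFrame({
    'CUSIP': ['81783J101', 'G1151C101', '00438V103'],
    'VOTING SOLE': ['999322', '3325917', '2966088'],
    'NONE': ['0', '0', '0'],
    'PUT/ CALL': [None, None, None],
})


def convert_numeric_columns(frame, skip=()):
    """Return a copy with every fully numeric-looking column converted.

    Columns listed in skip are left alone; this matters for identifiers
    such as an all-digit CUSIP, which would lose its leading zeros.
    """
    out = frame.copy()
    for col in out.columns:
        if col in skip:
            continue
        present = out[col].notna()
        converted = pd.to_numeric(out[col], errors='coerce')
        # Keep the conversion only if the column has values and
        # none of them turned into NaN.
        if present.any() and converted.notna().sum() == present.sum():
            out[col] = converted
    return out


result = convert_numeric_columns(df1, skip=('CUSIP',))
print(result.dtypes)
```

Columns that are entirely empty, such as PUT/ CALL here, are left untouched.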