How to extract a DataFrame from a malformed CSV

Time: 2019-11-27 13:48:29

Tags: python pandas csv dataframe

I have a bunch of oddly formatted CSVs, and I need to extract some of their data into a DataFrame. When I read a file with df = pd.read_csv(file), it looks like this:

            A       B       C      D     E
0   Account 1     111      20     10  12.0
1   Account 2     222      30     15   NaN
2   Account 3     333      40     25   NaN
3         NaN     NaN     NaN    NaN   NaN
4     Company    Name  Number  Price   NaN
5         AAA  AA Inc      15    100   NaN
6         NaN     NaN     NaN    NaN   NaN
7     Company     NaN     NaN    NaN   NaN
8          BB  BB Inc       5     20   NaN
9          CC  CC Inc      20     50   NaN
10         AA  AA Inc      12    100   NaN

But most of that is data I don't need. I want the output to look like this:

    Company    Name  Number  Price
0       AAA  AA Inc      15    100
1        BB  BB Inc       5     20
2        CC  CC Inc      20     50
3        AA  AA Inc      12    100

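I can't rely on fixed row indices, because there are multiple CSVs and the data I need doesn't always start on the same row, so the program has to be fairly flexible. I know I could write functions with special-case rules, but that seems error-prone and tedious.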

So is there an elegant way to do this?

1 Answer:

Answer 0: (score: 0)

Code:

import pandas as pd
import numpy as np


data_string = '''Account 1,111,20,10,12.0
Account 2,222,30,15,NaN
Account 3,333,40,25,NaN
NaN,NaN,NaN,NaN,NaN
Company,Name,Number,Price,NaN
AAA,AA Inc,15,100,NaN
NaN,NaN,NaN,NaN,NaN
Company,NaN,NaN,NaN,NaN
BB,BB Inc,5,20,NaN
CC,CC Inc,20,50,NaN
AA,AA Inc,12,100,NaN'''

df = pd.DataFrame(
    [x.split(',') for x in data_string.split('\n')],
    columns=list('ABCDE')).replace('NaN', np.nan)
print(df, '\n\n----\n')

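# Find the position of the first row whose column A is the 'Company' header
first_row = df['A'].to_list().index('Company')
# Keep everything from that row down, and only the first four columns
df = df.iloc[first_row:, :4]
# Promote the header row to column names, then drop it from the data
df.columns = df.iloc[0].values
df = df.drop(df.index[0])
# Remove repeated 'Company' header rows and rows with missing values
df = df[df['Company'] != 'Company'].dropna().reset_index(drop=True)
print(df)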

Output:

            A       B       C      D     E
0   Account 1     111      20     10  12.0
1   Account 2     222      30     15   NaN
2   Account 3     333      40     25   NaN
3         NaN     NaN     NaN    NaN   NaN
4     Company    Name  Number  Price   NaN
5         AAA  AA Inc      15    100   NaN
6         NaN     NaN     NaN    NaN   NaN
7     Company     NaN     NaN    NaN   NaN
8          BB  BB Inc       5     20   NaN
9          CC  CC Inc      20     50   NaN
10         AA  AA Inc      12    100   NaN

----

  Company    Name Number Price
0     AAA  AA Inc     15   100
1      BB  BB Inc      5    20
2      CC  CC Inc     20    50
3      AA  AA Inc     12   100
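
Since the question mentions several CSVs where the table starts on different rows, the same steps could be wrapped in a small helper. This is only a sketch under a few assumptions: the function name extract_company_table and its marker and n_cols parameters are made up for illustration, the file is read with header=None and dtype=str so no row is consumed as a header, and the sample text below mimics the data above with empty cells in place of the NaN values.

```python
import io

import pandas as pd


def extract_company_table(raw, marker="Company", n_cols=4):
    """Hypothetical helper: return the rows below the first `marker` header row."""
    # Boolean mask of rows whose first column equals the marker text.
    is_marker = raw.iloc[:, 0] == marker
    if not is_marker.any():
        raise ValueError(f"no row starting with {marker!r} found")
    # Position of the first True value in the mask.
    start = is_marker.to_numpy().argmax()

    # Keep the header row and everything below it, first n_cols columns only.
    block = raw.iloc[start:, :n_cols]
    body = block.iloc[1:].copy()
    body.columns = block.iloc[0].tolist()

    # Drop repeated header rows and rows with any missing cell.
    body = body[body[marker] != marker].dropna()
    return body.reset_index(drop=True)


# Sample text standing in for one of the files (assumed layout).
csv_text = """Account 1,111,20,10,12.0
Account 2,222,30,15,
Account 3,333,40,25,
,,,,
Company,Name,Number,Price,
AAA,AA Inc,15,100,
,,,,
Company,,,,
BB,BB Inc,5,20,
CC,CC Inc,20,50,
AA,AA Inc,12,100,
"""

# header=None keeps every line as data; dtype=str avoids mixed-type columns.
raw = pd.read_csv(io.StringIO(csv_text), header=None, dtype=str)
result = extract_company_table(raw)
print(result)
```

Because everything is read as text, Number and Price come back as strings. If numeric columns are needed, something like pd.to_numeric(result['Number']) should convert them afterwards.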