Reading and selecting items from a CSV file with a varying number of columns

Date: 2019-04-11 15:15:58

Tags: python pandas csv

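I'm trying to extract some items from a CSV file, but its rows have different numbers of columns, so I can't read it with pandas.read_csv(filepath). I need to open it so that I can then select some of the values it contains. The CSV file looks like this (I've put a blank line between rows so that it is easier to read):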

"Path","File","Date Acquired","Sample","Misc"

"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 1h.D","25-Mar-19, 11:55:48","DGM_CPTIS003 1h"," "

"INT FID1A.CH"

"Mon Mar 25 17:48:31 2019"

"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"

1, 2.082, 2.063, 2.189,"BB ",223849319,4951058782,100.00, 46.349

2, 2.317, 2.281, 2.386,"BB ",73209942,1093871144, 22.09, 10.240

3, 3.343, 3.224, 3.403,"BB ",93165657,2220621038, 44.85, 20.788

4, 5.538, 5.409, 5.598,"BB ",51783798,1975386485, 39.90, 18.492

5, 5.744, 5.693, 5.803,"BB ",24084957,360235490, 7.28, 3.372

6, 8.716, 8.676, 8.776,"BB ",8566883, 80973220, 1.64, 0.758

"Path","File","Date Acquired","Sample","Misc"

"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 2h.D","25-Mar-19, 12:15:42","DGM_CPTIS003 2h"," "

"INT FID1A.CH"

"Mon Mar 25 12:31:45 2019"

"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"

1, 2.083, 2.064, 2.194,"BB ",232382153,5255486688,100.00, 59.673

2, 2.318, 2.282, 2.384,"BB ",37916041,587535474, 11.18, 6.671

3, 3.322, 3.241, 3.381,"BB ",67715293,1373898201, 26.14, 15.600

4, 5.509, 5.406, 5.569,"BB ",39502747,1227609422, 23.36, 13.939

5, 5.731, 5.689, 5.791,"BB ",17799521,230201751, 4.38, 2.614

6, 8.717, 8.674, 8.776,"BB ",12367646,132409300, 2.52, 1.503

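What I need is to read the items under the header Peak, R.T., Start, End, PK TY, ..., but I can't, because those rows have a different length from the preceding rows (the ones under the header Path, File, Date Acquired, ...). I can't use skiprows to drop rows 0-3 and 11-14 either, because the number of rows in the part I want to read is not always the same (this kind of file is generated by an external program, and I can't change its structure). Is there a way to read only the parts of the CSV that belong to the header I need, so that I can then select the data I want from those values?

Thanks in advance for your help.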

2 answers:

Answer 0 (score: 1)

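You need to do some preprocessing. This is very common at integration points where you consume data from an external system.

The external file contains structured data: a series of CSV rows in which each item starts with 5 header rows. The last header row contains the CSV column labels.

Read the content from the external file. Adapt the code below to your needs.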

external_file_content = r'''
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 1h.D","25-Mar-19, 11:55:48","DGM_CPTIS003 1h"," "
"INT FID1A.CH"
"Mon Mar 25 17:48:31 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.082, 2.063, 2.189,"BB ",223849319,4951058782,100.00, 46.349
2, 2.317, 2.281, 2.386,"BB ",73209942,1093871144, 22.09, 10.240
3, 3.343, 3.224, 3.403,"BB ",93165657,2220621038, 44.85, 20.788
4, 5.538, 5.409, 5.598,"BB ",51783798,1975386485, 39.90, 18.492
5, 5.744, 5.693, 5.803,"BB ",24084957,360235490, 7.28, 3.372
6, 8.716, 8.676, 8.776,"BB ",8566883, 80973220, 1.64, 0.758
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 2h.D","25-Mar-19, 12:15:42","DGM_CPTIS003 2h"," "
"INT FID1A.CH"
"Mon Mar 25 12:31:45 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.083, 2.064, 2.194,"BB ",232382153,5255486688,100.00, 59.673
2, 2.318, 2.282, 2.384,"BB ",37916041,587535474, 11.18, 6.671
3, 3.322, 3.241, 3.381,"BB ",67715293,1373898201, 26.14, 15.600
4, 5.509, 5.406, 5.569,"BB ",39502747,1227609422, 23.36, 13.939
5, 5.731, 5.689, 5.791,"BB ",17799521,230201751, 4.38, 2.614
6, 8.717, 8.674, 8.776,"BB ",12367646,132409300, 2.52, 1.503
'''

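Split the content into separate parts using a well-defined delimiter. Because the text starts with the delimiter, parts[0] holds only the leading newline and parts[1] is the first section.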

parts = external_file_content.split('"Path","File","Date Acquired","Sample","Misc"')

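Select one part to process further into a pandas DataFrame. Configure pd.read_csv to skip 4 rows: the empty line left over from the split plus the three header lines that precede the column labels.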

from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO(parts[1]), skiprows=4)

Display the first rows of the DataFrame:

df.head(5)


   Peak   R.T.  Start    End  PK TY     Height        Area  Pct Max  Pct Total
0     1  2.082  2.063  2.189     BB  223849319  4951058782   100.00     46.349
1     2  2.317  2.281  2.386     BB   73209942  1093871144    22.09     10.240
2     3  3.343  3.224  3.403     BB   93165657  2220621038    44.85     20.788
3     4  5.538  5.409  5.598     BB   51783798  1975386485    39.90     18.492
4     5  5.744  5.693  5.803     BB   24084957   360235490     7.28      3.372
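
The same steps can be applied to every section instead of only parts[1]. The following is a minimal sketch, assuming that each section starts with the "Path",... line and has exactly three more lines before the column labels. The helper name read_sections and the added File column are illustrative and not part of the original answer.

```python
import csv
from io import StringIO

import pandas as pd

# Assumption: every section starts with exactly this line.
DELIMITER = '"Path","File","Date Acquired","Sample","Misc"'


def read_sections(text):
    """Return one DataFrame with the peak tables of all sections (hypothetical helper)."""
    frames = []
    for part in text.split(DELIMITER):
        if not part.strip():
            continue  # the text before the first delimiter is empty
        lines = part.strip().splitlines()
        # lines[0] holds the path/file values, lines[1:3] the two extra
        # header lines, lines[3] the column labels, then the data rows.
        info = next(csv.reader([lines[0]]))
        frame = pd.read_csv(StringIO("\n".join(lines[3:])), skipinitialspace=True)
        frame.insert(0, "File", info[1])  # remember which section a row came from
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Since the number of data rows is never counted, sections of different lengths are handled the same way. With the sample above, `read_sections(external_file_content)` should give one table in which, for example, `df[df["File"] == "DGM_CPTIS003 1h.D"]` selects the peaks of the first run.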

Answer 1 (score: 1)

Filter out the rows that do not start with a number:

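import csv
import pandas as pd

def gen_rows(stream):
    """Yield only the rows whose first field is an integer (the peak number)."""
    for row in csv.reader(stream):
        # Skip blank lines and header rows; keep the Peak value in the row
        if row and row[0].strip().isdigit():
            yield [field.strip() for field in row]

columns = ["Peak", "R.T.", "Start", "End", "PK TY",
           "Height", "Area", "Pct Max", "Pct Total"]

with open('data.csv', newline='') as fo:
    df = pd.DataFrame.from_records(gen_rows(fo), columns=columns)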