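I am trying to get some items from a csv file, but the file has a varying number of columns, so I cannot read it with pandas.read_csv(filepath). I need to open it so that I can then select some of the items it contains. The csv file looks like this: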
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 1h.D","25-Mar-19, 11:55:48","DGM_CPTIS003 1h"," "
"INT FID1A.CH"
"Mon Mar 25 17:48:31 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.082, 2.063, 2.189,"BB ",223849319,4951058782,100.00, 46.349
2, 2.317, 2.281, 2.386,"BB ",73209942,1093871144, 22.09, 10.240
3, 3.343, 3.224, 3.403,"BB ",93165657,2220621038, 44.85, 20.788
4, 5.538, 5.409, 5.598,"BB ",51783798,1975386485, 39.90, 18.492
5, 5.744, 5.693, 5.803,"BB ",24084957,360235490, 7.28, 3.372
6, 8.716, 8.676, 8.776,"BB ",8566883, 80973220, 1.64, 0.758
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 2h.D","25-Mar-19, 12:15:42","DGM_CPTIS003 2h"," "
"INT FID1A.CH"
"Mon Mar 25 12:31:45 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.083, 2.064, 2.194,"BB ",232382153,5255486688,100.00, 59.673
2, 2.318, 2.282, 2.384,"BB ",37916041,587535474, 11.18, 6.671
3, 3.322, 3.241, 3.381,"BB ",67715293,1373898201, 26.14, 15.600
4, 5.509, 5.406, 5.569,"BB ",39502747,1227609422, 23.36, 13.939
5, 5.731, 5.689, 5.791,"BB ",17799521,230201751, 4.38, 2.614
6, 8.717, 8.674, 8.776,"BB ",12367646,132409300, 2.52, 1.503
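What I need to do is read the items under the headers Peak, R.T., Start, End, PK TY, ..., but I cannot, because those rows have a different length from the preceding ones (the rows under the headers Path, File, Date Acquired, ...). I cannot use skiprows to drop rows 0-3 and 11-14, because the number of rows in the part I want to read is not always the same (this kind of file is generated by an external program and I cannot change its structure). Is there a way to read only the part of the csv that belongs to the headers I need, so that I can then select the data I want from those values?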
Thanks in advance for your help.
Answer 0 (score: 1)
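You need to do some preprocessing. This kind of integration point is very common when you consume data from an external system.
The external file contains structured data: a series of records, each made of 5 header lines followed by CSV rows. The last header line holds the CSV column labels.
Read the content from the external file. Adapt the following code to your needs.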
external_file_content = r'''
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 1h.D","25-Mar-19, 11:55:48","DGM_CPTIS003 1h"," "
"INT FID1A.CH"
"Mon Mar 25 17:48:31 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.082, 2.063, 2.189,"BB ",223849319,4951058782,100.00, 46.349
2, 2.317, 2.281, 2.386,"BB ",73209942,1093871144, 22.09, 10.240
3, 3.343, 3.224, 3.403,"BB ",93165657,2220621038, 44.85, 20.788
4, 5.538, 5.409, 5.598,"BB ",51783798,1975386485, 39.90, 18.492
5, 5.744, 5.693, 5.803,"BB ",24084957,360235490, 7.28, 3.372
6, 8.716, 8.676, 8.776,"BB ",8566883, 80973220, 1.64, 0.758
"Path","File","Date Acquired","Sample","Misc"
"C:\msdchem\2\DATA\AlbertVirgili\DaniGM\","DGM_CPTIS003 2h.D","25-Mar-19, 12:15:42","DGM_CPTIS003 2h"," "
"INT FID1A.CH"
"Mon Mar 25 12:31:45 2019"
"Peak","R.T.","Start","End","PK TY","Height","Area","Pct Max","Pct Total"
1, 2.083, 2.064, 2.194,"BB ",232382153,5255486688,100.00, 59.673
2, 2.318, 2.282, 2.384,"BB ",37916041,587535474, 11.18, 6.671
3, 3.322, 3.241, 3.381,"BB ",67715293,1373898201, 26.14, 15.600
4, 5.509, 5.406, 5.569,"BB ",39502747,1227609422, 23.36, 13.939
5, 5.731, 5.689, 5.791,"BB ",17799521,230201751, 4.38, 2.614
6, 8.717, 8.674, 8.776,"BB ",12367646,132409300, 2.52, 1.503
'''
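In practice the content would come from the file on disk rather than from an inline string. A minimal sketch, assuming the export is a plain-text file; the helper name load_external_file, the file name data.csv and the utf-8 encoding are assumptions, not something the external program is known to guarantee:

```python
from pathlib import Path


def load_external_file(path, encoding="utf-8"):
    """Return the whole export as one string, ready to be split into records.

    The encoding is an assumption; adjust it to whatever the external
    program actually writes.
    """
    return Path(path).read_text(encoding=encoding)


# Hypothetical usage (the file name is an assumption):
# external_file_content = load_external_file("data.csv")
```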
Split the content into one part per record, using the well-defined delimiter (the first header line)
parts = external_file_content.split('"Path","File","Date Acquired","Sample","Misc"')
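Select one part to process further into a pandas DataFrame. parts[0] is only the text before the first delimiter, so the first record is parts[1]. Configure pd.read_csv to skip 4 lines (the line break left over from the split plus the three remaining header lines), so that the column labels become the header.
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(parts[1]), skiprows=4)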
Display the first rows of the DataFrame
df.head(5)
   Peak   R.T.  Start    End  PK TY     Height        Area  Pct Max  Pct Total
0     1  2.082  2.063  2.189     BB  223849319  4951058782   100.00     46.349
1     2  2.317  2.281  2.386     BB   73209942  1093871144    22.09     10.240
2     3  3.343  3.224  3.403     BB   93165657  2220621038    44.85     20.788
3     4  5.538  5.409  5.598     BB   51783798  1975386485    39.90     18.492
4     5  5.744  5.693  5.803     BB   24084957   360235490     7.28      3.372
Answer 1 (score: 1)
Filter out the non-numeric rows
import csv
import pandas as pd

def gen_rows(stream):
    for row in csv.reader(stream):
        # keep only rows whose first field (the peak number) is a number;
        # `row and` guards against empty lines
        if row and row[0].strip().isdigit():
            yield row

with open('data.csv', newline='') as fo:
    df = pd.DataFrame.from_records(gen_rows(fo),
                                   columns=["Peak", "R.T.", "Start", "End", "PK TY",
                                            "Height", "Area", "Pct Max", "Pct Total"])