Cleaning a DataFrame with pandas

Time: 2018-12-12 13:46:34

Tags: python pandas dataframe data-cleaning

I am currently learning pandas and have run into a problem while cleaning a DataFrame:

"TIMESTAMP","RECORD","WM1_u_ms","WM1_v_ms","WM1_w_ms","WM2_u_ms","WM2_v_ms","WM2_w_ms","WS1_u_ms","WS1_v_ms"
"2018-04-06 14:31:11.5",29699805,2.628,4.629,0.599,3.908,7.971,0.47,2.51,7.18
"2018-04-06 14:31:11.75",29699806,3.264,4.755,-0.095,2.961,6.094,-0.504,2.47,7.18
"2018-04-06 14:31:12",29699807,1.542,5.793,0.698,4.95,4.91,0.845,2.18,7.5
"2018-04-06 14:31:12.25",29699808,2.527,5.207,0.012,4.843,6.285,0.924,2.15,7.4
"2018-04-06 14:31:12.5",29699809,3.511,4.528,1.059,2.986,5.636,0.949,3.29,5.54
"2018-04-06 14:31:12.75",29699810,3.445,3.957,-0.075,3.127,6.561,0.259,3.85,5.45
"2018-04-06 14:31:13",29699811,2.624,5.238,-0.166,3.451,7.199,0.242,3.94,6.24

df = pd.read_csv(FilePath,parse_dates=True)  #read the csv file and save it into a variable
df = df.drop(['RECORD'],axis=1)

dtypes

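I don't understand why pandas reads some columns as float64 and others as object. Do you have any clues? Because of this, I started trying to convert the columns myself: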

df['TIMESTAMP'] = pd.to_datetime(df['TIMESTAMP'])
df['WM1_u_ms':] = df.iloc[:, df.columns != 'TIMESTAMP'].values.astype(float)

But I get an error:

cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'> with these indexers [WM1_u_ms] of <class 'str'>

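Why can't pandas read the .dat file correctly from the start, and what is wrong with my conversion? As a next step, I want to interpolate with df.interpolate() to get rid of the NaNs.

Thanks for your help!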

1 answer:

Answer 0: (score: 4)

I think you can create a DatetimeIndex directly in read_csv with the parameters parse_dates and index_col:

df = pd.read_csv(FilePath, parse_dates=['TIMESTAMP'], index_col=['TIMESTAMP'])

df = df.drop(['RECORD'],axis=1)
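On the error quoted in the question: `df['WM1_u_ms':]` is a row slice, so pandas looks for the string `'WM1_u_ms'` in the row index (a RangeIndex) and fails. A minimal sketch of a label-based column slice with `.loc` instead, assuming a small made-up frame (the data below is illustrative, not the asker's file):

```python
import pandas as pd

# Hypothetical miniature of the question's frame, with values stored as strings
df = pd.DataFrame({
    "TIMESTAMP": ["2018-04-06 14:31:11.50", "2018-04-06 14:31:11.75"],
    "WM1_u_ms": ["2.628", "3.264"],
    "WM1_v_ms": ["4.629", "4.755"],
})

# .loc[:, 'WM1_u_ms':] slices COLUMNS by label (all rows, from WM1_u_ms onward)
value_cols = df.loc[:, "WM1_u_ms":].columns

# Assigning by a list of column labels replaces those columns with float ones
df[value_cols] = df[value_cols].astype(float)
df["TIMESTAMP"] = pd.to_datetime(df["TIMESTAMP"])

print(df.dtypes)
```

Note that `astype(float)` raises an error as soon as a value is not a valid number, which is why coercion may be the more forgiving choice for messy files.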

The object columns suggest that the data contains some non-numeric values, so you need to_numeric with errors='coerce' to convert them to NaNs:

df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
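Before coercing, it can help to see which raw values are actually responsible for a column being read as object. A minimal sketch, assuming a made-up column with two invalid entries (the Series below is hypothetical, not taken from the asker's file):

```python
import pandas as pd

# Hypothetical column containing two entries that are not valid numbers
s = pd.Series(["2.628a", "3.264", "1.542", "2;;51"], name="WM1_u_ms")

converted = pd.to_numeric(s, errors="coerce")

# Entries that were present in the raw data but became NaN after coercion
bad_values = s[converted.isna() & s.notna()]
print(bad_values)
```

The same mask can be applied column by column to a whole DataFrame to check whether the bad entries are stray characters, sensor error codes, or something else worth handling explicitly.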

Here is an example using the sample data, with some strings added so that a few columns are read as object:

import pandas as pd

pd.options.display.max_columns = 20

temp=u""""TIMESTAMP","RECORD","WM1_u_ms","WM1_v_ms","WM1_w_ms","WM2_u_ms","WM2_v_ms","WM2_w_ms","WS1_u_ms","WS1_v_ms"
"2018-04-06 14:31:11.5",29699805,2.628a,4.629a,0.599s,3.908,7.971,0.47,2;;51,7.18
"2018-04-06 14:31:11.75",29699806,3.264,4.755,-0.095,2.961,6.094,-0.504,2.47,7.18
"2018-04-06 14:31:12",29699807,1.542,5.793,0.698,4.95,4.91,0.845,2.18,7.5
"2018-04-06 14:31:12.25",29699808,2.527,5.207,0.012,4.843,6.285,0.924,2.15,7.4
"2018-04-06 14:31:12.5",29699809,3.511,4.528,1.059,2.986,5.636,0.949,3.29,5.54
"2018-04-06 14:31:12.75",29699810,3.445,3.957,-0.075,3.127,6.561,0.259,3.85,5.45
"2018-04-06 14:31:13",29699811,2.624,5.238,-0.166,3.451,7.199,0.242,3.94,a"""
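# after testing, replace 'StringIO(temp)' with 'filename.csv'
# (pd.compat.StringIO was removed in later pandas versions; io.StringIO is the standard-library equivalent)
from io import StringIO
df = pd.read_csv(StringIO(temp), parse_dates=['TIMESTAMP'], index_col=['TIMESTAMP'])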

print (df)
                           RECORD WM1_u_ms WM1_v_ms WM1_w_ms  WM2_u_ms  \
TIMESTAMP                                                                
2018-04-06 14:31:11.500  29699805   2.628a   4.629a   0.599s     3.908   
2018-04-06 14:31:11.750  29699806    3.264    4.755   -0.095     2.961   
2018-04-06 14:31:12.000  29699807    1.542    5.793    0.698     4.950   
2018-04-06 14:31:12.250  29699808    2.527    5.207    0.012     4.843   
2018-04-06 14:31:12.500  29699809    3.511    4.528    1.059     2.986   
2018-04-06 14:31:12.750  29699810    3.445    3.957   -0.075     3.127   
2018-04-06 14:31:13.000  29699811    2.624    5.238   -0.166     3.451   

                         WM2_v_ms  WM2_w_ms WS1_u_ms WS1_v_ms  
TIMESTAMP                                                      
2018-04-06 14:31:11.500     7.971     0.470    2;;51     7.18  
2018-04-06 14:31:11.750     6.094    -0.504     2.47     7.18  
2018-04-06 14:31:12.000     4.910     0.845     2.18      7.5  
2018-04-06 14:31:12.250     6.285     0.924     2.15      7.4  
2018-04-06 14:31:12.500     5.636     0.949     3.29     5.54  
2018-04-06 14:31:12.750     6.561     0.259     3.85     5.45  
2018-04-06 14:31:13.000     7.199     0.242     3.94        a  

print (df.dtypes)
RECORD        int64
WM1_u_ms     object
WM1_v_ms     object
WM1_w_ms     object
WM2_u_ms    float64
WM2_v_ms    float64
WM2_w_ms    float64
WS1_u_ms     object
WS1_v_ms     object
dtype: object

print (df.index)
DatetimeIndex(['2018-04-06 14:31:11.500000', '2018-04-06 14:31:11.750000',
                      '2018-04-06 14:31:12', '2018-04-06 14:31:12.250000',
               '2018-04-06 14:31:12.500000', '2018-04-06 14:31:12.750000',
                      '2018-04-06 14:31:13'],
              dtype='datetime64[ns]', name='TIMESTAMP', freq=None)


df = df.drop(['RECORD'],axis=1)
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))

print (df)
                         WM1_u_ms  WM1_v_ms  WM1_w_ms  WM2_u_ms  WM2_v_ms  \
TIMESTAMP                                                                   
2018-04-06 14:31:11.500       NaN       NaN       NaN     3.908     7.971   
2018-04-06 14:31:11.750     3.264     4.755    -0.095     2.961     6.094   
2018-04-06 14:31:12.000     1.542     5.793     0.698     4.950     4.910   
2018-04-06 14:31:12.250     2.527     5.207     0.012     4.843     6.285   
2018-04-06 14:31:12.500     3.511     4.528     1.059     2.986     5.636   
2018-04-06 14:31:12.750     3.445     3.957    -0.075     3.127     6.561   
2018-04-06 14:31:13.000     2.624     5.238    -0.166     3.451     7.199   

                         WM2_w_ms  WS1_u_ms  WS1_v_ms  
TIMESTAMP                                              
2018-04-06 14:31:11.500     0.470       NaN      7.18  
2018-04-06 14:31:11.750    -0.504      2.47      7.18  
2018-04-06 14:31:12.000     0.845      2.18      7.50  
2018-04-06 14:31:12.250     0.924      2.15      7.40  
2018-04-06 14:31:12.500     0.949      3.29      5.54  
2018-04-06 14:31:12.750     0.259      3.85      5.45  
2018-04-06 14:31:13.000     0.242      3.94       NaN  

print (df.dtypes)
WM1_u_ms    float64
WM1_v_ms    float64
WM1_w_ms    float64
WM2_u_ms    float64
WM2_v_ms    float64
WM2_w_ms    float64
WS1_u_ms    float64
WS1_v_ms    float64
dtype: object