Remove rows with string values from a pandas DataFrame in Python 3.4.1

Time: 2014-10-27 08:01:02

Tags: python-3.x pandas

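I have read a CSV file with 8 columns using pandas read_csv. Each column may contain int / string / float values. I want to remove the rows that contain string values and get back a DataFrame that holds only numeric values. A sample of the CSV is included below.
I tried running the following code: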

import pandas as pd
import numpy as np  
df = pd.read_csv('new200_with_errors.csv',dtype={'Geo_Level_1' : int,'Geo_Level_2' : int,'Geo_Level_3' : int,'Product_Level_1' : int,'Product_Level_2' : int,'Product_Level_3' : int,'Total_Sale' : float})
print(df)

But I get the following error:

TypeError: unorderable types: NoneType() > int()

I am running Python 3.4.1. Here is the sample CSV.

Geo_L_1,Geo_L_2,Geo_L_3,Pro_L_1,Pro_L_2,Pro_L_3,Date,Sale
1, 2, 3, 129, 1, 5193316745, 1/1/2012, 9
1 ,2, 3, 129, 1, 5193316745, 1/1/2013,  
1, 2, 3, 129, 1, 5193316745, , 8
1, 2, 3, 129, NA, 5193316745, 1/10/2012, 10
1, 2, 3, 129, 1, 5193316745, 1/10/2013, 4
1, 2, 3, ghj, 1, 5193316745, 1/10/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/11/2012, 4
1, 2, 3, 129, 1, ghgj, 1/11/2013, 2
1, 2, 3, 129, 1, 5193316745, 1/11/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/12/2012, ghgj
1, 2, 3, 129, 1, 5193316745, 1/12/2013, 5

1 answer:

Answer 0: (score: 1)

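The way I approached this is to convert the columns to int with a user-defined function wrapped in try/except, so that any value that cannot be coerced to int is set to NaN. Then drop the row with the empty date. For some reason, when I tested this with your data the empty value actually had length 1; for you it may have length 0.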

In [42]:
# simple function to try to convert the type, returns NaN if the value cannot be coerced
def func(x):
    try:
        return int(x)
    except ValueError:
        return np.nan
# assign multiple columns 
df['Pro_L_1'], df['Pro_L_3'], df['Sale'] = df['Pro_L_1'].apply(func), df['Pro_L_3'].apply(func), df['Sale'].apply(func)
# drop the 'empty' date row, take a copy() so we don't get a warning
df = df.loc[df['Date'].str.len() > 1].copy()
# convert the string to a datetime, if we didn't drop the row it would set the empty row to today's date
df['Date']= pd.to_datetime(df['Date'])
# now convert all the dtypes that are numeric to a numeric dtype
df = df.convert_objects(convert_numeric=True)
# check the dtypes
df.dtypes

Out[42]:
Geo_L_1             int64
Geo_L_2             int64
Geo_L_3             int64
Pro_L_1           float64
Pro_L_2           float64
Pro_L_3           float64
Date       datetime64[ns]
Sale              float64
dtype: object
In [43]:
# display the current situation
df
Out[43]:
    Geo_L_1  Geo_L_2  Geo_L_3  Pro_L_1  Pro_L_2     Pro_L_3       Date  Sale
0         1        2        3      129        1  5193316745 2012-01-01     9
1         1        2        3      129        1  5193316745 2013-01-01   NaN
3         1        2        3      129      NaN  5193316745 2012-01-10    10
4         1        2        3      129        1  5193316745 2013-01-10     4
5         1        2        3      NaN        1  5193316745 2014-01-10     6
6         1        2        3      129        1  5193316745 2012-01-11     4
7         1        2        3      129        1         NaN 2013-01-11     2
8         1        2        3      129        1  5193316745 2014-01-11     6
9         1        2        3      129        1  5193316745 2012-01-12   NaN
10        1        2        3      129        1  5193316745 2013-01-12     5
In [44]:
# drop the rows
df.dropna()
Out[44]:
    Geo_L_1  Geo_L_2  Geo_L_3  Pro_L_1  Pro_L_2     Pro_L_3       Date  Sale
0         1        2        3      129        1  5193316745 2012-01-01     9
4         1        2        3      129        1  5193316745 2013-01-10     4
6         1        2        3      129        1  5193316745 2012-01-11     4
8         1        2        3      129        1  5193316745 2014-01-11     6
10        1        2        3      129        1  5193316745 2013-01-12     5

For the last line, assign the result back: df = df.dropna()
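
For anyone reading this with a newer pandas: `convert_objects`, used above, has since been removed from the library. Below is a minimal sketch of the same cleanup using `pd.to_numeric` and `pd.to_datetime` with `errors="coerce"`. It assumes a reasonably recent pandas and that the dates are month/day/year (which is how they were parsed in the output above). The names `SAMPLE`, `numeric_cols` and `clean` are only illustrative, and the CSV is embedded as a string so the snippet runs on its own.

```python
import io

import pandas as pd

# The sample CSV from the question, embedded so the sketch is self-contained.
SAMPLE = """Geo_L_1,Geo_L_2,Geo_L_3,Pro_L_1,Pro_L_2,Pro_L_3,Date,Sale
1, 2, 3, 129, 1, 5193316745, 1/1/2012, 9
1 ,2, 3, 129, 1, 5193316745, 1/1/2013,
1, 2, 3, 129, 1, 5193316745, , 8
1, 2, 3, 129, NA, 5193316745, 1/10/2012, 10
1, 2, 3, 129, 1, 5193316745, 1/10/2013, 4
1, 2, 3, ghj, 1, 5193316745, 1/10/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/11/2012, 4
1, 2, 3, 129, 1, ghgj, 1/11/2013, 2
1, 2, 3, 129, 1, 5193316745, 1/11/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/12/2012, ghgj
1, 2, 3, 129, 1, 5193316745, 1/12/2013, 5
"""

# Read every column as text so no type inference happens at load time.
df = pd.read_csv(io.StringIO(SAMPLE), dtype=str)

numeric_cols = [c for c in df.columns if c != "Date"]
for col in numeric_cols:
    # Strip stray spaces, then turn anything non-numeric into NaN.
    df[col] = pd.to_numeric(df[col].str.strip(), errors="coerce")

# Assumption: month/day/year dates; unparseable or empty values become NaT.
df["Date"] = pd.to_datetime(
    df["Date"].str.strip(), format="%m/%d/%Y", errors="coerce"
)

# dropna() returns a new DataFrame, so the result has to be assigned.
clean = df.dropna()
print(clean)
```

Reading everything as text first avoids the dtype clash at load time; the coercion step then marks every bad cell as missing, and a single `dropna()` removes the affected rows.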