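I have read a CSV file with pandas read_csv; it has 8 columns. Each column can contain int / string / float values. I want to remove the rows that contain string values and get back a DataFrame that holds only numeric values. A sample of the CSV is attached.
I tried to run the following code: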
import pandas as pd
import numpy as np
df = pd.read_csv('new200_with_errors.csv',dtype={'Geo_Level_1' : int,'Geo_Level_2' : int,'Geo_Level_3' : int,'Product_Level_1' : int,'Product_Level_2' : int,'Product_Level_3' : int,'Total_Sale' : float})
print(df)
But I get the following error:
TypeError: unorderable types: NoneType() > int()
I am running Python 3.4.1. Here is the sample CSV.
Geo_L_1,Geo_L_2,Geo_L_3,Pro_L_1,Pro_L_2,Pro_L_3,Date,Sale
1, 2, 3, 129, 1, 5193316745, 1/1/2012, 9
1 ,2, 3, 129, 1, 5193316745, 1/1/2013,
1, 2, 3, 129, 1, 5193316745, , 8
1, 2, 3, 129, NA, 5193316745, 1/10/2012, 10
1, 2, 3, 129, 1, 5193316745, 1/10/2013, 4
1, 2, 3, ghj, 1, 5193316745, 1/10/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/11/2012, 4
1, 2, 3, 129, 1, ghgj, 1/11/2013, 2
1, 2, 3, 129, 1, 5193316745, 1/11/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/12/2012, ghgj
1, 2, 3, 129, 1, 5193316745, 1/12/2013, 5
Answer 0 (score: 1)
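The way I'd approach this is to convert the columns to int with a user function that uses try / except to handle the values that cannot be coerced to int; those are set to NaN. Then drop the rows where you have an empty value. For some reason the empty Date value actually had length 1 when I tested this with your data; checking for len 0 may work for you.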
In [42]:
# simple function to try to convert the type, returns NaN if the value cannot be coerced
def func(x):
    try:
        return int(x)
    except ValueError:
        return np.nan
# assign multiple columns
df['Pro_L_1'], df['Pro_L_3'], df['Sale'] = df['Pro_L_1'].apply(func), df['Pro_L_3'].apply(func), df['Sale'].apply(func)
# drop the 'empty' date row, take a copy() so we don't get a warning
df = df.loc[df['Date'].str.len() > 1].copy()
# convert the string to a datetime, if we didn't drop the row it would set the empty row to today's date
df['Date']= pd.to_datetime(df['Date'])
# now convert all the dtypes that are numeric to a numeric dtype
# (convert_objects was removed in pandas 1.0; on newer versions use pd.to_numeric per column)
df = df.convert_objects(convert_numeric=True)
# check the dtypes
df.dtypes
Out[42]:
Geo_L_1             int64
Geo_L_2             int64
Geo_L_3             int64
Pro_L_1           float64
Pro_L_2           float64
Pro_L_3           float64
Date       datetime64[ns]
Sale              float64
dtype: object
In [43]:
# display the current situation
df
Out[43]:
    Geo_L_1  Geo_L_2  Geo_L_3  Pro_L_1  Pro_L_2     Pro_L_3        Date  Sale
0         1        2        3      129        1  5193316745  2012-01-01     9
1         1        2        3      129        1  5193316745  2013-01-01   NaN
3         1        2        3      129      NaN  5193316745  2012-01-10    10
4         1        2        3      129        1  5193316745  2013-01-10     4
5         1        2        3      NaN        1  5193316745  2014-01-10     6
6         1        2        3      129        1  5193316745  2012-01-11     4
7         1        2        3      129        1         NaN  2013-01-11     2
8         1        2        3      129        1  5193316745  2014-01-11     6
9         1        2        3      129        1  5193316745  2012-01-12   NaN
10        1        2        3      129        1  5193316745  2013-01-12     5
In [44]:
# drop the rows
df.dropna()
Out[44]:
    Geo_L_1  Geo_L_2  Geo_L_3  Pro_L_1  Pro_L_2     Pro_L_3        Date  Sale
0         1        2        3      129        1  5193316745  2012-01-01     9
4         1        2        3      129        1  5193316745  2013-01-10     4
6         1        2        3      129        1  5193316745  2012-01-11     4
8         1        2        3      129        1  5193316745  2014-01-11     6
10        1        2        3      129        1  5193316745  2013-01-12     5
For the last step, assign the result back: df = df.dropna()
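For readers on a newer pandas, here is a minimal sketch of the same clean-up using pd.to_numeric and pd.to_datetime with errors='coerce' instead of a hand-written conversion function. It is not the code from the answer above. The helper name keep_numeric_rows is made up for illustration, the sample is inlined through io.StringIO rather than read from the original file, and the %m/%d/%Y date format is an assumption based on how the dates were parsed in the output above. Note that with read_csv defaults the blank Date field in the sample is read as a single space, which is the likely reason its length was 1; the sketch strips whitespace and lets the coercion turn it into NaT.

```python
import io

import pandas as pd

# Sample rows copied from the question.
csv_text = """Geo_L_1,Geo_L_2,Geo_L_3,Pro_L_1,Pro_L_2,Pro_L_3,Date,Sale
1, 2, 3, 129, 1, 5193316745, 1/1/2012, 9
1 ,2, 3, 129, 1, 5193316745, 1/1/2013,
1, 2, 3, 129, 1, 5193316745, , 8
1, 2, 3, 129, NA, 5193316745, 1/10/2012, 10
1, 2, 3, 129, 1, 5193316745, 1/10/2013, 4
1, 2, 3, ghj, 1, 5193316745, 1/10/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/11/2012, 4
1, 2, 3, 129, 1, ghgj, 1/11/2013, 2
1, 2, 3, 129, 1, 5193316745, 1/11/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/12/2012, ghgj
1, 2, 3, 129, 1, 5193316745, 1/12/2013, 5
"""


def keep_numeric_rows(source):
    """Hypothetical helper: keep only rows whose fields all parse cleanly."""
    # Read every field as a plain Python object so nothing is converted early.
    raw = pd.read_csv(source, dtype=object, skipinitialspace=True)
    # Trim stray whitespace, e.g. the "1 " in the second data row.
    raw = raw.apply(lambda col: col.str.strip())
    # Anything that is not a number becomes NaN instead of raising.
    out = raw.drop(columns="Date").apply(pd.to_numeric, errors="coerce")
    # Assumed month/day/year format; unparseable or blank dates become NaT.
    out["Date"] = pd.to_datetime(raw["Date"], format="%m/%d/%Y", errors="coerce")
    # Drop every row with a NaN/NaT and restore the original column order.
    return out.dropna()[raw.columns]


clean = keep_numeric_rows(io.StringIO(csv_text))
print(clean)
```

On the sample data this should leave the same five rows as the dropna() output above (index 0, 4, 6, 8 and 10). The columns that contained bad values stay float64 after the drop; cast them with astype(int) if integer dtypes are needed.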