Error while converting a column of a pandas DataFrame (Python 3)

Time: 2017-04-12 10:57:44

Tags: python-3.x pandas valueerror

I have a big problem with pandas. I have a large DataFrame containing:
Ref_id   PRICE    YEAR  MONTH BRAND
100000   '5000'  '2012' '4'   'FORD'
100001   '10000' '2015' '5'   'MERCEDES'
...

I want to convert my PRICE, YEAR and MONTH columns, but when I use .astype(int) or .apply(lambda x: int(x)) on a column, I get a ValueError. My DataFrame has 1.8 million rows.

ValueError: invalid literal for int() with base 10: 'PRICE'

So I don't understand why pandas is trying to convert the name of the column.

Can you explain why?

Best,

C.

1 answer:

Answer 0 (score: 2)

Try this:

In [59]: cols = 'PRICE  YEAR  MONTH'.split()

In [60]: cols
Out[60]: ['PRICE', 'YEAR', 'MONTH']

In [61]: for c in cols:
    ...:     df[c] = pd.to_numeric(df[c], errors='coerce')
    ...:

In [62]: df
Out[62]:
   Ref_id    PRICE  YEAR  MONTH     BRAND
0  100000   5000.0  2012      4      FORD
1  100001  10000.0  2015      5  MERCEDES
2  100002      NaN  2016      6      AUDI
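
To make the effect of errors='coerce' concrete, here is a minimal sketch on a made-up Series (the sample values are an assumption, chosen to mirror the question). It contrasts the default behaviour, which raises, with coercion, which replaces unparseable values with NaN and therefore yields a float column.

```python
import pandas as pd

# Hypothetical sample: numbers stored as strings plus one stray value.
s = pd.Series(['5000', '10000', 'PRICE'])

# Default behaviour (errors='raise') fails on the stray value.
try:
    pd.to_numeric(s)
    raised = False
except ValueError:
    raised = True

# errors='coerce' turns anything unparseable into NaN instead of raising.
# Because NaN is a float, the resulting dtype is float64, not int64.
converted = pd.to_numeric(s, errors='coerce')

print(raised)
print(converted)
```

This is why PRICE appears as 5000.0 and 10000.0 in the output above while YEAR and MONTH, which had no bad values, stay integers.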

Reproducing the error:

In [65]: df
Out[65]:
   Ref_id  PRICE  YEAR  MONTH     BRAND
0  100000   5000  2012      4      FORD
1  100001  10000  2015      5  MERCEDES
2  100002  PRICE  2016      6      AUDI  # pay attention to the `PRICE` value!

In [66]: df['PRICE'].astype(int)
...
skipped
...
ValueError: invalid literal for int() with base 10: 'PRICE'
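
With 1.8 million rows it helps to find the offending rows before deciding how to clean them. The following is one possible sketch, not the answerer's method: it reuses pd.to_numeric with errors='coerce' purely as a detector. The sample frame and the names parsed and bad_rows are illustrative assumptions.

```python
import pandas as pd

# Hypothetical frame reproducing the situation above.
df = pd.DataFrame({
    'Ref_id': [100000, 100001, 100002],
    'PRICE': ['5000', '10000', 'PRICE'],
    'YEAR': ['2012', '2015', '2016'],
})

# Values that fail to parse become NaN after coercion.
parsed = pd.to_numeric(df['PRICE'], errors='coerce')

# Keep rows that failed to parse but were not missing to begin with.
bad_rows = df[parsed.isna() & df['PRICE'].notna()]

print(bad_rows)
```

Inspecting bad_rows (or bad_rows['PRICE'].unique()) shows exactly which strings break the int() conversion.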

As @jezrael noted in a comment, your dataset may contain "bad" (unexpected) values.

You can clean it up using one of the following techniques:

In [155]: df
Out[155]:
   Ref_id  PRICE  YEAR  MONTH     BRAND
0  100000   5000  2012      4      FORD
1  100001  10000  2015      5  MERCEDES
2  Ref_id  PRICE  YEAR  MONTH     BRAND
3  100002  15000  2016      5      AUDI

In [156]: df.dtypes
Out[156]:
Ref_id    object
PRICE     object
YEAR      object
MONTH     object
BRAND     object
dtype: object

In [157]: df = df.drop(df.loc[df.PRICE == 'PRICE'].index)

In [158]: df
Out[158]:
   Ref_id  PRICE  YEAR MONTH     BRAND
0  100000   5000  2012     4      FORD
1  100001  10000  2015     5  MERCEDES
3  100002  15000  2016     5      AUDI

In [159]: for c in cols:
     ...:     df[c] = pd.to_numeric(df[c], errors='coerce')
     ...:

In [160]: df
Out[160]:
   Ref_id  PRICE  YEAR  MONTH     BRAND
0  100000   5000  2012      4      FORD
1  100001  10000  2015      5  MERCEDES
3  100002  15000  2016      5      AUDI

In [161]: df.dtypes
Out[161]:
Ref_id    object
PRICE      int64
YEAR       int64
MONTH      int64
BRAND     object
dtype: object
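
The same cleanup can also be written with a boolean mask instead of drop(). This is an equivalent sketch under the assumption that the only bad rows are repeated header lines; the frame below is made up to match the example, and cleaned is an illustrative name.

```python
import pandas as pd

# Hypothetical frame with a header line repeated in the data.
df = pd.DataFrame({
    'Ref_id': ['100000', '100001', 'Ref_id', '100002'],
    'PRICE': ['5000', '10000', 'PRICE', '15000'],
    'YEAR': ['2012', '2015', 'YEAR', '2016'],
    'MONTH': ['4', '5', 'MONTH', '5'],
    'BRAND': ['FORD', 'MERCEDES', 'BRAND', 'AUDI'],
})
cols = ['PRICE', 'YEAR', 'MONTH']

# Keep only rows whose PRICE is not the literal header text.
cleaned = df[df['PRICE'] != 'PRICE']

# With the bad rows gone, a plain integer cast succeeds.
cleaned = cleaned.astype({c: int for c in cols})

print(cleaned.dtypes)
```

Once the stray rows are filtered out, the original .astype(int) from the question works and the columns never pass through float.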

Or simply:

In [159]: for c in cols:
     ...:     df[c] = pd.to_numeric(df[c], errors='coerce')
     ...:

In [165]: df
Out[165]:
   Ref_id    PRICE    YEAR  MONTH     BRAND
0  100000   5000.0  2012.0    4.0      FORD
1  100001  10000.0  2015.0    5.0  MERCEDES
2  Ref_id      NaN     NaN    NaN     BRAND
3  100002  15000.0  2016.0    5.0      AUDI

Then call .dropna(how='any'), provided you know the original dataset contained no NaN values:

In [166]: df = df.dropna(how='any')

In [167]: df
Out[167]:
   Ref_id    PRICE    YEAR  MONTH     BRAND
0  100000   5000.0  2012.0    4.0      FORD
1  100001  10000.0  2015.0    5.0  MERCEDES
3  100002  15000.0  2016.0    5.0      AUDI
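
Note that the columns are still float64 after this route, because they held NaN at one point. If integer columns are wanted, a final cast can follow. The sketch below is an assumption-labelled variant: it limits dropna to the converted columns through the subset parameter, so that missing values elsewhere do not cause rows to be dropped, and the sample frame is made up.

```python
import pandas as pd

# Hypothetical frame with a header line repeated in the data.
df = pd.DataFrame({
    'Ref_id': ['100000', '100001', 'Ref_id', '100002'],
    'PRICE': ['5000', '10000', 'PRICE', '15000'],
    'YEAR': ['2012', '2015', 'YEAR', '2016'],
    'MONTH': ['4', '5', 'MONTH', '5'],
    'BRAND': ['FORD', 'MERCEDES', 'BRAND', 'AUDI'],
})
cols = ['PRICE', 'YEAR', 'MONTH']

# Coerce bad values to NaN (columns become float64).
for c in cols:
    df[c] = pd.to_numeric(df[c], errors='coerce')

# Drop rows where any converted column is NaN, ignoring other columns.
df = df.dropna(subset=cols)

# No NaN remains in these columns, so casting back to int64 is safe.
df = df.astype({c: 'int64' for c in cols})

print(df.dtypes)
```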