pandas ValueError with a boolean dtype in an "if, else" statement

Asked: 2014-05-24 03:59:08

Tags: python pandas boolean conditional mask

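I am trying to compare the values in one column ['Record Number'] across consecutive rows. Later I (hopefully) want to concatenate the strings in another column ['Desc'] into a single row for consecutive rows that share a record number, and then drop the duplicates.

Anyway, the following "if" statement does not seem to like the boolean mask, because even when I use a.bool() as it asks, it throws the same error:

"ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."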

import pandas

with open('all.csv') as inc:
    indf = pandas.read_csv(inc, usecols=['Record Number', 'Service Date'], parse_dates=True)
    indf['Service Date'] = pandas.to_datetime(indf['Service Date'])
    indf.sort(['Service Date', 'Record Number'], inplace=True)
    indf['NUM'] = indf['Record Number'].shift(1)
    msk = indf['NUM'] == indf['Record Number']
    indf['MASK'] = msk
    print(indf)
    print(msk)
    for row in indf:
        if row['MASK'] == False:
        #if row['MASK'].bool() == False: ### this gives the same error
            print('Unique.')
        else:
            print('Dupe.')

How can I solve this?

Edit: I fixed my typo (if indf['MASK'] is now if row['MASK']), but now I am getting...

if row['MASK'] == False:
TypeError: string indices must be integers

if row[4] == False:
IndexError: string index out of range

Why isn't 'MASK' allowed? Why is it complaining about strings? 'MASK' is boolean.

Record Number             int64
Service Date     datetime64[ns]
NUM                     float64
MASK                       bool

Sample data:

Record Number,Service Date,Desc
746611,05/26/2014,jiber
361783,05/27/2014,manawyddan
231485,06/02/2014,montespan
254004,06/03/2014,peshawar
369750,06/09/2014,cochleate
757701,06/10/2014,verticity
586983,06/16/2014,psychotherapist
643669,06/17/2014,discreation
252213,06/23/2014,hemiacetal
863001,06/24/2014,jiber
563798,06/30/2014,manawyddan
229226,07/01/2014,montespan
772189,07/07/2014,peshawar
412939,07/08/2014,cochleate
230209,07/14/2014,verticity
723012,07/15/2014,psychotherapist
455138,07/21/2014,discreation
605876,07/22/2014,hemiacetal
565893,07/28/2014,jiber
760420,07/29/2014,manawyddan
667002,05/27/2014,montespan
676209,06/17/2014,peshawar
828060,06/24/2014,cochleate
582821,07/01/2014,verticity
275503,07/15/2014,psychotherapist
667002,05/26/2014,discreation
676209,06/02/2014,hemiacetal
828060,06/09/2014,jiber
667002,06/10/2014,manawyddan
676209,06/17/2014,montespan
828060,06/23/2014,peshawar
667002,06/24/2014,cochleate
676209,06/30/2014,verticity
828060,07/21/2014,psychotherapist
667002,07/28/2014,discreation
676209,05/27/2014,hemiacetal
828060,06/03/2014,jiber
667002,06/10/2014,manawyddan
676209,06/16/2014,montespan
828060,06/24/2014,peshawar
667002,07/01/2014,cochleate
676209,07/07/2014,verticity
828060,07/28/2014,psychotherapist
667002,07/29/2014,discreation
828060,06/09/2014,hemiacetal
667002,06/10/2014,jiber
676209,06/17/2014,manawyddan
828060,06/23/2014,montespan
667002,06/24/2014,peshawar
676209,06/30/2014,cochleate
828060,07/21/2014,verticity
828060,06/09/2014,psychotherapist
667002,06/10/2014,discreation
676209,06/17/2014,hemiacetal
828060,06/23/2014,jiber
667002,06/24/2014,manawyddan
676209,06/30/2014,montespan

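1 Answer:

Answer 0 (score: 0)

Edit: The problem (apart from the typo discussed below) is how you are iterating over the DataFrame. Iterating over it directly iterates over the column names: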

In [21]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=list('abc'))

In [22]: for col in df: print(col)  # what you're doing
a
b
c
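
This is also where the TypeError in the question comes from. The following is a minimal sketch, assuming a made-up two-row frame in place of the asker's data: each item produced by the loop is a column label (a plain str), so row['MASK'] ends up indexing a string with a string.

```python
import pandas as pd

# Hypothetical miniature of the asker's frame; the values are illustrative only.
indf = pd.DataFrame({'Record Number': [667002, 667002], 'MASK': [False, True]})

# Iterating a DataFrame yields its column labels, not its rows.
labels = [label for label in indf]

# Indexing a str with another str is what raises the TypeError from the question.
try:
    labels[0]['MASK']
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
```

The same reasoning likely explains the IndexError for row[4]: a short label such as 'NUM' has no character at position 4.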

You want to iterate over the rows, so use iterrows:

In [23]: list(df.iterrows())  # tuples of (index, row)
Out[23]:
[(0, a    1
     b    2
     c    3
     Name: 0, dtype: int64),
 (1, a    4
     b    5
     c    6
     Name: 1, dtype: int64)]

In [24]: for i, row in df.iterrows(): print(row['b'])
2
5

It looks like this was a typo: indf['MASK'] (a Series) should read row['MASK'] (a single value). Your code should work fine on that value.
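
Putting the two fixes together, the loop from the question might look like the sketch below. It is an illustration under stated assumptions, not the asker's final code: the three-row frame is invented, sort_values stands in for the older DataFrame.sort call, and the sort puts 'Record Number' first so that equal record numbers end up on adjacent rows.

```python
import pandas as pd

# Hypothetical sample: three rows, two of which share a record number.
indf = pd.DataFrame({
    'Record Number': [667002, 667002, 676209],
    'Service Date': pd.to_datetime(['2014-05-26', '2014-05-27', '2014-06-02']),
})

# Assumption: sort by record number first so duplicates become consecutive.
indf = indf.sort_values(['Record Number', 'Service Date'])

# True where a row has the same record number as the row before it.
indf['MASK'] = indf['Record Number'].shift(1) == indf['Record Number']

labels = []
for _, row in indf.iterrows():
    # row['MASK'] is a single boolean here, so a plain truth test works.
    labels.append('Dupe.' if row['MASK'] else 'Unique.')

print(labels)
```

The first row compares against NaN from shift(1), so it is always reported as unique.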

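As the exception message says, the truthiness of a boolean Series is ambiguous (see some of the discussion on the mailing list; it has also been the subject of several GitHub issues).

The basic problem is an inconsistency between Python and NumPy, which leads to surprises: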

In [11]: bool([False])
Out[11]: True

In [12]: bool(np.array([False]))
Out[12]: False

And in NumPy it depends on the length of the array:

In [21]: bool(np.array([False, True]))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Rather than picking a side, pandas picks neither: everyone is unhappy, but everyone ends up writing correct code.
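
In practice, "correct code" means stating which reduction you want. A short sketch, using a made-up three-element mask rather than the asker's column:

```python
import pandas as pd

# Hypothetical boolean mask standing in for indf['MASK'].
mask = pd.Series([False, True, False])

# bool(mask) is ambiguous, so pandas refuses to guess and raises ValueError.
try:
    bool(mask)
    raised = False
except ValueError:
    raised = True

any_true = mask.any()    # is at least one element True?
all_true = mask.all()    # is every element True?
dupes = int(mask.sum())  # how many elements are True?
```

Which reduction is right depends on the question being asked of the data; for a per-row decision, test the single value from each row instead of the whole Series.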