Question

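I'm trying to compare the values in one column ['Record Number'] across consecutive rows. Later I (hopefully) want to concatenate the strings in another column ['Desc'] into a single row for consecutive rows that share a record number, and then remove the duplicates.

Anyway, the following "if" statement doesn't seem to like the boolean mask, because even when I use a.bool() as it asks, it throws the same error:

"ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."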

import pandas

with open('all.csv') as inc:
    indf = pandas.read_csv(inc, usecols=['Record Number', 'Service Date'], parse_dates=True)
    indf['Service Date'] = pandas.to_datetime(indf['Service Date'])
    indf.sort(['Service Date', 'Record Number'], inplace=True)
    indf['NUM'] = indf['Record Number'].shift(1)
    msk = indf['NUM'] == indf['Record Number']
    indf['MASK'] = msk
    print(indf)
    print(msk)
    for row in indf:
        if row['MASK'] == False:
        #if row['MASK'].bool() == False: ### this gives the same error
            print('Unique.')
        else:
            print('Dupe.')

How can I fix this?

Edit: fixed my typo (if ~~indf~~ row['MASK']), but now I'm getting...

if row['MASK'] == False:
TypeError: string indices must be integers

and

if row[4] == False:
IndexError: string index out of range

Why isn't 'MASK' allowed? Why is it bothering with strings at all? 'MASK' is boolean.

Record Number             int64
Service Date     datetime64[ns]
NUM                     float64
MASK                       bool

Sample data:

Record Number,Service Date,Desc
746611,05/26/2014,jiber
361783,05/27/2014,manawyddan
231485,06/02/2014,montespan
254004,06/03/2014,peshawar
369750,06/09/2014,cochleate
757701,06/10/2014,verticity
586983,06/16/2014,psychotherapist
643669,06/17/2014,discreation
252213,06/23/2014,hemiacetal
863001,06/24/2014,jiber
563798,06/30/2014,manawyddan
229226,07/01/2014,montespan
772189,07/07/2014,peshawar
412939,07/08/2014,cochleate
230209,07/14/2014,verticity
723012,07/15/2014,psychotherapist
455138,07/21/2014,discreation
605876,07/22/2014,hemiacetal
565893,07/28/2014,jiber
760420,07/29/2014,manawyddan
667002,05/27/2014,montespan
676209,06/17/2014,peshawar
828060,06/24/2014,cochleate
582821,07/01/2014,verticity
275503,07/15/2014,psychotherapist
667002,05/26/2014,discreation
676209,06/02/2014,hemiacetal
828060,06/09/2014,jiber
667002,06/10/2014,manawyddan
676209,06/17/2014,montespan
828060,06/23/2014,peshawar
667002,06/24/2014,cochleate
676209,06/30/2014,verticity
828060,07/21/2014,psychotherapist
667002,07/28/2014,discreation
676209,05/27/2014,hemiacetal
828060,06/03/2014,jiber
667002,06/10/2014,manawyddan
676209,06/16/2014,montespan
828060,06/24/2014,peshawar
667002,07/01/2014,cochleate
676209,07/07/2014,verticity
828060,07/28/2014,psychotherapist
667002,07/29/2014,discreation
828060,06/09/2014,hemiacetal
667002,06/10/2014,jiber
676209,06/17/2014,manawyddan
828060,06/23/2014,montespan
667002,06/24/2014,peshawar
676209,06/30/2014,cochleate
828060,07/21/2014,verticity
828060,06/09/2014,psychotherapist
667002,06/10/2014,discreation
676209,06/17/2014,hemiacetal
828060,06/23/2014,jiber
667002,06/24/2014,manawyddan
676209,06/30/2014,montespan

Answer 1

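Edit: the problem (apart from the typo discussed below) is how you're iterating over the DataFrame. Iterating over it directly iterates over the column names, so each "row" in your loop is actually a column-name string, which is why row['MASK'] raises "TypeError: string indices must be integers":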

In [21]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=list('abc'))

In [22]: for col in df: print(col)  # what you're doing
a
b
c

You want to iterate over the rows, so use iterrows:

In [23]: list(df.iterrows())  # tuples of (index, row)
Out[23]:
[(0, a    1
     b    2
     c    3
     Name: 0, dtype: int64),
 (1, a    4
     b    5
     c    6
     Name: 1, dtype: int64)]

In [24]: for i, row in df.iterrows(): print(row['b'])
2
5
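Applied to the question, the loop might look like the sketch below. The small DataFrame is made-up illustrative data (not the asker's file), and the LABEL column name is hypothetical. It also shows np.where as a vectorized alternative that avoids the Python-level loop altogether.

```python
import numpy as np
import pandas as pd

# Illustrative data only: a few record numbers, some repeated on consecutive rows.
df = pd.DataFrame({
    "Record Number": [667002, 667002, 676209, 828060, 828060],
    "Desc": ["discreation", "manawyddan", "hemiacetal", "jiber", "peshawar"],
})

# True where a row has the same record number as the row before it.
df["MASK"] = df["Record Number"] == df["Record Number"].shift(1)

# Row-wise loop: iterrows yields (index, row) pairs, so row["MASK"] is a single value.
labels = []
for i, row in df.iterrows():
    labels.append("Dupe." if row["MASK"] else "Unique.")

# Vectorized alternative: build the labels straight from the boolean column.
df["LABEL"] = np.where(df["MASK"], "Dupe.", "Unique.")

print(labels)
print(df)
```

For large frames the vectorized form is usually much faster than iterrows, since it never builds a Series object per row.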

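It looks like this is a typo: indf['MASK'] (a Series) should read row['MASK'] (a single value). Your code should work fine on that value.

As the exception message says, the truthiness of a boolean Series is ambiguous (see some discussion on the mailing list; it has also been the subject of several GitHub issues).

The basic issue is that Python and numpy are inconsistent here (which leads to surprises):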

In [11]: bool([False])
Out[11]: True

In [12]: bool(np.array([False]))
Out[12]: False

and in numpy this depends on the length of the array:

In [21]: bool(np.array([False, True]))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Rather than pick a side, pandas picks neither, which leaves everyone unhappy but keeps code correct.
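In practice this means stating explicitly which reduction you want whenever a whole Series ends up in an "if". A minimal sketch, using a made-up three-element mask:

```python
import pandas as pd

mask = pd.Series([False, True, False])

# Using the Series itself as a condition raises the ValueError from the question.
message = None
try:
    if mask:
        pass
except ValueError as exc:
    message = str(exc)

# Spell out the intended meaning instead.
any_true = bool(mask.any())   # is at least one element True?
all_true = bool(mask.all())   # is every element True?
is_empty = mask.empty         # does the Series have no elements?

print(message)
print(any_true, all_true, is_empty)
```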

pandas ValueError boolean dtype in "if, else" statement