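I am trying to compare the values in one column, ['Record Number'], across consecutive rows. Later I hope to concatenate the strings from another column, ['Desc'], into a single row for each run of consecutive record numbers, and then delete the duplicates.
In any case, the following "if" statement does not seem to accept the boolean mask: even when I use a.bool() as the message suggests, it throws the same error:
"ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()."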
import pandas
with open('all.csv') as inc:
    indf = pandas.read_csv(inc, usecols=['Record Number', 'Service Date'], parse_dates=True)
indf['Service Date'] = pandas.to_datetime(indf['Service Date'])
indf.sort(['Service Date', 'Record Number'], inplace=True)
indf['NUM'] = indf['Record Number'].shift(1)
msk = indf['NUM'] == indf['Record Number']
indf['MASK'] = msk
print(indf)
print(msk)
for row in indf:
    if row['MASK'] == False:
    #if row['MASK'].bool() == False: ### this gives the same error
        print('Unique.')
    else:
        print('Dupe.')
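How can I fix this?
EDIT: I fixed my typo (I had written if indf['MASK'] instead of if row['MASK']), but now I get...
if row['MASK'] == False:
TypeError: string indices must be integers
and
if row[4] == False:
IndexError: string index out of range
Why is 'MASK' not allowed? Why is it treating anything as a string? 'MASK' is boolean.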
Record Number             int64
Service Date     datetime64[ns]
NUM                     float64
MASK                       bool
Sample data:
Record Number,Service Date,Desc
746611,05/26/2014,jiber
361783,05/27/2014,manawyddan
231485,06/02/2014,montespan
254004,06/03/2014,peshawar
369750,06/09/2014,cochleate
757701,06/10/2014,verticity
586983,06/16/2014,psychotherapist
643669,06/17/2014,discreation
252213,06/23/2014,hemiacetal
863001,06/24/2014,jiber
563798,06/30/2014,manawyddan
229226,07/01/2014,montespan
772189,07/07/2014,peshawar
412939,07/08/2014,cochleate
230209,07/14/2014,verticity
723012,07/15/2014,psychotherapist
455138,07/21/2014,discreation
605876,07/22/2014,hemiacetal
565893,07/28/2014,jiber
760420,07/29/2014,manawyddan
667002,05/27/2014,montespan
676209,06/17/2014,peshawar
828060,06/24/2014,cochleate
582821,07/01/2014,verticity
275503,07/15/2014,psychotherapist
667002,05/26/2014,discreation
676209,06/02/2014,hemiacetal
828060,06/09/2014,jiber
667002,06/10/2014,manawyddan
676209,06/17/2014,montespan
828060,06/23/2014,peshawar
667002,06/24/2014,cochleate
676209,06/30/2014,verticity
828060,07/21/2014,psychotherapist
667002,07/28/2014,discreation
676209,05/27/2014,hemiacetal
828060,06/03/2014,jiber
667002,06/10/2014,manawyddan
676209,06/16/2014,montespan
828060,06/24/2014,peshawar
667002,07/01/2014,cochleate
676209,07/07/2014,verticity
828060,07/28/2014,psychotherapist
667002,07/29/2014,discreation
828060,06/09/2014,hemiacetal
667002,06/10/2014,jiber
676209,06/17/2014,manawyddan
828060,06/23/2014,montespan
667002,06/24/2014,peshawar
676209,06/30/2014,cochleate
828060,07/21/2014,verticity
828060,06/09/2014,psychotherapist
667002,06/10/2014,discreation
676209,06/17/2014,hemiacetal
828060,06/23/2014,jiber
667002,06/24/2014,manawyddan
676209,06/30/2014,montespan
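Answer 0 (score: 0)
EDIT: The problem (apart from the typo discussed below) is how you iterate over the DataFrame. Iterating over a DataFrame directly iterates over its column names: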
In [21]: df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=list('abc'))
In [22]: for col in df: print(col)  # what you're doing
a
b
c
You want to iterate over the rows, so use iterrows:
In [23]: list(df.iterrows())  # tuples of (index, row)
Out[23]:
[(0, a    1
b    2
c    3
Name: 0, dtype: int64),
 (1, a    4
b    5
c    6
Name: 1, dtype: int64)]
In [24]: for i, row in df.iterrows(): print(row['b'])
2
5
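Applied to the question, the same pattern might look like the sketch below. The five-row frame is a hypothetical stand-in for all.csv, and the labels list is only there to make the result easy to inspect; the column names follow the question.

```python
import pandas as pd

# Hypothetical miniature of the question's data (not the real all.csv).
indf = pd.DataFrame({
    "Record Number": [667002, 667002, 676209, 828060, 828060],
    "Service Date": pd.to_datetime(
        ["05/26/2014", "05/27/2014", "06/02/2014", "06/09/2014", "06/09/2014"]
    ),
})

# True where a row repeats the record number of the row directly above it.
indf["MASK"] = indf["Record Number"] == indf["Record Number"].shift(1)

labels = []
for i, row in indf.iterrows():
    # row["MASK"] is a single boolean here, so the test is unambiguous.
    labels.append("Dupe." if row["MASK"] else "Unique.")

print(labels)
```

The first row is compared against a missing value produced by shift(1), so it is never flagged as a duplicate.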
It looks like this is a typo: indf['MASK'] (a Series) should read row['MASK'] (a single value). Your code should work fine on that value.
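If the loop exists only to label each row, the mask can probably be used as a whole without iterating at all. The following is a minimal sketch, assuming a small hypothetical Series of record numbers and using numpy.where to pick a label per element; it is one possible approach, not necessarily what the asker needs.

```python
import numpy as np
import pandas as pd

# Hypothetical record numbers, already sorted so that duplicates are adjacent.
records = pd.Series([667002, 667002, 676209, 828060, 828060], name="Record Number")

# Element-wise comparison with the previous row gives a boolean Series.
mask = records == records.shift(1)

# np.where chooses a label for every element of the mask in one step.
labels = np.where(mask, "Dupe.", "Unique.")

print(list(labels))
```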
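As the exception message says, the truthiness of a boolean Series is ambiguous (see some discussion on the mailing list; it has also been the subject of several GitHub issues).
The basic problem is that Python and NumPy are inconsistent with each other, which leads to surprises: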
In [11]: bool([False])
Out[11]: True
In [12]: bool(np.array([False]))
Out[12]: False
and in NumPy the result also depends on the length of the array:
In [21]: bool(np.array([False, True]))
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Rather than picking a side, pandas picks neither, which leaves everyone unhappy but forces you to write correct code.
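In practice this means stating explicitly which reduction you mean. A small sketch, using a made-up three-element mask:

```python
import pandas as pd

# Made-up boolean Series for illustration.
mask = pd.Series([False, True, False])

# Asking for the truth value of the whole Series raises ValueError.
try:
    bool(mask)
    raised = False
except ValueError:
    raised = True

# Explicit reductions make the intent clear.
any_true = bool(mask.any())  # is at least one element True?
all_true = bool(mask.all())  # are all elements True?

print(raised, any_true, all_true)
```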