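I have a DataFrame sorted by date and then by record number (df.sort_values(['Service Date', 'Record Number'])).
I want the 'Desc' values of all consecutive duplicates to end up joined into the first instance of the duplicate. Here is my code so far. I tried .shift(1) inside the for loop, but to no avail: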
import pandas
with open('ALL.CSV') as inc:
    indf = pandas.read_csv(inc, usecols=['Record Number', 'Service Date', 'Desc'], parse_dates=True)
indf['Service Date'] = pandas.to_datetime(indf['Service Date'])
# DataFrame.sort() was removed in pandas 0.20; sort_values() replaces it.
indf.sort_values(['Service Date', 'Record Number'], inplace=True)
# Flag rows whose record number matches the previous row's record number.
indf['NUM'] = indf['Record Number'].shift(1)
msk = indf['NUM'] == indf['Record Number']
indf['MASK'] = msk
# print(indf)
# print(indf.dtypes)
# print(msk)
for i, row in indf.iterrows():
    if not row['MASK']:
        print('Unique.', row['Record Number'], row['Service Date'], row['Desc'])
    else:
        print('Dupe...', row['Record Number'], row['Service Date'], row['Desc'])
Sample data:
Record Number,Service Date,Desc
746611,05/26/2014,jiber
361783,05/27/2014,manawyddan
231485,06/02/2014,montespan
254004,06/03/2014,peshawar
369750,06/09/2014,cochleate
757701,06/10/2014,verticity
586983,06/16/2014,psychotherapist
643669,06/17/2014,discreation
252213,06/23/2014,hemiacetal
863001,06/24/2014,jiber
563798,06/30/2014,manawyddan
229226,07/01/2014,montespan
772189,07/07/2014,peshawar
412939,07/08/2014,cochleate
230209,07/14/2014,verticity
723012,07/15/2014,psychotherapist
455138,07/21/2014,discreation
605876,07/22/2014,hemiacetal
565893,07/28/2014,jiber
760420,07/29/2014,manawyddan
667002,05/27/2014,montespan
676209,06/17/2014,peshawar
828060,06/24/2014,cochleate
582821,07/01/2014,verticity
275503,07/15/2014,psychotherapist
667002,05/26/2014,discreation
676209,06/02/2014,hemiacetal
828060,06/09/2014,jiber
667002,06/10/2014,manawyddan
676209,06/17/2014,montespan
828060,06/23/2014,peshawar
667002,06/24/2014,cochleate
676209,06/30/2014,verticity
828060,07/21/2014,psychotherapist
667002,07/28/2014,discreation
676209,05/27/2014,hemiacetal
828060,06/03/2014,jiber
667002,06/10/2014,manawyddan
676209,06/16/2014,montespan
828060,06/24/2014,peshawar
667002,07/01/2014,cochleate
676209,07/07/2014,verticity
828060,07/28/2014,psychotherapist
667002,07/29/2014,discreation
828060,06/09/2014,hemiacetal
667002,06/10/2014,jiber
676209,06/17/2014,manawyddan
828060,06/23/2014,montespan
667002,06/24/2014,peshawar
676209,06/30/2014,cochleate
828060,07/21/2014,verticity
828060,06/09/2014,psychotherapist
667002,06/10/2014,discreation
676209,06/17/2014,hemiacetal
828060,06/23/2014,jiber
667002,06/24/2014,manawyddan
676209,06/30/2014,montespan
Edit: I think I may have figured it out. Does anyone see a better way to solve this? Thanks!
import pandas
with open('ALL.CSV') as inc:
    indf = pandas.read_csv(inc, usecols=['Record Number', 'Service Date', 'Desc'], parse_dates=True)
indf['Service Date'] = pandas.to_datetime(indf['Service Date'])
# DataFrame.sort() was removed in pandas 0.20; sort_values() replaces it.
indf.sort_values(['Service Date', 'Record Number'], inplace=True)
# Flag rows whose record number matches the previous row's record number.
indf['NUM'] = indf['Record Number'].shift(1)
msk = indf['NUM'] == indf['Record Number']
indf['MASK'] = msk
# Renumber rows 0..n-1 so that i below is a position; the old index is kept in an 'index' column.
indf.reset_index(inplace=True)
# print(indf)
# print(indf.dtypes)
# print(msk)
# cnt tracks the row that holds the first instance of the current run.
cnt = -1
for i, row in indf.iterrows():
    cnt += 1
    if not row['MASK']:
        cnt = i
        # print(i, cnt, 'Unique.', row['Record Number'], row['Service Date'], row['Desc'])
    else:
        cnt -= 1
        # print(i, cnt, 'Dupe...', row['Record Number'], row['Service Date'], row['Desc'])
        # print(indf['Desc'][cnt], indf['Desc'][i])
        # Use .loc rather than chained indexing so the assignment reliably writes to indf.
        indf.loc[cnt, 'Desc'] = '. '.join([indf.loc[cnt, 'Desc'], indf.loc[i, 'Desc']])
# print(indf)
# Keep only the first row of each (date, record) pair, which now holds the joined text.
indf.drop_duplicates(['Service Date', 'Record Number'], inplace=True)
del indf['index']
del indf['NUM']
del indf['MASK']
indf.to_csv('ALL_fixed.csv', date_format='%m/%d/%Y', index=False)
Answer 0 (score: 0)
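If you want a 'desc' column that contains all of the values (which is what I gather from your question), you can group the data by 'date' and 'record' and aggregate it by joining all of the 'desc' values into a single string: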
# this is the aggregation function
def desc_concat(x):
    return ", ".join(x)
# apply it to the data grouped by date and record
df.groupby(['date', 'record']).agg({'desc': desc_concat})
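For reference, here is a minimal sketch of the same grouping idea applied to the column names from the question. It assumes the separator '. ' used in the question's edit, and it reads a handful of rows copied from the sample data from an in-memory string rather than from ALL.CSV. The names csv_text and merged are only illustrative.

```python
import io

import pandas as pd

# A few rows taken from the question's sample data, including a repeated
# (Record Number, Service Date) pair. csv_text is an illustrative stand-in
# for the contents of ALL.CSV.
csv_text = """Record Number,Service Date,Desc
746611,05/26/2014,jiber
667002,05/26/2014,discreation
667002,06/10/2014,manawyddan
667002,06/10/2014,manawyddan
667002,06/10/2014,jiber
667002,06/10/2014,discreation
"""

indf = pd.read_csv(io.StringIO(csv_text))
indf['Service Date'] = pd.to_datetime(indf['Service Date'], format='%m/%d/%Y')

# Group on both key columns and join the descriptions of each group.
# groupby sorts the group keys by default, and rows inside a group keep
# their original order, so no explicit sort or loop is needed.
merged = (
    indf.groupby(['Service Date', 'Record Number'])['Desc']
        .agg('. '.join)
        .reset_index()
)

# Restore the column order used in the input file.
merged = merged[['Record Number', 'Service Date', 'Desc']]

print(merged.to_csv(index=False, date_format='%m/%d/%Y'))
```

Because every row that shares a date and record number falls into one group, this merges all such rows, not only adjacent ones. After sorting by those two columns the two notions coincide.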