python pandas: Compare consecutive rows, update/concatenate cells of the first row among consecutive duplicates

Time: 2014-05-25 13:12:12

Tags: python pandas duplicates rows

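On a DataFrame sorted by date and then by record number (df.sort(['Service Date', 'Record Number'])), I am trying to:

  1. Compare 'Service Date' and 'Record Number' with those of the row below.
  2. If the rows are duplicates, append/concatenate the content of 'Desc' from the second row onto 'Desc' in the first row.
  3. Append the third, fourth, etc. 'Desc' if necessary.

I want to end up with the 'Desc' contents of all consecutive duplicates collected in the first instance of the duplicate. Here is the code I have so far. I tried .shift(1) inside the for loop, but to no avail: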

    import pandas
    
    with open('ALL.CSV') as inc:
        indf = pandas.read_csv(inc, usecols=['Record Number', 'Service Date', 'Desc'], parse_dates=True)
        indf['Service Date'] = pandas.to_datetime(indf['Service Date'])
        # DataFrame.sort() was removed in later pandas releases; sort_values() replaces it
        indf.sort_values(['Service Date', 'Record Number'], inplace=True)
        # record number of the previous row (note: 'Service Date' is not compared here)
        indf['NUM'] = indf['Record Number'].shift(1)
        msk = indf['NUM'] == indf['Record Number']
        indf['MASK'] = msk
    #    print(indf)
    #    print(indf.dtypes)
    #    print(msk)
        for i, row in indf.iterrows():
            if not row['MASK']:
                print('Unique.', row['Record Number'], row['Service Date'], row['Desc'])
            else:
                print('Dupe...', row['Record Number'], row['Service Date'], row['Desc'])
    

Sample data:

    Record Number,Service Date,Desc
    746611,05/26/2014,jiber
    361783,05/27/2014,manawyddan
    231485,06/02/2014,montespan
    254004,06/03/2014,peshawar
    369750,06/09/2014,cochleate
    757701,06/10/2014,verticity
    586983,06/16/2014,psychotherapist
    643669,06/17/2014,discreation
    252213,06/23/2014,hemiacetal
    863001,06/24/2014,jiber
    563798,06/30/2014,manawyddan
    229226,07/01/2014,montespan
    772189,07/07/2014,peshawar
    412939,07/08/2014,cochleate
    230209,07/14/2014,verticity
    723012,07/15/2014,psychotherapist
    455138,07/21/2014,discreation
    605876,07/22/2014,hemiacetal
    565893,07/28/2014,jiber
    760420,07/29/2014,manawyddan
    667002,05/27/2014,montespan
    676209,06/17/2014,peshawar
    828060,06/24/2014,cochleate
    582821,07/01/2014,verticity
    275503,07/15/2014,psychotherapist
    667002,05/26/2014,discreation
    676209,06/02/2014,hemiacetal
    828060,06/09/2014,jiber
    667002,06/10/2014,manawyddan
    676209,06/17/2014,montespan
    828060,06/23/2014,peshawar
    667002,06/24/2014,cochleate
    676209,06/30/2014,verticity
    828060,07/21/2014,psychotherapist
    667002,07/28/2014,discreation
    676209,05/27/2014,hemiacetal
    828060,06/03/2014,jiber
    667002,06/10/2014,manawyddan
    676209,06/16/2014,montespan
    828060,06/24/2014,peshawar
    667002,07/01/2014,cochleate
    676209,07/07/2014,verticity
    828060,07/28/2014,psychotherapist
    667002,07/29/2014,discreation
    828060,06/09/2014,hemiacetal
    667002,06/10/2014,jiber
    676209,06/17/2014,manawyddan
    828060,06/23/2014,montespan
    667002,06/24/2014,peshawar
    676209,06/30/2014,cochleate
    828060,07/21/2014,verticity
    828060,06/09/2014,psychotherapist
    667002,06/10/2014,discreation
    676209,06/17/2014,hemiacetal
    828060,06/23/2014,jiber
    667002,06/24/2014,manawyddan
    676209,06/30/2014,montespan

Edit: I think I may have figured it out. Does anyone see a better way to solve this? Thanks!

    import pandas
    
    with open('ALL.CSV') as inc:
        indf = pandas.read_csv(inc, usecols=['Record Number', 'Service Date', 'Desc'], parse_dates=True)
        indf['Service Date'] = pandas.to_datetime(indf['Service Date'])
        # DataFrame.sort() was removed in later pandas releases; sort_values() replaces it
        indf.sort_values(['Service Date', 'Record Number'], inplace=True)
        # a row is a dupe when it matches the previous row on both key columns
        keys = indf[['Service Date', 'Record Number']]
        indf['MASK'] = (keys == keys.shift(1)).all(axis=1)
        # renumber the rows 0..n-1 so the labels follow the sorted order
        indf.reset_index(drop=True, inplace=True)
        cnt = -1  # label of the first row of the current run of dupes
        for i, row in indf.iterrows():
            if not row['MASK']:
                cnt = i
            else:
                # append this dupe's Desc to the first row of its run;
                # .loc avoids the chained assignment indf['Desc'][cnt] = ...
                indf.loc[cnt, 'Desc'] = '. '.join([indf.loc[cnt, 'Desc'], row['Desc']])
        # only the first row of each run (now holding every Desc) is kept
        indf.drop_duplicates(['Service Date', 'Record Number'], inplace=True)
        del indf['MASK']
        indf.to_csv('ALL_fixed.csv', date_format='%m/%d/%Y', index=False)
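
A loop-free variant of the same idea, offered as a minimal sketch rather than a tested drop-in: give every run of consecutive rows that share both keys its own id, then join 'Desc' within each run. The helper name merge_consecutive and the handful of inline rows (taken from the sample data above) are only for illustration; the column names are the ones used in the question.

```python
import pandas as pd

def merge_consecutive(df, keys=("Service Date", "Record Number"), sep=". "):
    """Collapse runs of consecutive rows with equal keys, joining 'Desc'."""
    keys = list(keys)
    df = df.sort_values(keys).reset_index(drop=True)
    # True where a row differs from the row above in at least one key column
    # (the first row always counts as a start because shift() leaves it empty)
    starts = (df[keys] != df[keys].shift(1)).any(axis=1)
    # a running count of run starts gives every run its own id
    run_id = starts.cumsum()
    agg_spec = {key: "first" for key in keys}
    agg_spec["Desc"] = sep.join
    return df.groupby(run_id).agg(agg_spec).reset_index(drop=True)

# a few rows copied from the sample data, standing in for ALL.CSV
sample = pd.DataFrame({
    "Record Number": [667002, 746611, 667002, 676209, 361783, 676209],
    "Service Date": ["05/26/2014", "05/26/2014", "06/10/2014",
                     "06/17/2014", "05/27/2014", "06/17/2014"],
    "Desc": ["discreation", "jiber", "manawyddan",
             "peshawar", "manawyddan", "montespan"],
})
sample["Service Date"] = pd.to_datetime(sample["Service Date"], format="%m/%d/%Y")

merged = merge_consecutive(sample)
print(merged)
```

Because the frame is sorted by both keys first, consecutive duplicates and duplicates in general are the same thing here. The run-id step only matters if the sort is dropped and just adjacent repeats should be merged.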
    

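1 Answer:

Answer 0: (score: 0)

If you want a column 'desc' that contains all the values (this is what I gathered from your question), then you can group the data by 'date' and 'record' and aggregate it by joining all the 'desc' values into one string: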

    # this is the aggregation function
    def desc_concat(x):
        return ", ".join(x)

    # apply it to the data grouped by date and record
    df.groupby(['date', 'record']).agg({'desc': desc_concat})
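
Applied to the column names from the question, the grouping above might look like the sketch below. It assumes every row that shares the same 'Service Date' and 'Record Number' should be merged, which matches the consecutive-duplicate requirement once the frame is sorted by those two columns. The three inline rows are copied from the sample data and stand in for ALL.CSV; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Record Number": [828060, 361783, 828060],
    "Service Date": pd.to_datetime(
        ["06/09/2014", "05/27/2014", "06/09/2014"], format="%m/%d/%Y"
    ),
    "Desc": ["jiber", "manawyddan", "hemiacetal"],
})

# as_index=False keeps the two grouping columns as ordinary columns
combined = df.groupby(["Service Date", "Record Number"], as_index=False).agg(
    {"Desc": ", ".join}
)

# same date format as the to_csv call in the question; with no path given,
# to_csv returns the CSV text instead of writing a file
csv_text = combined.to_csv(index=False, date_format="%m/%d/%Y")
print(csv_text)
```

groupby sorts the groups by key by default, so the result comes back ordered by date and then record number without a separate sort step. Within a group the descriptions keep the order they had in the input.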