Pandas: count duplicate entries

Date: 2021-03-15 12:57:44

Tags: python-3.x pandas data-analysis

Here is my sample dataframe:

Price   DateOfTrasfer   PAON    Street
115000  2018-07-13 00:00    4   THE LANE
24000   2018-04-10 00:00    20  WOODS TERRACE
56000   2018-06-22 00:00    6   HEILD CLOSE
220000  2018-05-25 00:00    25  BECKWITH CLOSE
58000   2018-05-09 00:00    23  AINTREE DRIVE
115000  2018-06-21 00:00    4   EDEN VALE MEWS
82000   2018-06-01 00:00    24  ARKLESS GROVE
93000   2018-07-06 00:00    14  HORTON CRESCENT
42500   2018-06-27 00:00    18  CATHERINE TERRACE
172000  2018-05-25 00:00    67  HOLLY CRESCENT

Here is the task to perform:

For any address that appears more than once in the dataset, define the holding period as the time between any two consecutive transactions involving that property (i.e. N(holding_periods) = N(appearances) - 1). Implement a function that takes the price paid data and returns the average length of the holding periods and the annualized change in value between purchase and sale, grouped by the year in which the holding period ends and by property type.

import pandas as pd


def holding_time(df):
    df = df.copy()

    # work only with dates (day precision)
    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)

    # combine PAON and Street into a single address key
    cols = ['PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
    df.drop(['PAON', 'Street'], axis=1, inplace=True)

    # count how many times each (address, Price) pair occurs
    # ('PPD' is a column of the full price paid data, not shown in the sample above)
    df = (df.groupby(['address', 'Price'], as_index=False)
            .agg({'PPD': 'size'})
            .rename(columns={'PPD': 'count_2'}))

    return df
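
For reference, the duplicate counting I am attempting could also be written roughly like this (a minimal sketch against the sample frame above; it ignores the 'PPD' column, which is not in the sample):

# Minimal sketch: count how many times each address appears in the sample frame
counts = (df.assign(address=df['PAON'].astype(str) + ' ' + df['Street'])
            .groupby('address')
            .size()
            .rename('appearances')
            .reset_index())
repeated = counts[counts['appearances'] > 1]   # addresses seen more than once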

2 Answers:

Answer 0 (score: 1)

This script creates columns with the individual holding times, the average holding time for that property, and the price changes during the holding periods:

import numpy as np
import pandas as pd

# assume df is defined above ...

# collect the transfer dates (column 1) and prices (column 0) for each street
hdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:,1]).reset_index(name='hgb')
pdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:,0]).reset_index(name='pgb')

# differences between consecutive transfers give the holding periods and price changes
df['holding_periods'] = hdf['hgb'].apply(lambda c: np.diff(c.astype(np.datetime64)))
df['price_changes']   = pdf['pgb'].apply(lambda c: np.diff(c.astype(np.int64)))

# rows with no matching group get an empty list, then average the holding periods
df['holding_periods'] = df['holding_periods'].fillna("").apply(list)
df['avg_hold'] = df['holding_periods'].apply(lambda c: np.array(c).astype(np.float64).mean() if c else 0).fillna(0)

# remove rows whose (Street, avg_hold) pair is duplicated, i.e. the extra entries for repeated properties
df.drop_duplicates(subset=['Street','avg_hold'], keep=False, inplace=True)

I created 2 new dummy entries for "Heild Close" to test it:

# Input:
     Price DateOfTransfer PAON             Street
0   115000     2018-07-13    4           THE LANE
1    24000     2018-04-10   20      WOODS TERRACE
2    56000     2018-06-22    6        HEILD CLOSE
3   220000     2018-05-25   25     BECKWITH CLOSE
4    58000     2018-05-09   23      AINTREE DRIVE
5   115000     2018-06-21    4     EDEN VALE MEWS
6    82000     2018-06-01   24      ARKLESS GROVE
7    93000     2018-07-06   14    HORTON CRESCENT
8    42500     2018-06-27   18  CATHERINE TERRACE
9   172000     2018-05-25   67     HOLLY CRESCENT
10   59000     2018-06-27   12        HEILD CLOSE
11  191000     2018-07-13    1        HEILD CLOSE


# Output:

    Price DateOfTransfer PAON             Street    holding_periods   price_changes  avg_hold
0  115000     2018-07-13    4           THE LANE                 []              []       0.0
1   24000     2018-04-10   20      WOODS TERRACE                 []              []       0.0
2   56000     2018-06-22    6        HEILD CLOSE  [5 days, 16 days]  [3000, 132000]      10.5
3  220000     2018-05-25   25     BECKWITH CLOSE                 []              []       0.0
4   58000     2018-05-09   23      AINTREE DRIVE                 []              []       0.0
5  115000     2018-06-21    4     EDEN VALE MEWS                 []              []       0.0
6   82000     2018-06-01   24      ARKLESS GROVE                 []              []       0.0
7   93000     2018-07-06   14    HORTON CRESCENT                 []              []       0.0
8   42500     2018-06-27   18  CATHERINE TERRACE                 []              []       0.0
9  172000     2018-05-25   67     HOLLY CRESCENT                 []              []       0.0 

Your question also mentions the annualized change in value between purchase and sale, grouped by the year in which the holding period ends and by property type, but there is no property-type column (PAON, perhaps?) and grouping by year would make the table extremely hard to read, so I did not implement that part. As it stands you have the holding time between every pair of consecutive transactions and every price change, so implementing a function that uses this information to plot annualized figures should be trivial if you want it.
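
If you do want the annualized figure, it can be derived from each price change and its holding period. Below is a rough sketch with a hypothetical helper, assuming a 365-day year; the buy price, sell price and holding days would come from the price_changes / holding_periods columns built above:

# Rough sketch (assumption: 365-day year); buy_price, sell_price and
# holding_days are taken from the columns computed above.
def annualised_change(buy_price, sell_price, holding_days):
    years = holding_days / 365.0
    return (sell_price / buy_price) ** (1.0 / years) - 1.0

# e.g. annualised_change(56000, 59000, 5) gives the annualized rate for the
# first HEILD CLOSE holding period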

Answer 1 (score: 0)

After checking the maximum and minimum average differences by hand, I had to modify the accepted solution so that it matches the manually computed results.

These are the datasets; the function below is a bit slow, so I would welcome a faster implementation.

urls = ['http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2020.csv',
        'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2019.csv',
        'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2018.csv']


def holding_time(df):
    df = df.copy()

    # keep only the columns needed for the calculation
    df = df[['Price', 'DateOfTrasfer', 'Prop_Type', 'Postcode', 'PAON', 'Street']]

    # keep only properties that appear more than once
    df = df[df.duplicated(subset=['Postcode', 'PAON', 'Street'], keep=False)]

    # build a single address key from postcode, PAON and street
    cols = ['Postcode', 'PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
    df['address'] = df['address'].apply(lambda x: x.replace(' ', '_'))

    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)

    # per address: mean difference between successive prices and mean gap in days
    df['avg_price'] = df.groupby(['address'])['Price'].transform(lambda x: x.diff().mean())
    df['avg_hold'] = df.groupby(['address'])['DateOfTrasfer'].transform(lambda x: x.diff().dt.days.mean())

    # keep one row per address and drop the helper columns
    df.drop_duplicates(subset=['address'], keep='first', inplace=True)
    df.drop(['Price', 'DateOfTrasfer', 'address'], axis=1, inplace=True)
    df = df.dropna()

    df['avg_hold'] = df['avg_hold'].map('Days {:.1f}'.format)
    df['avg_price'] = df['avg_price'].map('£{:,.1f}'.format)

    return df
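
To run it on one of the files above, something like the following should work. This is only a sketch: it assumes the raw Price Paid CSVs have no header row and follow the usual 16-column layout, so the column names below are assigned manually to match what holding_time expects.

import pandas as pd

# Assumed column layout of the raw Price Paid CSVs (no header row in the files);
# only the names used by holding_time matter here.
cols = ['TUID', 'Price', 'DateOfTrasfer', 'Postcode', 'Prop_Type', 'Old_New',
        'Duration', 'PAON', 'SAON', 'Street', 'Locality', 'Town', 'District',
        'County', 'PPD_Cat', 'Record_Status']

df_2018 = pd.read_csv(urls[2], header=None, names=cols)
print(holding_time(df_2018).head())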