Here is my sample DataFrame:
Price DateOfTrasfer PAON Street
115000 2018-07-13 00:00 4 THE LANE
24000 2018-04-10 00:00 20 WOODS TERRACE
56000 2018-06-22 00:00 6 HEILD CLOSE
220000 2018-05-25 00:00 25 BECKWITH CLOSE
58000 2018-05-09 00:00 23 AINTREE DRIVE
115000 2018-06-21 00:00 4 EDEN VALE MEWS
82000 2018-06-01 00:00 24 ARKLESS GROVE
93000 2018-07-06 00:00 14 HORTON CRESCENT
42500 2018-06-27 00:00 18 CATHERINE TERRACE
172000 2018-05-25 00:00 67 HOLLY CRESCENT
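This is the task to perform:
For any address that appears more than once in the dataset, define a holding period as the time between any two consecutive transactions involving that property (i.e. N(holding_periods) = N(appearances) - 1). Implement a function that takes price paid data and returns the average length of a holding period and the annualised change in value between purchase and sale, grouped by the year in which the holding period ends and by property type.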
def holding_time(df):
    df = df.copy()
    # to work only with dates (day)
    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)
    cols = ['PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
    df.drop(["PAON", 'Street'], axis=1, inplace=True)
    df = df.groupby(['address', 'Price'], as_index=False).agg({'PPD': 'size'})\
           .rename(columns={'PPD': 'count_2'})
    return df
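Answer 0 (score: 1)
This script creates columns containing the individual holding times, the average holding time for the property, and the price changes over the holding periods: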
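import numpy as np
import pandas as pd
# assume df is defined above ...
# Collect each street's transfer dates (column 1) and prices (column 0) as arrays.
hdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:,1]).reset_index(name='hgb')
pdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:,0]).reset_index(name='pgb')
# Differences between consecutive entries give the holding periods and price changes.
df['holding_periods'] = hdf['hgb'].apply(lambda c: np.diff(c.astype(np.datetime64)))
df['price_changes'] = pdf['pgb'].apply(lambda c: np.diff(c.astype(np.int64)))
# Rows without a result become empty lists; the average is 0 when there is no holding period.
df['holding_periods'] = df['holding_periods'].fillna("").apply(list)
df['avg_hold'] = df['holding_periods'].apply(lambda c: np.array(c).astype(np.float64).mean() if c else 0).fillna(0)
# Drop every row whose (Street, avg_hold) pair occurs more than once,
# which removes the extra rows of repeated streets.
df.drop_duplicates(subset=['Street','avg_hold'], keep=False, inplace=True)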
I created 2 new dummy entries for "Heild Close" to test it:
# Input:
Price DateOfTransfer PAON Street
0 115000 2018-07-13 4 THE LANE
1 24000 2018-04-10 20 WOODS TERRACE
2 56000 2018-06-22 6 HEILD CLOSE
3 220000 2018-05-25 25 BECKWITH CLOSE
4 58000 2018-05-09 23 AINTREE DRIVE
5 115000 2018-06-21 4 EDEN VALE MEWS
6 82000 2018-06-01 24 ARKLESS GROVE
7 93000 2018-07-06 14 HORTON CRESCENT
8 42500 2018-06-27 18 CATHERINE TERRACE
9 172000 2018-05-25 67 HOLLY CRESCENT
10 59000 2018-06-27 12 HEILD CLOSE
11 191000 2018-07-13 1 HEILD CLOSE
# Output:
Price DateOfTransfer PAON Street holding_periods price_changes avg_hold
0 115000 2018-07-13 4 THE LANE [] [] 0.0
1 24000 2018-04-10 20 WOODS TERRACE [] [] 0.0
2 56000 2018-06-22 6 HEILD CLOSE [5 days, 16 days] [3000, 132000] 10.5
3 220000 2018-05-25 25 BECKWITH CLOSE [] [] 0.0
4 58000 2018-05-09 23 AINTREE DRIVE [] [] 0.0
5 115000 2018-06-21 4 EDEN VALE MEWS [] [] 0.0
6 82000 2018-06-01 24 ARKLESS GROVE [] [] 0.0
7 93000 2018-07-06 14 HORTON CRESCENT [] [] 0.0
8 42500 2018-06-27 18 CATHERINE TERRACE [] [] 0.0
9 172000 2018-05-25 67 HOLLY CRESCENT [] [] 0.0
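Your question also mentions the annualised change in value between purchase and sale, grouped by the year in which the holding period ends and by property type, but there is no property-type column (perhaps PAON?) and grouping by year would make the table very hard to read, so I did not implement it. As it stands, you have the holding time between each pair of transactions and each price change, so implementing a function that plots annualised data from this information should be straightforward if you want it.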
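The annualised figure the question asks for could be sketched roughly as follows. This is only a minimal sketch under several assumptions, not the method of either poster: it assumes a `Prop_Type` column exists (the sample data has none), it identifies a property by `PAON` plus `Street`, it uses the `DateOfTransfer` spelling from the test table above, and it takes "annualised change" to mean the compound rate (sale / purchase) ** (365.25 / days) - 1. The function name `annualised_by_year_and_type` and the sample rows are made up for illustration.

```python
import pandas as pd

def annualised_by_year_and_type(df):
    """Mean holding period (days) and mean annualised price change,
    grouped by the year the holding period ends and by property type."""
    df = df.copy()
    df["DateOfTransfer"] = pd.to_datetime(df["DateOfTransfer"])
    keys = ["PAON", "Street"]  # assumption: these two columns identify a property
    # Consecutive transactions of one property must be in date order.
    df = df.sort_values(keys + ["DateOfTransfer"])
    grouped = df.groupby(keys, sort=False)
    df["hold_days"] = grouped["DateOfTransfer"].diff().dt.days
    df["prev_price"] = grouped["Price"].shift()
    # The first sale of each property has no predecessor, and a same-day
    # resale cannot be annualised, so both kinds of row are dropped.
    df = df[df["hold_days"] > 0].copy()
    # Assumption: compound annual rate over the holding period.
    df["annualised_change"] = (
        (df["Price"] / df["prev_price"]) ** (365.25 / df["hold_days"]) - 1
    )
    df["end_year"] = df["DateOfTransfer"].dt.year
    return (df.groupby(["end_year", "Prop_Type"])
              .agg(avg_hold_days=("hold_days", "mean"),
                   avg_annualised_change=("annualised_change", "mean"))
              .reset_index())

# Hypothetical rows: one property sold three times, one sold once.
sample = pd.DataFrame({
    "Price": [100000, 110000, 121000, 50000],
    "DateOfTransfer": ["2016-01-01", "2017-01-01", "2018-01-01", "2018-03-01"],
    "PAON": ["6", "6", "6", "20"],
    "Street": ["HEILD CLOSE", "HEILD CLOSE", "HEILD CLOSE", "WOODS TERRACE"],
    "Prop_Type": ["T", "T", "T", "S"],
})
result = annualised_by_year_and_type(sample)
print(result)
```

Properties that were sold only once contribute no rows, which matches N(holding_periods) = N(appearances) - 1. If a simple (non-compound) rate is wanted instead, only the `annualised_change` line needs to change.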
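Answer 1 (score: 0)
After checking against manually computed maximum and minimum average differences, I had to modify the accepted solution so that it matches the manual results.
These are the datasets. The function is somewhat slow, so I would welcome a faster implementation.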
urls = ['http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2020.csv',
'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2019.csv',
'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2018.csv']
def holding_time(df):
    df = df.copy()
    df = df[['Price', 'DateOfTrasfer', 'Prop_Type', 'Postcode', 'PAON', 'Street']]
    df = df[df.duplicated(subset=['Postcode', 'PAON', 'Street'], keep=False)]
    cols = ['Postcode', 'PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
    df['address'] = df['address'].apply(lambda x: x.replace(' ', '_'))
    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)
    df['avg_price'] = df.groupby(['address'])['Price'].transform(lambda x: x.diff().mean())
    df['avg_hold'] = df.groupby(['address'])['DateOfTrasfer'].transform(lambda x: x.diff().dt.days.mean())
    df.drop_duplicates(subset=['address'], keep='first', inplace=True)
    df.drop(['Price', 'DateOfTrasfer', 'address'], axis=1, inplace=True)
    df = df.dropna()
    df['avg_hold'] = df['avg_hold'].map('Days {:.1f}'.format)
    df['avg_price'] = df['avg_price'].map('£{:,.1F}'.format)
    return df
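Since a faster implementation is asked for, here is one possible vectorised sketch. It is not the accepted solution: it groups directly on the key columns instead of building an `address` string with a row-wise `apply`, and it uses the built-in grouped `diff` and `mean` instead of `transform(lambda ...)`. It also sorts each address's transactions by date, which the function above does not do; that is an assumption about the intent, since unsorted rows can produce negative holding periods. The name `holding_time_fast` and the sample rows are hypothetical, and the results are left numeric rather than formatted as strings.

```python
import pandas as pd

def holding_time_fast(df):
    keys = ["Postcode", "PAON", "Street"]
    df = df[["Price", "DateOfTrasfer", "Prop_Type"] + keys].copy()
    df["DateOfTrasfer"] = pd.to_datetime(df["DateOfTrasfer"])
    # Rows with a missing key cannot be matched to an address.
    df = df.dropna(subset=keys)
    # Keep only addresses that appear more than once.
    df = df[df.duplicated(subset=keys, keep=False)]
    # Assumption: transactions should be compared in chronological order.
    df = df.sort_values(keys + ["DateOfTrasfer"])
    grouped = df.groupby(keys, sort=False)
    df["price_diff"] = grouped["Price"].diff()
    df["hold_days"] = grouped["DateOfTrasfer"].diff().dt.days
    # mean() skips the NaN that diff() leaves on each address's first row.
    return (df.groupby(keys, sort=False)
              .agg(Prop_Type=("Prop_Type", "first"),
                   avg_price=("price_diff", "mean"),
                   avg_hold=("hold_days", "mean"))
              .reset_index())

# Hypothetical rows, deliberately out of date order.
sample = pd.DataFrame({
    "Price": [130, 100, 120, 200],
    "DateOfTrasfer": ["2018-02-10", "2018-01-01", "2018-01-11", "2018-05-05"],
    "Prop_Type": ["D", "D", "D", "F"],
    "Postcode": ["AB1", "AB1", "AB1", "CD2"],
    "PAON": ["1", "1", "1", "2"],
    "Street": ["A ST", "A ST", "A ST", "B RD"],
})
out = holding_time_fast(sample)
print(out)
```

Formatting such as 'Days {:.1f}' or '£{:,.1F}' can be applied to `avg_hold` and `avg_price` afterwards if the string output of the original function is needed.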