I'm trying to solve this pandas exercise. I got a dataset from a real estate company on Kaggle, and the dataframe df looks like this.
id location type price
0 44525 Golden Mile House 4400000
1 44859 Nagüeles House 2400000
2 45465 Nagüeles House 1900000
3 50685 Nagüeles Plot 4250000
4 130728 Golden Mile House 32000000
5 130856 Nagüeles Plot 2900000
6 130857 Golden Mile House 3900000
7 130897 Golden Mile House 3148000
8 3484102 Marinha Plot 478000
9 3484124 Marinha Plot 2200000
10 3485461 Marinha House 1980000
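Now, based on the columns location and type, I have to find out which properties are undervalued or overvalued and which have a fair price. The desired result should look like this: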
id location type price Over_val Under_val Norm_val
0 44525 Golden Mile House 4400000 0 0 1
1 44859 Nagüeles House 2400000 0 0 1
2 45465 Nagüeles House 1900000 0 0 1
3 50685 Nagüeles Plot 4250000 0 1 0
4 130728 Golden Mile House 32000000 1 0 0
5 130856 Nagüeles Plot 2900000 0 1 0
6 130857 Golden Mile House 3900000 0 0 1
7 130897 Golden Mile House 3148000 0 0 1
8 3484102 Marinha Plot 478000 0 0 1
9 3484124 Marinha Plot 2200000 0 0 1
10 3485461 Marinha House 1980000 0 1 0
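I've been stuck on this for a while. What logic should I try to solve this problem?
Answer 0 (score: 3)
Here is my solution. The explanation is in the inline comments. It can probably be done in fewer steps; I'd be interested to learn how.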
import pandas as pd
# Replace this with whatever you have to load your data. This is set up for a sample data file I used
df = pd.read_csv('my_sample_data.csv', encoding='latin-1')
# Mean by location - type
mdf = df.set_index('id').groupby(['location','type'])['price'].mean().rename('mean').to_frame().reset_index()
# StdDev by location - type
sdf = df.set_index('id').groupby(['location','type'])['price'].std().rename('sd').to_frame().reset_index()
# Merge back into the original dataframe
df = df.set_index(['location','type']).join(mdf.set_index(['location','type'])).reset_index()
df = df.set_index(['location','type']).join(sdf.set_index(['location','type'])).reset_index()
# Add the indicator columns
df['Over_val'] = 0
df['Under_val'] = 0
df['Norm_val'] = 0  # named Norm_val to match the desired output in the question
# Update the indicators: more than 2 standard deviations from the group mean
df.loc[df['price'] > df['mean'] + 2 * df['sd'], 'Over_val'] = 1
df.loc[df['price'] < df['mean'] - 2 * df['sd'], 'Under_val'] = 1
# A property is normal when it is neither overvalued nor undervalued
df['Norm_val'] = ((df['Over_val'] + df['Under_val']) == 0).astype(int)
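On the "fewer steps" point: one possible shortcut is `groupby(...).transform(...)`, which returns the group mean and standard deviation already aligned to the original rows, so no merge is needed. The following is only a sketch under that assumption. The helper name `add_valuation_flags`, the parameter `k`, and the six-row `sample` frame (a subset of the rows in the question) are illustrative and not part of the original answer.

```python
import pandas as pd


def add_valuation_flags(df, k=2.0):
    """Return a copy of df with Over_val / Under_val / Norm_val columns.

    A price counts as over- or undervalued when it lies more than k sample
    standard deviations from the mean of its (location, type) group.
    """
    grouped = df.groupby(["location", "type"])["price"]
    mean = grouped.transform("mean")  # group mean, aligned to each row
    sd = grouped.transform("std")     # NaN for groups with a single row

    out = df.copy()
    # Comparisons against NaN are False, so single-row groups stay unflagged
    out["Over_val"] = (df["price"] > mean + k * sd).astype(int)
    out["Under_val"] = (df["price"] < mean - k * sd).astype(int)
    out["Norm_val"] = 1 - out["Over_val"] - out["Under_val"]
    return out


# A few rows taken from the question, for illustration only
sample = pd.DataFrame({
    "id": [44525, 50685, 130728, 130856, 130857, 130897],
    "location": ["Golden Mile", "Nagüeles", "Golden Mile",
                 "Nagüeles", "Golden Mile", "Golden Mile"],
    "type": ["House", "Plot", "House", "Plot", "House", "House"],
    "price": [4400000, 4250000, 32000000, 2900000, 3900000, 3148000],
})

print(add_valuation_flags(sample, k=1.0))
```

Because `transform` keeps the original index and row order, the temporary `mean` and `sd` columns never have to be added to the frame or dropped afterwards.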
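Answer 1 (score: 2)
Here is another possible approach. At 2 standard deviations, no property qualifies as over- or undervalued. At one standard deviation, only one property does.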
import pandas as pd
df = pd.DataFrame(data={}, columns=["id", "location", "type", "price"])
# data is already entered, left out for this example
df["id"] = prop_id
df["location"] = location
df["type"] = prop_type
df["price"] = price
# a function that returns the mean and standard deviation
def mean_std_dev(row):
    mask1 = df["location"] == row["location"]
    mask2 = df["type"] == row["type"]
    df_filt = df[mask1 & mask2]
    mean_price = df_filt["price"].mean()
    std_dev_price = df_filt["price"].std()
    return [mean_price, std_dev_price]
# create two columns and populate with the mean and std dev from function mean_std_dev
df[["mean", "standard deviation"]] = df.apply(
    lambda row: pd.Series(mean_std_dev(row)), axis=1
)
# create final columns
df["Over_val"] = df.apply(
    lambda x: 1 if x["price"] > x["mean"] + x["standard deviation"] else 0, axis=1
)
df["Under_val"] = df.apply(
    lambda x: 1 if x["price"] < x["mean"] - x["standard deviation"] else 0, axis=1
)
df["Norm_val"] = df.apply(
    lambda x: 1 if x["Over_val"] + x["Under_val"] == 0 else 0, axis=1
)
# delete the mean and standard deviation columns
# drop returns a new DataFrame, so assign the result back
df = df.drop(["mean", "standard deviation"], axis=1)
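The remark at the top of this answer (nothing qualifies at 2 standard deviations, one property at 1) can be checked by looking at each price's z-score within its (location, type) group. This is a rough sketch that rebuilds the eleven rows from the question by hand. The names `z` and `flagged_by_k` are illustrative, and the sample standard deviation (pandas' default, `ddof=1`) is assumed.

```python
import pandas as pd

# The eleven rows shown in the question, entered by hand
df = pd.DataFrame({
    "id": [44525, 44859, 45465, 50685, 130728, 130856,
           130857, 130897, 3484102, 3484124, 3485461],
    "location": ["Golden Mile", "Nagüeles", "Nagüeles", "Nagüeles",
                 "Golden Mile", "Nagüeles", "Golden Mile", "Golden Mile",
                 "Marinha", "Marinha", "Marinha"],
    "type": ["House", "House", "House", "Plot", "House", "Plot",
             "House", "House", "Plot", "Plot", "House"],
    "price": [4400000, 2400000, 1900000, 4250000, 32000000, 2900000,
              3900000, 3148000, 478000, 2200000, 1980000],
})

grouped = df.groupby(["location", "type"])["price"]
# z-score of each price within its group; NaN when the group has one row
z = (df["price"] - grouped.transform("mean")) / grouped.transform("std")

# ids lying more than k standard deviations from their group mean
flagged_by_k = {k: df.loc[z.abs() > k, "id"].tolist() for k in (1, 2)}
print(flagged_by_k)  # {1: [130728], 2: []}
```

The only flagged property is the 32,000,000 Golden Mile house, whose z-score is about 1.5. Groups with exactly two rows can never be flagged at k = 1 or higher, because each price then sits about 0.71 sample standard deviations from the group mean, and the single Marinha house has no standard deviation at all.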