I have the following sales details for one restaurant.
+----------+------------+---------+----------+
| Location | Units Sold | Revenue | Footfall |
+----------+------------+---------+----------+
| Loc - 01 | 100 | 1,150 | 85 |
+----------+------------+---------+----------+
From the restaurant data in the table below, I want to find the restaurant that is most correlated with the one above.
+----------+------------+---------+----------+
| Location | Units Sold | Revenue | Footfall |
+----------+------------+---------+----------+
| Loc - 02 | 100 | 1,250 | 60 |
| Loc - 03 | 90 | 990 | 90 |
| Loc - 04 | 120 | 1,200 | 98 |
| Loc - 05 | 115 | 1,035 | 87 |
| Loc - 06 | 89 | 1,157 | 74 |
| Loc - 07 | 110 | 1,265 | 80 |
+----------+------------+---------+----------+
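Please guide me on how to do this with Python or pandas.
Note: by "correlated" I mean the restaurant that is the closest match (most similar) in Units Sold, Revenue, and Footfall.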
Answer 0 (score: 4)
If your notion of correlation can be described as the minimal Euclidean distance, the solution is:
import numpy as np
# Convert the Revenue columns to numeric by dropping the thousands separator
df1['Revenue'] = df1['Revenue'].str.replace(',','').astype(int)
df2['Revenue'] = df2['Revenue'].str.replace(',','').astype(int)
# Euclidean distance from each row of df2 to the first row of df1
dist = np.sqrt((df2['Units Sold'] - df1.loc[0, 'Units Sold'])**2 +
               (df2['Revenue'] - df1.loc[0, 'Revenue'])**2 +
               (df2['Footfall'] - df1.loc[0, 'Footfall'])**2)
print (dist)
0 103.077641
1 160.390149
2 55.398556
3 115.991379
4 17.058722
5 115.542200
dtype: float64
# Get the index of the minimal distance and select that row of df2
print (df2.loc[[dist.idxmin()]])
Location Units Sold Revenue Footfall
4 Loc - 06 89 1157 74
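For anyone who wants to run the distance calculation end to end, here is a minimal self-contained sketch. It rebuilds both tables from the question by hand and swaps the explicit square-root formula for `np.linalg.norm`, which computes the same Euclidean distance. The variable names `cols`, `target`, `diffs`, and `nearest` are my own and are not part of the answer above.

```python
import numpy as np
import pandas as pd

# Data copied from the question; Revenue is kept as text with thousands separators.
df1 = pd.DataFrame({
    "Location": ["Loc - 01"],
    "Units Sold": [100],
    "Revenue": ["1,150"],
    "Footfall": [85],
})
df2 = pd.DataFrame({
    "Location": ["Loc - 02", "Loc - 03", "Loc - 04",
                 "Loc - 05", "Loc - 06", "Loc - 07"],
    "Units Sold": [100, 90, 120, 115, 89, 110],
    "Revenue": ["1,250", "990", "1,200", "1,035", "1,157", "1,265"],
    "Footfall": [60, 90, 98, 87, 74, 80],
})

# Assumed feature columns used for the comparison.
cols = ["Units Sold", "Revenue", "Footfall"]

# Strip the thousands separator so Revenue becomes an integer column.
for frame in (df1, df2):
    frame["Revenue"] = frame["Revenue"].str.replace(",", "").astype(int)

# Row-wise Euclidean distance between every candidate and the reference row.
target = df1.loc[0, cols].to_numpy(dtype=float)
diffs = df2[cols].to_numpy(dtype=float) - target
dist = pd.Series(np.linalg.norm(diffs, axis=1), index=df2.index)

# Location of the candidate with the smallest distance.
nearest = df2.loc[dist.idxmin(), "Location"]
print(nearest)
```

Note that the columns are used on their raw scales here, as in the answer, so Revenue contributes the largest differences to the distance.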
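Answer 1 (score: 2)
There may be a better way to do this, but I think this works. It is quite verbose because I tried to keep the code clean and readable:
First, let's use the custom NumPy function from this post.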
import numpy as np
import pandas as pd
def find_nearest(array, value):
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return array[idx]
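Then, using the dataframe's columns as arrays, pass in the values from the first dataframe to find the closest match in each column.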
us = find_nearest(df2['Units Sold'],df['Units Sold'][0])
ff = find_nearest(df2['Footfall'],df['Footfall'][0])
rev = find_nearest(df2['Revenue'],df['Revenue'][0])
print(us,ff,rev,sep=',')
100,87,1157
Then return the rows of the dataframe that match any of the three conditions:
new_df = (df2.loc[
    (df2['Units Sold'] == us) |
    (df2['Footfall'] == ff) |
    (df2['Revenue'] == rev)])
This gives us:
Location Units Sold Revenue Footfall
0 Loc - 02 100 1250 60
3 Loc - 05 115 1035 87
4 Loc - 06 89 1157 74
Answer 2 (score: 2)
This applies to the numeric columns; I may have over-generalized it. I also set the index to the 'Location' column.
def fix(d):
    d.update(
        d.astype(str).replace(',', '', regex=True)
         .apply(pd.to_numeric, errors='ignore')
    )
    d.set_index('Location', inplace=True)

fix(df1)
fix(df2)
df2.loc[[df2.sub(df1.loc['Loc - 01']).abs().sum(1).idxmin()]]
Units Sold Revenue Footfall
Location
Loc - 06 89 1157 74
df2.loc[[df2.sub(df1.loc['Loc - 01']).pow(2).sum(1).pow(.5).idxmin()]]
Units Sold Revenue Footfall
Location
Loc - 06 89 1157 74
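The two one-liners above use the sum of absolute differences (Manhattan distance) and the square root of the sum of squared differences (Euclidean distance). As a rough sketch, both can be folded into one helper with an exponent parameter. The function name `nearest_location` and its parameter `p` are hypothetical, and the frames are rebuilt by hand with 'Location' as the index and Revenue already numeric.

```python
import pandas as pd

# Data copied from the question, indexed by Location, Revenue already numeric.
df1 = pd.DataFrame(
    {"Units Sold": [100], "Revenue": [1150], "Footfall": [85]},
    index=pd.Index(["Loc - 01"], name="Location"),
)
df2 = pd.DataFrame(
    {
        "Units Sold": [100, 90, 120, 115, 89, 110],
        "Revenue": [1250, 990, 1200, 1035, 1157, 1265],
        "Footfall": [60, 90, 98, 87, 74, 80],
    },
    index=pd.Index(
        ["Loc - 02", "Loc - 03", "Loc - 04",
         "Loc - 05", "Loc - 06", "Loc - 07"],
        name="Location",
    ),
)


def nearest_location(candidates, target, p=2):
    """Return the index label of the row closest to target.

    Hypothetical helper: p=1 gives the Manhattan distance,
    p=2 gives the Euclidean distance.
    """
    dist = candidates.sub(target).abs().pow(p).sum(axis=1).pow(1 / p)
    return dist.idxmin()


target = df1.loc["Loc - 01"]
manhattan = nearest_location(df2, target, p=1)
euclidean = nearest_location(df2, target, p=2)
print(manhattan, euclidean, sep=", ")
```

For this data both metrics agree, but they can disagree on other inputs because squaring penalizes one large difference more heavily than several small ones.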