This is my DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Id': [102, 103, 104, 303, 305],
                   'ExpG_Home': [1.8, 1.5, 1.6, 1.8, 2.9],
                   'ExpG_Away': [2.2, 1.3, 1.2, 2.8, 0.8],
                   'HomeG_Time': [[93, 109, 187], [169], [31, 159], [176], [16, 48, 66, 128]],
                   'AwayG_Time': [[90, 177], [], [], [123, 136], [40]]})
First, I need to create an array y that, for a given Id, takes its values from the same row (ExpG_Home and ExpG_Away):
y = [1 - (ExpG_Home + ExpG_Away), ExpG_Home, ExpG_Away]
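Second, and this is the part I find much harder: for the Id used to create y, the function below takes the corresponding lists from HomeG_Time and AwayG_Time and builds an array. Unfortunately, my function only handles one row at a time, and I need to do this for a large dataset.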
x1 = [1,0,0]
x2 = [0,1,0]
x3 = [0,0,1]
total_timeslot = 200 # number of timeslot per game.
k = 1 # constant
#For Id=102 with ExpG_Home=2.2 and ExpG_Away=1.8
HomeG_Time = [93, 109, 187]
AwayG_Time = [90, 177]
y = np.array([1-(2.2 + 1.8)/k, 2.2/k, 1.8/k])
# with k = 1 this evaluates to [-3.0, 2.2, 1.8]; the values [0.98, 0.011, 0.009] result from k = 200
def squared_diff(x1, x2, x3, y):
    ssd = []
    for k in range(total_timeslot):
        if k in HomeG_Time:
            ssd.append(sum((x2 - y) ** 2))
        elif k in AwayG_Time:
            ssd.append(sum((x3 - y) ** 2))
        else:
            ssd.append(sum((x1 - y) ** 2))
    return ssd
sum(squared_diff(x1, x2, x3, y))
Out[37]: 7.880400000000012
This output is for the first row only.
Answer 0 (score: 2)
Here is the complete snippet as given:
>>> import numpy as np
>>> x1 = np.array( [1,0,0] )
>>> x2 = np.array( [0,1,0] )
>>> x3 = np.array( [0,0,1] )
>>> total_timeslot = 200
>>> HomeG_Time = [93, 109, 187]
>>> AwayG_Time = [90, 177]
>>> ExpG_Home=2.2
>>> ExpG_Away=1.8
>>> y = np.array( [1 - (ExpG_Home + ExpG_Away), ExpG_Home, ExpG_Away] )
>>> def squared_diff(x1, x2, x3, y):
...     ssd = []
...     for k in range(total_timeslot):
...         if k in HomeG_Time:
...             ssd.append(sum((x2 - y) ** 2))
...         elif k in AwayG_Time:
...             ssd.append(sum((x3 - y) ** 2))
...         else:
...             ssd.append(sum((x1 - y) ** 2))
...     return ssd
...
>>> sum(squared_diff(x1, x2, x3, y))
4765.599999999989
Compute y as an (N, 3) array:
>>> y = np.array( df.apply(lambda row: [1 - (row.ExpG_Home + row.ExpG_Away),
... row.ExpG_Home, row.ExpG_Away ],
... axis=1).tolist() )
>>> y.shape
(5, 3)
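On a large frame, the same (N, 3) array can also be assembled column-wise, which avoids calling a Python lambda once per row. A minimal sketch, assuming the column names from the question; `build_y` is a hypothetical helper name, not part of pandas or NumPy:

```python
import numpy as np
import pandas as pd

def build_y(df):
    """Return an (N, 3) array with rows [1 - (home + away), home, away]."""
    home = df['ExpG_Home'].to_numpy(dtype=float)
    away = df['ExpG_Away'].to_numpy(dtype=float)
    # column_stack places the three length-N vectors side by side
    return np.column_stack([1 - (home + away), home, away])

# Only the two columns needed here, with the values from the question
df = pd.DataFrame({'ExpG_Home': [1.8, 1.5, 1.6, 1.8, 2.9],
                   'ExpG_Away': [2.2, 1.3, 1.2, 2.8, 0.8]})
y = build_y(df)
print(y.shape)
```

The result should be interchangeable with the `apply`-based y above, since both hold the same three values per row.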
Now compute the squared error for a given x:
>>> def squared_diff(x, y):
...     return np.sum( np.square(x - y), axis=1)
In your case, if error2 is squared_diff(x2, y), you want to add it once for every entry in HomeG_Time, so count the entries in each list:
>>> n3 = df.AwayG_Time.apply(len)
>>> n2 = df.HomeG_Time.apply(len)
>>> n1 = 200 - (n2 + n3)
The final sum of squared errors is then (following your calculation):
>>> squared_diff(x1, y) * n1 + squared_diff(x2, y) * n2 + squared_diff(x3, y) * n3
0 4766.4
1 2349.4
2 2354.4
3 6411.6
4 4496.2
dtype: float64
>>>
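The count-based formula above relies on every goal time being a distinct integer in range(200) and never appearing in both lists. The original loop handles those cases differently: it counts each slot once and checks the home list first. A hedged sketch that wraps the calculation in one function and derives the counts from sets, so that it should agree with the loop in those edge cases too; `total_squared_error` and its `total_timeslot` parameter are hypothetical names chosen for illustration:

```python
import numpy as np
import pandas as pd

def total_squared_error(df, total_timeslot=200):
    """Per-row sum of squared differences over all timeslots, without looping over slots."""
    home = df['ExpG_Home'].to_numpy(dtype=float)
    away = df['ExpG_Away'].to_numpy(dtype=float)
    y = np.column_stack([1 - (home + away), home, away])        # shape (N, 3)

    slots = range(total_timeslot)
    # Distinct in-range home slots; away slots exclude any slot already
    # counted as home, mirroring the if/elif order of the original loop.
    home_sets = [set(t).intersection(slots) for t in df['HomeG_Time']]
    away_sets = [set(t).intersection(slots) - h
                 for t, h in zip(df['AwayG_Time'], home_sets)]
    n2 = np.array([len(s) for s in home_sets])
    n3 = np.array([len(s) for s in away_sets])
    n1 = total_timeslot - n2 - n3

    eye = np.eye(3)                                              # rows are x1, x2, x3
    # err[i, j] = sum((x_{j+1} - y_i) ** 2)
    err = ((eye[None, :, :] - y[:, None, :]) ** 2).sum(axis=2)   # shape (N, 3)
    counts = np.column_stack([n1, n2, n3])
    return pd.Series((err * counts).sum(axis=1), index=df.index)

df = pd.DataFrame({'Id': [102, 103, 104, 303, 305],
                   'ExpG_Home': [1.8, 1.5, 1.6, 1.8, 2.9],
                   'ExpG_Away': [2.2, 1.3, 1.2, 2.8, 0.8],
                   'HomeG_Time': [[93, 109, 187], [169], [31, 159], [176], [16, 48, 66, 128]],
                   'AwayG_Time': [[90, 177], [], [], [123, 136], [40]]})
result = total_squared_error(df)
print(result)
```

For the sample data, where no edge case occurs, this should give the same five totals as the formula above. The per-row set construction is still a Python loop, but it runs once per row rather than once per timeslot.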
Answer 1 (score: 1)
def squared_diff(row):
    y = np.array([1 - (row.ExpG_Home + row.ExpG_Away), row.ExpG_Home, row.ExpG_Away])
    HomeG_Time = row.HomeG_Time
    AwayG_Time = row.AwayG_Time
    x1 = np.array([1, 0, 0])
    x2 = np.array([0, 1, 0])
    x3 = np.array([0, 0, 1])
    total_timeslot = 200
    ssd = []
    for k in range(total_timeslot):
        if k in HomeG_Time:
            ssd.append(sum((x2 - y) ** 2))
        elif k in AwayG_Time:
            ssd.append(sum((x3 - y) ** 2))
        else:
            ssd.append(sum((x1 - y) ** 2))
    return sum(ssd)
df.apply(squared_diff, axis=1)
Out[]:
0 4766.4
1 2349.4
2 2354.4
3 6411.6
4 4496.2
Answer 2 (score: 1)
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Id': [102,103,104,303,305],'ExpG_Home':[1.8,1.5,1.6,1.8,2.9],
'ExpG_Away':[2.2,1.3,1.2,2.8,0.8],
'HomeG_Time':[[93, 109, 187],[169], [31, 159],[176],[16, 48, 66, 128]],
'AwayG_Time':[[90, 177],[],[],[123,136],[40]]})
x1 = [1,0,0]
x2 = [0,1,0]
x3 = [0,0,1]
k=1
total_timeslot = 200 # number of timeslot per game.
def squared_diff(x1, x2, x3, AwayG_Time, HomeG_Time, y):
    ssd = []
    for k in range(total_timeslot):
        if k in HomeG_Time:
            ssd.append(sum((x2 - y) ** 2))
        elif k in AwayG_Time:
            ssd.append(sum((x3 - y) ** 2))
        else:
            ssd.append(sum((x1 - y) ** 2))
    return ssd
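# Pick the columns by name so the positions used below (0, 3, 5, 6, 7) do not depend on df's column order
cols = ['AwayG_Time', 'ExpG_Away', 'ExpG_Home', 'HomeG_Time', 'Id']
s = pd.DataFrame(pd.concat([df[cols], 1 - (df['ExpG_Home'] + df['ExpG_Away'])/k, df['ExpG_Home']/k, df['ExpG_Away']/k], axis=1).values)
df['res'] = s.apply(lambda x: sum(squared_diff(x1, x2, x3, x[0], x[3], np.array([x[5], x[6], x[7]]))), axis=1)
del s
print(df)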
Output:
   AwayG_Time  ExpG_Away  ExpG_Home         HomeG_Time   Id     res
0   [90, 177]        2.2        1.8     [93, 109, 187]  102  4766.4
1          []        1.3        1.5              [169]  103  2349.4
2          []        1.2        1.6          [31, 159]  104  2354.4
3  [123, 136]        2.8        1.8              [176]  303  6411.6
4        [40]        0.8        2.9  [16, 48, 66, 128]  305  4496.2