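Average of embedding arrays per group in pandas

Time: 2019-12-08 18:33:03

Tags: python pandas numpy numpy-ndarray

I have a pandas DataFrame in which 32-dimensional embeddings are stored in a column named Embeddings, which is a pandas.core.series.Series.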

0,8431,-7.73677,110.372814,ID_YGK179,Indonesia,96,Yogyakarta,625,"[-0.08708319 -0.9635474 
 -1.075278   -0.8778672   1.0672983   0.21834892
  0.10251518 -1.4207497  -1.3847003  -0.7889203  -0.58245313 -1.2558284
 -0.44232526 -2.44585     1.3060646  -0.6015553   0.21264891 -0.62279683
 -0.4118958  -0.10933076  0.2864734   0.42591774  0.35520273 -1.2562522
 -1.3118799   0.1367726   0.89168227  0.08609396 -0.7965635   0.03220405
 -1.2149535   0.06975704]"

1,8425,-8.82022551263183,115.171107687056,ID_BLI079,Indonesia,96,Bali,623,"[ 0.20398486 -0.3435272  -1.8947698  -1.0723802   1.2999498   0.211587
  0.16329497 -0.09804655 -0.41587254 -0.09957021  0.8152087  -0.6022888
 -0.10874949 -1.4237555  -0.02137504 -0.60817945  0.81695604 -0.0106029
  1.2845753   0.18705958  0.5555717   0.53619224  1.6209115   1.3571581
 -0.1660664   0.12530853 -0.12268435 -0.19951908  0.27602577 -0.66749376
 -0.09328692 -0.07952076]"

2,8431,-8.23575827574888,114.351026639342,ID_BWI026,Indonesia,96,Banyuwangi,770,"[-0.14250259 -0.60264546  0.39676255 -0.24801618  0.61574996 -0.5373072
  0.97321934 -0.22758694 -0.8498406  -0.86897266  0.565802   -1.383025
 -0.16449492 -1.6958055  -0.25523412 -0.50068396  0.36182633 -1.5886943
  0.56873196 -0.42583758 -0.16461776  0.12368935  1.470881    0.23292007
 -1.2004089   0.34835646  0.48000658  0.27867964 -0.35181814  0.20428348
  0.04278001 -0.16710897]"

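Embeddings is the last column in the three sample rows above. I want to group the data by the second column (guest_id: 8431, 8425, 8431) and then compute the average of the embedding arrays in each group.

I tried the code below, but the variable a ends up holding only a single NumPy array, so the subsequent zip call fails.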

#Get the average of n 32 dimension embeddings
def get_average(values):
    a = np.array(values.values)
    a = np.array(a[0].split()[1:-1]).astype(float)
    print(a.shape)  # Returns (32,) n number of times
    return ([float(sum(col))/len(col) for col in zip(*a)])

#Read embeddings CSV file
hotelFrame = pd.read_csv('96_embedding')
hotelFrame = hotelFrame.iloc[: , [1, -1]] # select only 2 columns, guest_id and embedding
hotelFrame.columns = ['guest_id', 'embedding']
print(type(hotelFrame.embedding)) # Returns <class 'pandas.core.series.Series'>

average_embeddings = result.groupby("guest_id").embedding.agg(get_average).to_frame() 

Error: TypeError: zip argument #1 must support iteration

How can I get a DataFrame with guest_id and embeddings_average in the output? What am I doing wrong?

1 Answer:

Answer 0: (score: 1)

You can create a helper column, avg_embedding, holding the mean of each row's embedding, and then do a regular groupby:

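import numpy as np  # the pd.np alias was removed in pandas 2.0, so import NumPy directly
# Parse each bracketed string into a float array and take its scalar mean
df['avg_embedding'] = df.embedding.apply(lambda x: np.fromstring(x[1:-1], sep=' ').mean())
df.groupby("guest_id").avg_embedding.mean()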

Result:

guest_id
8425    0.044565
8431   -0.268278
Name: avg_embedding, dtype: float64
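
The result above is a single scalar per guest_id. If what is wanted is instead one 32-dimensional mean vector per guest_id, the following is a minimal sketch of one possible approach, not the answerer's method. It also shows where the question's code goes wrong: get_average parses only the first string of each group, so a is a 1-D array of floats, and zip(*a) then receives plain floats, which are not iterable. The helper names parse_embedding and mean_embedding_per_group are hypothetical, and the sketch assumes every embedding string has the same number of values.

```python
import numpy as np
import pandas as pd


def parse_embedding(text):
    """Parse a string such as "[ 0.1 -0.2\n  0.3]" into a 1-D float array."""
    # Strip the surrounding brackets first; str.split() with no argument
    # then splits on any run of whitespace, including the embedded newlines.
    return np.array(text.strip().strip("[]").split(), dtype=float)


def mean_embedding_per_group(frame, key="guest_id", column="embedding"):
    """Return one element-wise mean vector per distinct value of `key`.

    Hypothetical helper: assumes all embeddings have the same length.
    """
    # Parse every row and stack the vectors into an (n_rows, n_dims) matrix.
    matrix = np.vstack(frame[column].map(parse_embedding).to_numpy())
    # One column per embedding dimension, indexed by the grouping key.
    wide = pd.DataFrame(matrix, index=frame[key].to_numpy())
    # Averaging each column within a group gives the element-wise mean.
    means = wide.groupby(level=0).mean()
    return pd.DataFrame({
        key: means.index.to_numpy(),
        "embeddings_average": list(means.to_numpy()),
    })
```

With the question's frame this would be called as mean_embedding_per_group(hotelFrame). Expanding the embeddings into one numeric column per dimension lets pandas do the averaging natively, rather than aggregating Python objects inside agg.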