I have a pandas DataFrame in which 32-dimensional embeddings are stored in a column named Embeddings (a pandas.core.series.Series).
0,8431,-7.73677,110.372814,ID_YGK179,Indonesia,96,Yogyakarta,625,"[-0.08708319 -0.9635474
-1.075278 -0.8778672 1.0672983 0.21834892
0.10251518 -1.4207497 -1.3847003 -0.7889203 -0.58245313 -1.2558284
-0.44232526 -2.44585 1.3060646 -0.6015553 0.21264891 -0.62279683
-0.4118958 -0.10933076 0.2864734 0.42591774 0.35520273 -1.2562522
-1.3118799 0.1367726 0.89168227 0.08609396 -0.7965635 0.03220405
-1.2149535 0.06975704]"
1,8425,-8.82022551263183,115.171107687056,ID_BLI079,Indonesia,96,Bali,623,"[ 0.20398486 -0.3435272 -1.8947698 -1.0723802 1.2999498 0.211587
0.16329497 -0.09804655 -0.41587254 -0.09957021 0.8152087 -0.6022888
-0.10874949 -1.4237555 -0.02137504 -0.60817945 0.81695604 -0.0106029
1.2845753 0.18705958 0.5555717 0.53619224 1.6209115 1.3571581
-0.1660664 0.12530853 -0.12268435 -0.19951908 0.27602577 -0.66749376
-0.09328692 -0.07952076]"
2,8431,-8.23575827574888,114.351026639342,ID_BWI026,Indonesia,96,Banyuwangi,770,"[-0.14250259 -0.60264546 0.39676255 -0.24801618 0.61574996 -0.5373072
0.97321934 -0.22758694 -0.8498406 -0.86897266 0.565802 -1.383025
-0.16449492 -1.6958055 -0.25523412 -0.50068396 0.36182633 -1.5886943
0.56873196 -0.42583758 -0.16461776 0.12368935 1.470881 0.23292007
-1.2004089 0.34835646 0.48000658 0.27867964 -0.35181814 0.20428348
0.04278001 -0.16710897]"
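Embeddings is the last column in the three sample rows above. I want to group the data by the second column (guest_id; here 8431, 8425, 8431) and then compute the average of the embedding arrays within each group.
I tried the code below, but the variable a only ever holds a single NumPy array, so the subsequent zip call fails.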
#Get the average of n 32 dimension embeddings
def get_average(values):
    a = np.array(values.values)
    a = np.array(a[0].split()[1:-1]).astype(float)
    print(a.shape) # Returns (32,) n number of times
    return ([float(sum(col))/len(col) for col in zip(*a)])
#Read embeddings CSV file
hotelFrame = pd.read_csv('96_embedding')
hotelFrame = hotelFrame.iloc[: , [1, -1]] # select only 2 columns, guest_id and embedding
hotelFrame.columns = ['guest_id', 'embedding']
print(type(hotelFrame.embedding)) # Returns <class 'pandas.core.series.Series'>
average_embeddings = result.groupby("guest_id").embedding.agg(get_average).to_frame()
Error: TypeError: zip argument #1 must support iteration
How can I get a DataFrame of guest_id, embeddings_average as output? What am I doing wrong?
Answer 0: (score: 1)
You can create a helper column avg_embedding and then do a regular groupby:
# pd.np was removed in pandas 2.0, so call NumPy directly (import numpy as np)
df['avg_embedding'] = df.embedding.apply(lambda x: np.fromstring(x[1:-1], sep=' ').mean())
df.groupby("guest_id").avg_embedding.mean()
Result:
guest_id
8425 0.044565
8431 -0.268278
Name: avg_embedding, dtype: float64
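Note that the result above collapses each embedding to one scalar before grouping. If what is wanted is the element-wise average (one 32-dimensional vector per guest_id), something like the following may be closer. As for the original error: get_average only parses a[0], the first row of the group, into a 1-D float array, and zip(*a) then unpacks that array into individual floats, which are not iterable. The sketch below is one possible approach, not the answerer's method; the helper names parse_embedding and mean_embedding_per_guest and the 3-dimensional toy data are made up for illustration, and the column names follow the question's renamed guest_id and embedding columns.

```python
import numpy as np
import pandas as pd

def parse_embedding(text):
    """Turn a string such as '[0.1 -0.2\n 0.3]' into a 1-D float array."""
    return np.array(text.strip().strip("[]").split(), dtype=float)

def mean_embedding_per_guest(frame):
    """Return one row per guest_id holding the element-wise mean embedding."""
    vectors = frame["embedding"].apply(parse_embedding)
    # Spread each vector over its own numeric columns so groupby can average them.
    wide = pd.DataFrame(np.vstack(vectors.to_numpy()), index=frame.index)
    wide.insert(0, "guest_id", frame["guest_id"].to_numpy())
    means = wide.groupby("guest_id").mean()
    # Pack each averaged row back into a single array-valued cell.
    return pd.DataFrame({
        "guest_id": means.index.to_numpy(),
        "embeddings_average": list(means.to_numpy()),
    })

# Hypothetical toy data: 3 dimensions instead of 32, same string layout as the CSV.
toy = pd.DataFrame({
    "guest_id": [8431, 8425, 8431],
    "embedding": ["[1.0 2.0\n 3.0]", "[ 0.5 0.5 0.5]", "[3.0 4.0 5.0]"],
})
result = mean_embedding_per_guest(toy)
print(result)
```

With the real file, the same function could be called on hotelFrame once its columns have been renamed as in the question.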