I have a JSON string that I convert to a dictionary, and I then build a DataFrame from some of the key-value pairs in that dictionary.
import json
import pandas as pd
# json
a = """{
"cluster_id": 3,
"cluster_observation_data": [[1, 2, 3, 4, 5, 6, 7, 8], [2, 3, 4, 5, 6, 7, 8, 1]],
"cluster_observation_label": [0, 1],
"cluster_centroid": [1, 2, 3, 4, 5, 6, 7, 10],
"observation_id":["id_xyz_999","id_abc_000"]
}"""
# convert to dictionary
data = json.loads(a)
sub_dict = dict((k, data[k]) for k in ('cluster_observation_data', 'cluster_observation_label'))
train = pd.DataFrame.from_dict(sub_dict, orient='columns')
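After converting it to a DataFrame, I try to compute the Euclidean distance between each observation and the cluster_centroid stored in the data dictionary. The function itself works fine, but in the final train DataFrame I get NaNs.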
def distance_from_center(row):
    centre = data['cluster_centroid']
    obs_data = row[0]
    print('obs_data', obs_data)
    print('\n\n\n\n')
    print('center', centre)
    # print(type(obs_data))
    # print(type(centre))
    dist = sum([(a - b)**2 for a, b in zip(centre, obs_data)])
    print(dist)
    return dist
train.loc[:, 'center_dist'] = train.loc[:, ['cluster_observation_data']].apply(distance_from_center)
I can't figure out where I'm going wrong. Even a small hint would help.
Answer 0 (score: 1)
You need to pass the axis, for example:
train.loc[:, 'center_dist'] = train.loc[:, ['cluster_observation_data']].apply(distance_from_center, axis=1)
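The reason is that you want to apply the function to each list, that is, once per row. The documentation for DataFrame.apply says:
1 or 'columns': apply function to each row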
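To see where the NaNs come from: without an axis, DataFrame.apply calls the function once per column, so the result is a Series indexed by the column name. Assigning that Series to a new column aligns on the row labels 0 and 1, none of which match, so every cell becomes NaN. Below is a minimal sketch of the difference, using a trimmed copy of the question's data. The name `squared_distance` is a hypothetical print-free stand-in for `distance_from_center`, and `row.iloc[0]` is used instead of `row[0]` as an assumption, to avoid relying on positional integer lookup on a labelled Series.

```python
import pandas as pd

# Trimmed copy of the question's data (centroid plus the two observations)
centre = [1, 2, 3, 4, 5, 6, 7, 10]
train = pd.DataFrame({
    "cluster_observation_data": [[1, 2, 3, 4, 5, 6, 7, 8],
                                 [2, 3, 4, 5, 6, 7, 8, 1]],
    "cluster_observation_label": [0, 1],
})

def squared_distance(row):
    # Hypothetical helper: sum of squared differences to the centroid
    obs_data = row.iloc[0]
    return sum((a - b) ** 2 for a, b in zip(centre, obs_data))

# Default axis=0: one call for the whole column, result indexed by column name
per_column = train[["cluster_observation_data"]].apply(squared_distance)

# axis=1: one call per row, result indexed by the row labels 0 and 1
per_row = train[["cluster_observation_data"]].apply(squared_distance, axis=1)

print(per_column.index.tolist())
print(per_row.tolist())

# Only the per-row result lines up with the DataFrame's index
train["center_dist"] = per_row
```

Printing both indexes side by side makes the mismatch visible: the first result is labelled with the column name, the second with the row labels.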
Answer 1 (score: 0)
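Just change obs_data in distance_from_center() from row[0] to row: when apply is called on that single column (a Series rather than a DataFrame), the function already receives each cell's list directly. It should then work fine. I tried it and it works on my system.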
import json
import pandas as pd
# json
a = """{"cluster_id": 3,"cluster_observation_data": [[1, 2, 3, 4, 5,6, 7, 8], [2, 3, 4,5, 6, 7, 8, 1]],"cluster_observation_label": [0, 1],
"cluster_centroid": [1, 2, 3, 4, 5, 6, 7, 10],
"observation_id":["id_xyz_999","id_abc_000"]}"""
# convert to dictionary
data = json.loads(a)
sub_dict = dict((k, data[k]) for k in ('cluster_observation_data',
'cluster_observation_label'))
train = pd.DataFrame.from_dict(sub_dict, orient='columns')
def distance_from_center(row):
    centre = data['cluster_centroid']
    obs_data = row
    print('obs_data', obs_data)
    print('\n\n\n\n')
    print('center', centre)
    # print(type(obs_data))
    # print(type(centre))
    dist = sum([(a - b)**2 for a,b in zip(centre, obs_data)])
    print(dist)
    return dist
train.loc[:, 'center_dist'] = train.loc[:,'cluster_observation_data'].apply(distance_from_center)
Output:
obs_data [1, 2, 3, 4, 5, 6, 7, 8]
center [1, 2, 3, 4, 5, 6, 7, 10]
4
obs_data [2, 3, 4, 5, 6, 7, 8, 1]
center [1, 2, 3, 4, 5, 6, 7, 10]
88
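Note that the values 4 and 88 are squared Euclidean distances, because the function never takes a square root. If the actual Euclidean distance is wanted, one possible sketch uses math.dist from the standard library (Python 3.8+) with Series.apply. The helper name `euclidean_from_centre` is hypothetical, and the data is a trimmed copy of the question's.

```python
import math
import pandas as pd

# Trimmed copy of the question's data (centroid plus the two observations)
centre = [1, 2, 3, 4, 5, 6, 7, 10]
train = pd.DataFrame({
    "cluster_observation_data": [[1, 2, 3, 4, 5, 6, 7, 8],
                                 [2, 3, 4, 5, 6, 7, 8, 1]],
    "cluster_observation_label": [0, 1],
})

def euclidean_from_centre(obs_data):
    # Hypothetical helper: math.dist returns the Euclidean distance
    # between two points given as equal-length sequences
    return math.dist(centre, obs_data)

# Series.apply passes each cell (one list per observation) to the helper
train["center_dist"] = train["cluster_observation_data"].apply(euclidean_from_centre)
print(train["center_dist"])
```

The resulting values are the square roots of the squared distances shown in the output above.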