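I am trying to compare pairs of similar names in the name column of a pandas DataFrame. nameMatching is a function that takes two lists, compares them, and returns a score. sample_data is a Spark DataFrame. Suppose I have a Spark DataFrame like this: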
phone,name
1234,sultan
1234,multan
1234,john
I then want the pandas grouped map UDF to produce output like this:
sultan,multan,0.6
sultan,john,0.1
multan,john,0.1
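Put differently, the desired result is one row per unordered pair of names within a phone group. A minimal pure-Python sketch of that pairing follows. `difflib.SequenceMatcher` is only an assumed stand-in for nameMatching, so the scores it prints will differ from the 0.6 and 0.1 shown above, and `stand_in_score` and `score_pairs` are hypothetical names.

```python
from difflib import SequenceMatcher
from itertools import combinations


def stand_in_score(a, b):
    # Assumption: placeholder for nameMatching, which is not shown in this post.
    return SequenceMatcher(None, a, b).ratio()


def score_pairs(names):
    # combinations(names, 2) yields each unordered pair exactly once,
    # in the order the names appear in the input.
    return [(n1, n2, stand_in_score(n1, n2)) for n1, n2 in combinations(names, 2)]


pairs = score_pairs(["sultan", "multan", "john"])
for n1, n2, sc in pairs:
    print(f"{n1},{n2},{sc:.2f}")
```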
Here is my code:
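import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# schema and nameMatching are defined elsewhere.
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def apply_model(sample_pd):
    name1 = []
    name2 = []
    score = []
    # Compare every row with each later row in the same group.
    for i in range(len(sample_pd) - 1):
        n1 = sample_pd.iloc[i, 4]  # name column, selected by position
        for j in range(i + 1, len(sample_pd)):
            n2 = sample_pd.iloc[j, 4]
            name2.append(n2)
            name1.append(n1)
            sc = nameMatching(list(n1), list(n2))
            score.append(sc)
    return pd.DataFrame({'S': name1, 'D': name2, 'Score': score})

results = sample_data.groupby('MOBILE NUMBER').apply(apply_model)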
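The body of apply_model does not depend on Spark, so it can be exercised on a plain pandas DataFrame first. The following is a sketch under stated assumptions: `pair_scores` and `placeholder_score` are hypothetical names, the name is selected by column label rather than by position 4, and the placeholder scorer (character-set overlap) is not nameMatching.

```python
from itertools import combinations

import pandas as pd


def placeholder_score(a, b):
    # Assumption: stands in for nameMatching; Jaccard overlap of the characters.
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def pair_scores(group_pd, name_col="name", scorer=placeholder_score):
    # One output row per unordered pair of names in the group.
    names = group_pd[name_col].tolist()
    rows = [
        (n1, n2, float(scorer(list(n1), list(n2))))
        for n1, n2 in combinations(names, 2)
    ]
    # A group with a single row yields an empty frame with the same columns.
    return pd.DataFrame(rows, columns=["S", "D", "Score"])


sample_pd = pd.DataFrame(
    {"phone": [1234, 1234, 1234], "name": ["sultan", "multan", "john"]}
)
result = pair_scores(sample_pd)
print(result)
```

If this runs cleanly on a small frame, the same body can be placed inside the grouped map UDF, which helps separate errors in the scoring function from errors in the Spark wiring.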
I get the following error:
File "<command-1928005850465337>", line 19, in apply_model
File "<command-1928005850465317>", line 28, in nameMatching
IndexError: list assignment index out of range
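The traceback ends inside nameMatching (line 28), not in the UDF wrapper. Since nameMatching is not shown, the following only illustrates what generally raises this message in Python: assigning to a list index that does not exist yet. The function names are hypothetical.

```python
def fill_by_index(chars):
    out = []
    for i, ch in enumerate(chars):
        out[i] = ch  # out is empty, so index i does not exist yet
    return out


def fill_by_append(chars):
    out = []
    for ch in chars:
        out.append(ch)  # grows the list instead of assigning by index
    return out


def fill_preallocated(chars):
    out = [None] * len(chars)  # reserve every slot before assigning
    for i, ch in enumerate(chars):
        out[i] = ch
    return out


try:
    fill_by_index(list("john"))
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
print(fill_by_append(list("john")))
print(fill_preallocated(list("john")))
```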