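I am reading Programming Collective Intelligence and writing some of the code in a more Pythonic way than the book does, purely as a learning exercise.
The first chapter is about recommender systems. It introduces several similarity measures based on the following dictionary.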
critics = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                  'Just My Luck': 3.0, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                     'Just My Luck': 1.5, 'Superman Returns': 5.0,
                     'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
    'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
                         'Superman Returns': 3.5, 'The Night Listener': 4.0},
    'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
                     'The Night Listener': 4.5, 'Superman Returns': 4.0,
                     'You, Me and Dupree': 2.5},
    'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                     'Just My Luck': 2.0, 'Superman Returns': 3.0,
                     'The Night Listener': 3.0, 'You, Me and Dupree': 2.0},
    'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                      'The Night Listener': 3.0, 'Superman Returns': 5.0,
                      'You, Me and Dupree': 3.5},
    'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0,
             'Superman Returns': 4.0},
}
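Here unique_pairs is a list of tuples containing every distinct pair of people:
import itertools

# people is assumed to hold the critics' names, e.g. list(critics)
unique_pairs = list(itertools.combinations(people, 2))
unique_pairs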
[('Michael Phillips', 'Mick LaSalle'),
('Michael Phillips', 'Lisa Rose'),
('Michael Phillips', 'Toby'),
('Michael Phillips', 'Jack Matthews'),
('Michael Phillips', 'Gene Seymour'),
('Michael Phillips', 'Claudia Puig'),
('Mick LaSalle', 'Lisa Rose'),
('Mick LaSalle', 'Toby'),
('Mick LaSalle', 'Jack Matthews'),
('Mick LaSalle', 'Gene Seymour'),
('Mick LaSalle', 'Claudia Puig'),
('Lisa Rose', 'Toby'),
('Lisa Rose', 'Jack Matthews'),
('Lisa Rose', 'Gene Seymour'),
('Lisa Rose', 'Claudia Puig'),
('Toby', 'Jack Matthews'),
('Toby', 'Gene Seymour'),
('Toby', 'Claudia Puig'),
('Jack Matthews', 'Gene Seymour'),
('Jack Matthews', 'Claudia Puig'),
('Gene Seymour', 'Claudia Puig')]
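As a quick sanity check on the listing above, here is a minimal sketch of how the pairs can be built. It assumes `people` is simply a list of the seven names from the `critics` dictionary; the original definition of `people` is not shown, and the order of the pairs depends on the order of that list, so it may differ from the listing above.

```python
import itertools

# Assumption: people is the list of critic names (the keys of critics).
people = ['Lisa Rose', 'Gene Seymour', 'Michael Phillips', 'Claudia Puig',
          'Mick LaSalle', 'Jack Matthews', 'Toby']

# combinations() yields each unordered pair exactly once: C(7, 2) = 21 pairs.
unique_pairs = list(itertools.combinations(people, 2))
print(len(unique_pairs))  # 21
```

Because `combinations` never yields both `(a, b)` and `(b, a)`, each pair of critics is compared only once.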
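I tried to improve the Pearson correlation similarity function from the book by adding the p-value to the result, which should only be returned when the function's p_value argument is true. The function is defined like this: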
import scipy.stats

def sim_pearson(prefs, p1, p2, p_value=False):
    """Returns the pearson correlation coefficient and the p-value (optional)
    of the ratings of the movies that both p1 and p2 have rated"""
    # Creates a list with the movies that both p1 and p2 have rated
    movies = [movie for movie in prefs[p1] if movie in prefs[p2]]
    # List of the scores that both p1 and p2 have given to the movies in common
    scores_p1 = [prefs[p1][movie] for movie in movies]
    scores_p2 = [prefs[p2][movie] for movie in movies]
    corr, p_value = scipy.stats.pearsonr(scores_p1, scores_p2)
    if p_value:
        return (corr, p_value)
    else:
        return corr
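For reference, the coefficient itself can be reproduced by hand. The following is only a sketch using the textbook formula (covariance divided by the product of the standard deviations), not the SciPy implementation; the helper name `pearson` is made up for illustration, it does not compute a p-value, and it will fail with a division by zero if either list is constant. The two rating lists are Lisa Rose's and Gene Seymour's scores for their six shared movies, in the same movie order.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (no p-value)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Movie order: Lady in the Water, Snakes on a Plane, Just My Luck,
# Superman Returns, You Me and Dupree, The Night Listener.
lisa = [2.5, 3.5, 3.0, 3.5, 2.5, 3.0]
gene = [3.0, 3.5, 1.5, 5.0, 3.5, 3.0]
print(round(pearson(lisa, gene), 6))  # 0.396059
```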
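My problem is that the function does not work as expected: when p_value is True it does not always return the tuple (correlation coefficient, p-value), and it produces exactly the same results whether p_value is True or False. Why does this happen, and how can I fix it?
Here is a list with the results of applying the function to every possible pair of people, to show what I mean. The results are identical for p_value=True and p_value=False, so I am only pasting the former.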
pearson_results = [(pair[0][:5],
pair[1][:5],
sim_pearson(critics, pair[0], pair[1], p_value=True))
for pair in unique_pairs]
pearson_results
[('Micha', 'Mick ', (-0.2581988897471611, 0.74180111025283857)),
('Micha', 'Lisa ', (0.40451991747794525, 0.59548008252205464)),
('Micha', 'Toby', -1.0),
('Micha', 'Jack ', (0.13483997249264842, 0.8651600275073511)),
('Micha', 'Gene ', (0.20459830184114206, 0.79540169815885797)),
('Micha', 'Claud', 1.0),
('Mick ', 'Lisa ', (0.59408852578600457, 0.21370636293028805)),
('Mick ', 'Toby', (0.92447345164190498, 0.24901011701138964)),
('Mick ', 'Jack ', (0.21128856368212914, 0.73299431171284912)),
('Mick ', 'Gene ', (0.41176470588235292, 0.41726032973743138)),
('Mick ', 'Claud', (0.56694670951384085, 0.3189317919127756)),
('Lisa ', 'Toby', (0.99124070716193036, 0.084323216321943714)),
('Lisa ', 'Jack ', (0.74701788083399601, 0.14681146067336839)),
('Lisa ', 'Gene ', (0.39605901719066977, 0.43697492654267506)),
('Lisa ', 'Claud', (0.56694670951384085, 0.3189317919127756)),
('Toby', 'Jack ', (0.66284898035987017, 0.53869426797895403)),
('Toby', 'Gene ', (0.38124642583151169, 0.75098988298861025)),
('Toby', 'Claud', (0.89340514744156441, 0.29661883133160016)),
('Jack ', 'Gene ', (0.96379568187563314, 0.0082243534847899202)),
('Jack ', 'Claud', (0.028571428571428571, 0.9714285714285712)),
('Gene ', 'Claud', (0.31497039417435602, 0.60570041941160946))]
Answer 0 (score: 0)
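The line corr, p_value = scipy.stats.pearsonr(...) overwrites the p_value argument with the numeric p-value, so if p_value: tests that float instead of your flag. A float is truthy unless it is exactly 0.0, which is apparently what pearsonr reported for the pairs whose coefficient is exactly 1.0 or -1.0 in your output; those are the only ones that came back without a tuple. Store the p-value under a different name by changing the bottom of the function to:
    corr, p_value2 = scipy.stats.pearsonr(scores_p1, scores_p2)
    if p_value:
        return (corr, p_value2)
    else:
        return corr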
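To see the effect in isolation, here is a small sketch that runs without SciPy. `stub_pearsonr` is a hypothetical stand-in for `scipy.stats.pearsonr` that just hands back a `(corr, p)` pair you give it; `buggy` and `fixed` are illustrative names that mirror only the last lines of the original and corrected function.

```python
def stub_pearsonr(result):
    # Hypothetical stand-in for scipy.stats.pearsonr: returns a (corr, p) pair.
    return result

def buggy(result, p_value=False):
    corr, p_value = stub_pearsonr(result)  # overwrites the boolean flag
    if p_value:  # now tests the float p-value, not the caller's flag
        return (corr, p_value)
    return corr

def fixed(result, p_value=False):
    corr, p = stub_pearsonr(result)  # separate name keeps the flag intact
    if p_value:
        return (corr, p)
    return corr

print(buggy((0.5, 0.2), p_value=False))  # (0.5, 0.2)  flag ignored
print(buggy((1.0, 0.0), p_value=True))   # 1.0         p == 0.0 is falsy
print(fixed((0.5, 0.2), p_value=False))  # 0.5
print(fixed((1.0, 0.0), p_value=True))   # (1.0, 0.0)
```

Renaming the flag instead (for example to something like `return_p_value`) would avoid the name clash just as well; either way the flag and the numeric p-value must not share a name.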