I have a pandas Series like this:
(['StartGame', 'TutorialEnded', 'FBConnect',
'StartGame', 'Sale', 'FBConnect', 'InviteSent',
'StartGame', 'Finish_1', 'Sale', 'Bought',
'Finish_22', 'FBConnect', 'Finish_2',
'TutorialEnded', 'Finish_18', ...])
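I want to plot the distance between values containing the string Finish and occurrences of the value Sale, to see whether there is any correlation between the two, and also to check which other values tend to occur near Sale. In other words, can I use the appearance of any value in the Series to predict a nearby occurrence of Sale? Even a scatter plot in which each value gets a different color, so that I can get a feel for it, would help, but I don't know how to do it.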
Answer 0 (score: 1)
import re

import numpy as np
import pandas as pd

df = pd.DataFrame(['StartGame', 'TutorialEnded', 'FBConnect',
                   'StartGame', 'Sale', 'FBConnect', 'InviteSent',
                   'StartGame', 'Finish_1', 'Sale', 'Bought',
                   'Finish_22', 'FBConnect', 'Finish_2',
                   'TutorialEnded', 'Finish_18'], columns=['Value'])
# Turn the default integer index into an explicit 'position' column.
df.index.name = 'position'
df.reset_index(inplace=True)

def isFinish(x):
    """Returns True if Value contains 'Finish', False otherwise."""
    return bool(re.search(r'Finish', x['Value']))

def isSale(x):
    """Returns True if Value contains 'Sale', False otherwise."""
    return bool(re.search(r'Sale', x['Value']))

df['Finish'] = df.apply(isFinish, axis=1)
df['Sale'] = df.apply(isSale, axis=1)
# Running count of the Finish rows seen so far.
df['FinishCount'] = df.Finish.cumsum()

def cumargmax(x):
    """Get the position of the latest Finish row at or before this row."""
    if x['FinishCount'] == 0:
        return np.nan
    else:
        # idxmax returns the first label at which the running count reaches
        # its maximum, which is the most recent Finish row.
        return df.FinishCount.loc[:x['position']].idxmax()

# Number of rows since the most recent Finish (NaN before the first one).
df['Distance'] = df.position - df.apply(cumargmax, axis=1)
print(df)
    position          Value  Finish   Sale  FinishCount  Distance
0          0      StartGame   False  False            0       NaN
1          1  TutorialEnded   False  False            0       NaN
2          2      FBConnect   False  False            0       NaN
3          3      StartGame   False  False            0       NaN
4          4           Sale   False   True            0       NaN
5          5      FBConnect   False  False            0       NaN
6          6     InviteSent   False  False            0       NaN
7          7      StartGame   False  False            0       NaN
8          8       Finish_1    True  False            1       0.0
9          9           Sale   False   True            1       1.0
10        10         Bought   False  False            1       2.0
11        11      Finish_22    True  False            2       0.0
12        12      FBConnect   False  False            2       1.0
13        13       Finish_2    True  False            3       0.0
14        14  TutorialEnded   False  False            3       1.0
15        15      Finish_18    True  False            4       0.0
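The same Distance column can probably be computed without row-wise apply. The following is only a sketch of that idea, not part of the original answer: it uses str.contains to flag the rows and a forward fill to carry the position of the last Finish forward. The names vec and last_finish are made up for illustration.

```python
import pandas as pd

vec = pd.DataFrame({'Value': ['StartGame', 'TutorialEnded', 'FBConnect',
                              'StartGame', 'Sale', 'FBConnect', 'InviteSent',
                              'StartGame', 'Finish_1', 'Sale', 'Bought',
                              'Finish_22', 'FBConnect', 'Finish_2',
                              'TutorialEnded', 'Finish_18']})
vec['position'] = range(len(vec))
vec['Finish'] = vec['Value'].str.contains('Finish')
vec['Sale'] = vec['Value'].str.contains('Sale')

# Keep the position only on Finish rows, then carry it forward so every row
# knows where the most recent Finish was (NaN before the first Finish).
last_finish = vec['position'].where(vec['Finish']).ffill()
vec['Distance'] = vec['position'] - last_finish

print(vec)
```

On the sample data this should give the same Distance values as the apply-based version, and it avoids re-slicing the frame for every row.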
Or take just the subset of rows of df where a Sale occurs:
print(df[df.Sale])
   position Value  Finish  Sale  FinishCount  Distance
4         4  Sale   False  True            0       NaN
9         9  Sale   False  True            1       1.0
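For the broader question (whether any value tends to appear near a Sale), one possible starting point is to measure, for every row, the distance to the nearest Sale in either direction and then average it per event type. This is a rough sketch rather than the answerer's method: the names nearest_sale, Kind and summary are invented here, stripping a trailing _<number> to group the Finish_N values is an assumption about the naming scheme, and the code assumes the Series contains at least one Sale.

```python
import numpy as np
import pandas as pd

events = pd.Series(['StartGame', 'TutorialEnded', 'FBConnect',
                    'StartGame', 'Sale', 'FBConnect', 'InviteSent',
                    'StartGame', 'Finish_1', 'Sale', 'Bought',
                    'Finish_22', 'FBConnect', 'Finish_2',
                    'TutorialEnded', 'Finish_18'])

positions = np.arange(len(events))
sale_pos = positions[events.str.contains('Sale').to_numpy()]

# Absolute distance from every row to every Sale, then the minimum per row.
nearest_sale = np.abs(positions[:, None] - sale_pos[None, :]).min(axis=1)

summary = (
    pd.DataFrame({'Value': events, 'NearestSale': nearest_sale})
    .query("Value != 'Sale'")
    # Assumption: 'Finish_1', 'Finish_22', ... all belong to one kind, 'Finish'.
    .assign(Kind=lambda d: d['Value'].str.replace(r'_\d+$', '', regex=True))
    .groupby('Kind')['NearestSale']
    .agg(['count', 'mean'])
    .sort_values('mean')
)
print(summary)
```

Event kinds with a small mean distance are the ones that tend to sit close to a Sale. With only 16 rows this says very little about real correlation, but the per-row distances in nearest_sale are also the kind of data one could feed into a scatter plot or histogram colored by event kind.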