I expected both of these to give the same answer:
import pandas as pd
train = pd.read_csv('https://raw.github.com/mattdelhey/kaggle-titanic/master/Data/train.csv')
train.name.str.contains('Mr.').sum()
(train.name.str.find('Mr.')>0).sum()
But the output is:
647
517
Why are the results different?
Answer 0 (score: 1)
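The difference is that str.contains also matches Mrs., because . is a special regex character (it matches any single character) and str.contains treats its pattern as a regular expression by default.
I think you need to escape the dot or pass the parameter regex=False: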
print(train.name.str.contains(r'Mr\.').sum())
517
print(train.name.str.contains('Mr.', regex=False).sum())
517
print((train.name.str.find('Mr.')>0).sum())
517
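The effect of the unescaped dot can be reproduced without downloading the dataset. The following is a minimal sketch; the three names in `names` are a made-up sample chosen for illustration, not rows taken from train.csv, and `re.escape` is shown only as one additional way to get a literal match.

```python
import re
import pandas as pd

# Hypothetical sample names (assumption: not taken from the Titanic data)
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
])

# Default regex=True: '.' matches any character, so 'Mrs' also matches
regex_hits = names.str.contains("Mr.")

# Escaped dot in a raw string: only a literal 'Mr.' matches
escaped_hits = names.str.contains(r"Mr\.")

# regex=False: the pattern is treated as a plain substring
literal_hits = names.str.contains("Mr.", regex=False)

# re.escape builds the escaped pattern for you
auto_escaped_hits = names.str.contains(re.escape("Mr."))

print(regex_hits.tolist())
print(escaped_hits.tolist())
print(literal_hits.tolist())
print(auto_escaped_hits.tolist())
```

Only the first call flags the "Mrs." row; the other three agree with each other.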
Checking the difference:
a = train.loc[train.name.str.contains('Mr.'), 'name']
b = train.loc[(train.name.str.find('Mr.')>0), 'name']
c = pd.concat([a, b], axis=1, keys=('contains','find'))
c = c[c.isnull().any(axis=1)]
print(c)
contains find
1 Cumings, Mrs. John Bradley (Florence Briggs Th... NaN
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) NaN
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) NaN
9 Nasser, Mrs. Nicholas (Adele Achem) NaN
15 Hewlett, Mrs. (Mary D Kingcome) NaN
18 Vander Planke, Mrs. Julius (Emelia Maria Vande... NaN
19 Masselmani, Mrs. Fatima NaN
25 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... NaN
31 Spencer, Mrs. William Augustus (Marie Eugenie) NaN
40 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) NaN
41 Turpin, Mrs. William John Robert (Dorothy Ann ... NaN
49 Arnold-Franchi, Mrs. Josef (Josefine Franchi) NaN
52 Harper, Mrs. Henry Sleeper (Myna Haxtun) NaN
53 Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkin... NaN
66 Nye, Mrs. (Elizabeth Ramell) NaN
85 Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu... NaN
...
...
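One caveat on the `str.find('Mr.') > 0` comparison used above: `str.find` returns -1 when the substring is absent and 0 when it sits at the very start of the string, so `> 0` would skip a name that begins with 'Mr.'. The names listed above all follow a "Surname, Title. ..." layout, so this probably does not change the count here, but `>= 0` is the safer test in general. A small sketch, again with made-up sample names:

```python
import pandas as pd

# Hypothetical sample names (assumption: not taken from the Titanic data)
names = pd.Series([
    "Mr. Owen Harris",          # match at position 0
    "Braund, Mr. Owen Harris",  # match later in the string
    "Heikkinen, Miss. Laina",   # no match
])

# str.find returns the index of the first match, or -1 if there is none
positions = names.str.find("Mr.")
print(positions.tolist())

# '> 0' misses the match at position 0; '>= 0' counts every match
count_gt_zero = int((positions > 0).sum())
count_ge_zero = int((positions >= 0).sum())
print(count_gt_zero, count_ge_zero)
```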