I have a DataFrame of airplane crash data.
In []: df = pd.read_csv('Airplane_Crashes_and_Fatalities_Since_1908.csv')
In []: df.info()
In []: df.head()
Out []:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
Date 5268 non-null object
Time 3049 non-null object
Location 5248 non-null object
Operator 5250 non-null object
Flight # 1069 non-null object
Route 3562 non-null object
Type 5241 non-null object
Registration 4933 non-null object
cn/In 4040 non-null object
Aboard 5246 non-null float64
Fatalities 5256 non-null float64
Ground 5246 non-null float64
Summary 4878 non-null object
dtypes: float64(3), object(10)
memory usage: 535.1+ KB
Out []:
Date Time Location \
0 09/17/1908 17:18 Fort Myer, Virginia
1 07/12/1912 06:30 AtlantiCity, New Jersey
2 08/06/1913 NaN Victoria, British Columbia, Canada
3 09/09/1913 18:30 Over the North Sea
4 10/17/1913 10:30 Near Johannisthal, Germany
Operator Flight # Route Type \
0 Military - U.S. Army NaN Demonstration Wright Flyer III
1 Military - U.S. Navy NaN Test flight Dirigible
2 Private - NaN Curtiss seaplane
3 Military - German Navy NaN NaN Zeppelin L-1 (airship)
4 Military - German Navy NaN NaN Zeppelin L-2 (airship)
Registration cn/In Aboard Fatalities Ground \
0 NaN 1 2.0 1.0 0.0
1 NaN NaN 5.0 5.0 0.0
2 NaN NaN 1.0 1.0 0.0
3 NaN NaN 20.0 14.0 0.0
4 NaN NaN 30.0 30.0 0.0
Summary
0 During a demonstration flight, a U.S. Army fly...
1 First U.S. dirigible Akron exploded just offsh...
2 The first fatal airplane accident in Canada oc...
3 The airship flew into a thunderstorm and encou...
4 Hydrogen gas which was being vented was sucked...
I want to categorize the 'Operator' column and create a new column containing the plane type. I tried using .apply() with a regular expression:
import re

def plane_type(plane):
    m = re.search(r'\w*Military', plane)
    p = re.search(r'\w*Private', plane)
    if m:
        return 'Military'
    elif p:
        return 'Private'
    else:
        return 'Passengers'

df['plane_type'] = df['Operator'].apply(plane_type)
I also tried a lambda:
df['plane_type'] = df['Operator'].apply(lambda x: plane_type(x))
Each time I end up with a TypeError:
TypeError: expected string or buffer
Could someone tell me what I am missing?
Answer 0 (score: 0)
I think you can first use str.extract (the ^ anchor matches only at the start of the string) and then fillna to replace the missing values:
df['plane_type'] = df.Operator.str.extract('(^Military|^Private)', expand=False)
df['plane_type'] = df['plane_type'].fillna('Passengers')
Sample:
df = pd.DataFrame({'Operator':['Military - U.S. Navy','Private',
'Another Military - German', 'Other']})
print (df)
Operator
0 Military - U.S. Navy
1 Private
2 Another Military - German
3 Other
df['plane_type'] = df.Operator.str.extract('(^Military|^Private)', expand=False)
df['plane_type'] = df['plane_type'].fillna('Passengers')
print (df)
Operator plane_type
0 Military - U.S. Navy Military
1 Private Private
2 Another Military - German Passengers
3 Other Passengers
Also, if you need to match the keywords anywhere in the string, omit the ^:
df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False)
df['plane_type'] = df['plane_type'].fillna('Passengers')
print (df)
Operator plane_type
0 Military - U.S. Navy Military
1 Private Private
2 Another Military - German Military
3 Other Passengers
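As for the error itself: the df.info() output in the question shows 5250 non-null values in Operator out of 5268 rows, so the column most likely contains NaN, which pandas stores as a float. re.search accepts only strings, so calling it on those rows would raise a TypeError. A minimal sketch of guarding the original function, using a made-up four-row frame and a hypothetical helper named plane_type_safe (neither is from the question):

```python
import re

import numpy as np
import pandas as pd

# Made-up sample with one missing operator, mimicking the real data.
df = pd.DataFrame({'Operator': ['Military - U.S. Navy', 'Private', np.nan, 'Other']})


def plane_type_safe(plane):
    """Hypothetical NaN-tolerant variant of the question's plane_type()."""
    # Missing values arrive as float NaN, which re.search() cannot handle.
    if not isinstance(plane, str):
        return 'Passengers'
    if re.search(r'Military', plane):
        return 'Military'
    if re.search(r'Private', plane):
        return 'Private'
    return 'Passengers'


# Passing NaN straight to re.search() reproduces the kind of error in the question.
try:
    re.search(r'\w*Military', np.nan)
    raised = False
except TypeError:
    raised = True

df['plane_type'] = df['Operator'].apply(plane_type_safe)
print(raised)
print(df)
```

Mapping a missing operator to 'Passengers' is an assumption here; it mirrors what fillna('Passengers') does in the str.extract version.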
Timings:
apply is slower:
#400k rows
In [80]: %timeit df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers')
1 loop, best of 3: 711 ms per loop
In [81]: %timeit df['plane_type1'] = df['Operator'].astype(str).apply(plane_type)
1 loop, best of 3: 1.69 s per loop
#6k rows
In [84]: %timeit df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers')
100 loops, best of 3: 10.8 ms per loop
In [85]: %timeit df['plane_type1'] = df['Operator'].astype(str).apply(plane_type)
10 loops, best of 3: 25.8 ms per loop
Code for timings:
import re
import pandas as pd

df = pd.DataFrame({'Operator':['Military - U.S. Navy','Private','Another Military - German', 'Other']})
df = pd.concat([df]*100000).reset_index(drop=True)
#[400000 rows x 1 columns]
#print (df)
df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers')
#print (df)

def plane_type(plane):
    m = re.search(r'\w*Military', plane)
    p = re.search(r'\w*Private', plane)
    if m:
        return 'Military'
    elif p:
        return 'Private'
    else:
        return 'Passengers'

df['plane_type1'] = df['Operator'].astype(str).apply(plane_type)
print (df)
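The %timeit measurements above need IPython. A rough sketch of the same comparison with the standard-library timeit module, run on a smaller 40,000-row copy of the sample frame so that it finishes quickly. The helper names via_extract and via_apply are made up for this illustration, and absolute numbers will vary with the machine and the pandas version:

```python
import re
import timeit

import pandas as pd

# Same four sample operators as above, repeated to 40,000 rows.
df = pd.DataFrame({'Operator': ['Military - U.S. Navy', 'Private',
                                'Another Military - German', 'Other']})
df = pd.concat([df] * 10000).reset_index(drop=True)


def plane_type(plane):
    if re.search(r'\w*Military', plane):
        return 'Military'
    elif re.search(r'\w*Private', plane):
        return 'Private'
    return 'Passengers'


def via_extract():
    # Vectorized string method followed by fillna for non-matching rows.
    return df['Operator'].str.extract('(Military|Private)', expand=False).fillna('Passengers')


def via_apply():
    # Row-by-row Python function; astype(str) guards against NaN.
    return df['Operator'].astype(str).apply(plane_type)


# Best of three single runs for each approach.
t_extract = min(timeit.repeat(via_extract, number=1, repeat=3))
t_apply = min(timeit.repeat(via_apply, number=1, repeat=3))
print('extract: %.4f s, apply: %.4f s' % (t_extract, t_apply))
```

Both helpers return the same labels for this sample, so only the run time differs.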