pandas: trying to use the apply method with a regular expression on a column

Time: 2016-12-13 19:41:55

Tags: python pandas

So, I have a DataFrame of airplane crash data.

In []: df = pd.read_csv('Airplane_Crashes_and_Fatalities_Since_1908.csv')
In []: df.info()
In []: df.head()

Out []: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
Date            5268 non-null object
Time            3049 non-null object
Location        5248 non-null object
Operator        5250 non-null object
Flight #        1069 non-null object
Route           3562 non-null object
Type            5241 non-null object
Registration    4933 non-null object
cn/In           4040 non-null object
Aboard          5246 non-null float64
Fatalities      5256 non-null float64
Ground          5246 non-null float64
Summary         4878 non-null object
dtypes: float64(3), object(10)
memory usage: 535.1+ KB
Out []:
         Date   Time                            Location  \
0  09/17/1908  17:18                 Fort Myer, Virginia   
1  07/12/1912  06:30             AtlantiCity, New Jersey   
2  08/06/1913    NaN  Victoria, British Columbia, Canada   
3  09/09/1913  18:30                  Over the North Sea   
4  10/17/1913  10:30          Near Johannisthal, Germany   

             Operator      Flight #          Route                    Type  \
0    Military - U.S. Army      NaN  Demonstration        Wright Flyer III   
1    Military - U.S. Navy      NaN    Test flight               Dirigible   
2                 Private        -            NaN        Curtiss seaplane   
3  Military - German Navy      NaN            NaN  Zeppelin L-1 (airship)   
4  Military - German Navy      NaN            NaN  Zeppelin L-2 (airship)   

   Registration cn/In     Aboard  Fatalities  Ground  \
0          NaN     1     2.0         1.0     0.0   
1          NaN   NaN     5.0         5.0     0.0   
2          NaN   NaN     1.0         1.0     0.0   
3          NaN   NaN    20.0        14.0     0.0   
4          NaN   NaN    30.0        30.0     0.0   

                                         Summary  
0  During a demonstration flight, a U.S. Army fly...  
1  First U.S. dirigible Akron exploded just offsh...  
2  The first fatal airplane accident in Canada oc...  
3  The airship flew into a thunderstorm and encou...  
4  Hydrogen gas which was being vented was sucked...    

I want to categorize the 'Operator' column and create a new column containing the plane type. I tried using .apply() with a regular expression:

def plane_type(plane):
   m = re.search(r'\w*Military', plane)
   p = re.search(r'\w*Private', plane)
   if m:
      return 'Military'
   elif p:
      return 'Private'
   else:
      return 'Passengers'

df['plane_type'] = df['Operator'].apply(plane_type)

I also tried using a lambda:

 df['plane_type'] = df['Operator'].apply(lambda x: plane_type(x))

Every time, I end up with a TypeError:

TypeError: expected string or buffer

Could someone please tell me what I am missing?

1 answer:

Answer 0 (score: 0)

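I think you can first use extract (with ^ so that only values at the start of the string are matched), and then use fillna to fill in the missing values: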

df['plane_type'] = df.Operator.str.extract('(^Military|^Private)', expand=False)
df['plane_type'] = df['plane_type'].fillna('Passengers')

Sample:

df = pd.DataFrame({'Operator':['Military - U.S.  Navy','Private',
                               'Another Military - German', 'Other']})
print (df)
                    Operator
0      Military - U.S.  Navy
1                    Private
2  Another Military - German
3                      Other

df['plane_type'] = df.Operator.str.extract('(^Military|^Private)', expand=False)
df['plane_type'] = df['plane_type'].fillna('Passengers')
print (df)
                    Operator  plane_type
0      Military - U.S.  Navy    Military
1                    Private     Private
2  Another Military - German  Passengers
3                      Other  Passengers

Also, if you need to match the keyword anywhere in the string, omit the ^:

df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False)
df['plane_type'] = df['plane_type'].fillna('Passengers')
print (df)
                    Operator  plane_type
0      Military - U.S.  Navy    Military
1                    Private     Private
2  Another Military - German    Military
3                      Other  Passengers
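
As for the TypeError itself: it most likely comes from the missing values in Operator. df.info() reports 5250 non-null values out of 5268 entries, and pandas stores the missing ones as NaN, which is a float, so re.search cannot process it. Below is a minimal sketch of that behaviour, assuming a made-up three-row frame (the data is invented for illustration and is not taken from the real CSV):

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample with one missing Operator value (illustration only).
df = pd.DataFrame({'Operator': ['Military - U.S. Navy', np.nan, 'Private']})

def plane_type(plane):
    # Same idea as the function in the question, using raw-string patterns.
    if re.search(r'Military', plane):
        return 'Military'
    elif re.search(r'Private', plane):
        return 'Private'
    return 'Passengers'

# NaN is a float, so re.search raises TypeError when apply reaches it.
try:
    df['Operator'].apply(plane_type)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# str.extract leaves missing values as NaN instead of raising,
# and fillna then replaces them with the default category.
extracted = df['Operator'].str.extract('(Military|Private)', expand=False)
result = extracted.fillna('Passengers')
print(result.tolist())
```

If you prefer to keep apply, converting the column with astype(str) first (as in the timings below) also avoids the error, because NaN becomes the string 'nan' and falls through to 'Passengers'.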

Timings

apply is slower:

#400k rows
In [80]: %timeit df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers')
1 loop, best of 3: 711 ms per loop

In [81]: %timeit df['plane_type1'] = df['Operator'].astype(str).apply(plane_type)
1 loop, best of 3: 1.69 s per loop
#6k rows
In [84]: %timeit df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers')
100 loops, best of 3: 10.8 ms per loop

In [85]: %timeit df['plane_type1'] = df['Operator'].astype(str).apply(plane_type)
10 loops, best of 3: 25.8 ms per loop

Code for timings:

import re
import pandas as pd

df = pd.DataFrame({'Operator':['Military - U.S.  Navy','Private','Another Military - German', 'Other']})
df = pd.concat([df]*100000).reset_index(drop=True)
#[400000 rows x 1 columns]
#print (df)


df['plane_type'] = df.Operator.str.extract('(Military|Private)', expand=False).fillna('Passengers')
#print (df)

def plane_type(plane):
   m = re.search(r'\w*Military', plane)
   p = re.search(r'\w*Private', plane)
   if m:
      return 'Military'
   elif p:
      return 'Private'
   else:
      return 'Passengers'

df['plane_type1'] = df['Operator'].astype(str).apply(plane_type)
print (df)
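
The %timeit lines above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard library timeit module. The frame size (4000 rows), the repeat count, and the helper names with_extract and with_apply are arbitrary choices for this example, and the numbers you get will depend on your machine and pandas version:

```python
import re
import timeit

import pandas as pd

base = pd.DataFrame({'Operator': ['Military - U.S.  Navy', 'Private',
                                  'Another Military - German', 'Other']})
# Repeat the four sample rows to get a larger frame (size chosen arbitrarily).
df = pd.concat([base] * 1000).reset_index(drop=True)

def plane_type(plane):
    m = re.search(r'\w*Military', plane)
    p = re.search(r'\w*Private', plane)
    if m:
        return 'Military'
    elif p:
        return 'Private'
    else:
        return 'Passengers'

def with_extract():
    # Vectorized string method followed by a default for non-matches.
    return (df['Operator']
            .str.extract('(Military|Private)', expand=False)
            .fillna('Passengers'))

def with_apply():
    # Row-by-row Python function; astype(str) guards against NaN.
    return df['Operator'].astype(str).apply(plane_type)

# Total seconds for 10 runs of each approach.
t_extract = timeit.timeit(with_extract, number=10)
t_apply = timeit.timeit(with_apply, number=10)
print('extract: %.4f s, apply: %.4f s' % (t_extract, t_apply))
```

Both helpers return the same categories on this sample, so the comparison is between two ways of computing the same result.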