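How do I extract only a specific part of a URL in Python and add it as another column for each row of a df?

Asked: 2018-08-29 12:44:34

Tags: python pandas dataframe

I have a df containing users and URLs, as shown below.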

df

User      Url
1         http://www.mycompany.com/Overview/Get
2         http://www.mycompany.com/News
3         http://www.mycompany.com/Accountinfo
4         http://www.mycompany.com/Personalinformation/Index
...

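I want to add another column, page, that contains only the second part of the URL (the first path segment after the domain), so that I end up with this.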

user      url                                                  page
1         http://www.mycompany.com/Overview/Get                Overview
2         http://www.mycompany.com/News                        News
3         http://www.mycompany.com/Accountinfo                 Accountinfo
4         http://www.mycompany.com/Personalinformation/Index   Personalinformation
...

My code below does not work.

slashparts = df['url'].split('/')
df['page'] = slashparts[4]

The error I get:

  AttributeError                            Traceback (most recent call last)
  <ipython-input-23-0350a98a788c> in <module>()
  ----> 1 slashparts = df['request_url'].split('/')
        2 df['page'] = slashparts[1]

  ~\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   4370             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   4371                 return self[name]
  -> 4372             return object.__getattribute__(self, name)
   4373 
   4374     def __setattr__(self, name, value):

 AttributeError: 'Series' object has no attribute 'split'

2 answers:

Answer 0: (score: 3)

Use the pandas text functions through the str accessor. To select the 4th item of each split list, use str[3], because Python counts from 0:

df['page'] = df['Url'].str.split('/').str[3]

Or, if performance matters, use a list comprehension:

df['page'] = [x.split('/')[3] for x in df['Url']]

print (df)
   User                                                Url  \
0     1              http://www.mycompany.com/Overview/Get   
1     2                      http://www.mycompany.com/News   
2     3               http://www.mycompany.com/Accountinfo   
3     4  http://www.mycompany.com/Personalinformation/I...   

                  page  
0             Overview  
1                 News  
2          Accountinfo  
3  Personalinformation  
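
For anyone who wants to run the two approaches side by side, here is a minimal self-contained sketch. The frame is rebuilt from the question's sample rows. The variable names page_str, page_list, with_missing and safe are only illustrative, and the missing-value row is an assumption added to show one difference between the two approaches.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "User": [1, 2, 3, 4],
    "Url": [
        "http://www.mycompany.com/Overview/Get",
        "http://www.mycompany.com/News",
        "http://www.mycompany.com/Accountinfo",
        "http://www.mycompany.com/Personalinformation/Index",
    ],
})

# "http://host/page".split("/") gives ["http:", "", "host", "page"],
# so the page name sits at index 3.
page_str = df["Url"].str.split("/").str[3]

# Plain-Python equivalent of the line above.
page_list = [x.split("/")[3] for x in df["Url"]]

df["page"] = page_str

# Assumed extra case: a missing URL. The .str accessor propagates NaN,
# whereas the list comprehension would raise AttributeError on a float.
with_missing = pd.Series(["http://www.mycompany.com/News", np.nan])
safe = with_missing.str.split("/").str[3]
```

Both approaches give the same values on clean data. The .str version is the more forgiving one when the column may contain missing values.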

Answer 1: (score: 2)

I am trying to be more explicit about handling cases where http may be missing, along with other variations.

pat = r'(?:https?://)?(?:www\.)?(?:\w+\.\w+\/)([^/]*)'
df.assign(page=df.Url.str.extract(pat, expand=False))

   User                                                Url                 page
0     1              http://www.mycompany.com/Overview/Get             Overview
1     2                      http://www.mycompany.com/News                 News
2     3                      www.mycompany.com/Accountinfo          Accountinfo
3     1              http://www.mycompany.com/Overview/Get             Overview
4     2                                 mycompany.com/News                 News
5     3              https://www.mycompany.com/Accountinfo          Accountinfo
6     4  http://www.mycompany.com/Personalinformation/I...  Personalinformation
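
The regex approach can also be tried on its own with a small Series. This is only a sketch: the sample URLs are taken from the output above (with the truncated one written out as in the question), the names urls, pages and no_path are illustrative, and the last case without a path is an assumed extra input. The pattern is written as a raw string so the backslashes reach the regex engine unchanged.

```python
import pandas as pd

urls = pd.Series([
    "http://www.mycompany.com/Overview/Get",
    "www.mycompany.com/Accountinfo",
    "mycompany.com/News",
    "https://www.mycompany.com/Accountinfo",
    "http://www.mycompany.com/Personalinformation/Index",
])

# Optional scheme, optional "www.", then "name.tld/", then capture
# everything up to the next "/" as the page name.
pat = r"(?:https?://)?(?:www\.)?(?:\w+\.\w+/)([^/]*)"

# With a single capture group and expand=False, extract returns a Series.
pages = urls.str.extract(pat, expand=False)

# Assumed extra case: a URL with no "/" after the domain does not match
# the pattern, so the result is NaN instead of an exception.
no_path = pd.Series(["mycompany.com"]).str.extract(pat, expand=False)
```

Since df.assign returns a new frame, assign the result back (for example df = df.assign(page=...)) if the column should be kept.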