I have a df containing users and URLs, as shown below.
df
User Url
1 http://www.mycompany.com/Overview/Get
2 http://www.mycompany.com/News
3 http://www.mycompany.com/Accountinfo
4 http://www.mycompany.com/Personalinformation/Index
...
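I want to add another column, page, that contains only the second part of the URL (the first path segment after the domain), so that I end up with this: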
user url page
1 http://www.mycompany.com/Overview/Get Overview
2 http://www.mycompany.com/News News
3 http://www.mycompany.com/Accountinfo Accountinfo
4 http://www.mycompany.com/Personalinformation/Index Personalinformation
...
My code below does not work.
slashparts = df['url'].split('/')
df['page'] = slashparts[4]
The error I get:
AttributeError Traceback (most recent call last)
<ipython-input-23-0350a98a788c> in <module>()
----> 1 slashparts = df['request_url'].split('/')
2 df['page'] = slashparts[1]
~\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'split'
Answer 0 (score: 3)
Use the pandas text functions through the str accessor. To select the 4th element of each split list, use str[3], because Python counts from 0:
df['page'] = df['Url'].str.split('/').str[3]
Or, if performance matters, use a list comprehension:
df['page'] = [x.split('/')[3] for x in df['Url']]
print(df)
User Url \
0 1 http://www.mycompany.com/Overview/Get
1 2 http://www.mycompany.com/News
2 3 http://www.mycompany.com/Accountinfo
3 4 http://www.mycompany.com/Personalinformation/I...
page
0 Overview
1 News
2 Accountinfo
3 Personalinformation
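For reference, here is a self-contained sketch of the two approaches above. The DataFrame is rebuilt from the sample rows in the question, and the name page_list is only illustrative. One difference that may matter in practice: the str accessor passes missing values through as NaN, while the list comprehension assumes every value is a string with at least four "/"-separated parts and raises an exception otherwise.

```python
import pandas as pd

# Sample data rebuilt from the question; values are illustrative.
df = pd.DataFrame({
    "User": [1, 2, 3, 4],
    "Url": [
        "http://www.mycompany.com/Overview/Get",
        "http://www.mycompany.com/News",
        "http://www.mycompany.com/Accountinfo",
        "http://www.mycompany.com/Personalinformation/Index",
    ],
})

# "http://www.mycompany.com/Overview/Get".split("/") gives
# ['http:', '', 'www.mycompany.com', 'Overview', 'Get'],
# so the page name sits at index 3.
df["page"] = df["Url"].str.split("/").str[3]

# Equivalent list comprehension (hypothetical variable name);
# it assumes every value in the column is a string.
page_list = [x.split("/")[3] for x in df["Url"]]
```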
Answer 1 (score: 2)
I am trying to be more explicit about handling URLs where http may be missing, along with other variations:
pat = r'(?:https?://)?(?:www\.)?(?:\w+\.\w+\/)([^/]*)'
df.assign(page=df.Url.str.extract(pat, expand=False))
User Url page
0 1 http://www.mycompany.com/Overview/Get Overview
1 2 http://www.mycompany.com/News News
2 3 www.mycompany.com/Accountinfo Accountinfo
3 1 http://www.mycompany.com/Overview/Get Overview
4 2 mycompany.com/News News
5 3 https://www.mycompany.com/Accountinfo Accountinfo
6 4 http://www.mycompany.com/Personalinformation/I... Personalinformation
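A runnable sketch of the regex approach, assuming a small made-up DataFrame that mixes the URL variations shown above (the rows and the name result are illustrative, not the answerer's data). The pattern is written as a raw string so the backslashes reach the regex engine unchanged. Rows that do not match the pattern would get NaN in the page column.

```python
import pandas as pd

# Illustrative URLs with and without a scheme or a "www." prefix.
df = pd.DataFrame({
    "User": [1, 2, 3, 4],
    "Url": [
        "http://www.mycompany.com/Overview/Get",
        "mycompany.com/News",
        "https://www.mycompany.com/Accountinfo",
        "www.mycompany.com/Personalinformation/Index",
    ],
})

# Optional scheme, optional "www.", then "name.tld/", then capture
# everything up to the next "/" as the page name.
pat = r"(?:https?://)?(?:www\.)?(?:\w+\.\w+/)([^/]*)"

# With one capture group and expand=False, extract returns a Series.
# assign returns a new DataFrame and leaves df unchanged.
result = df.assign(page=df["Url"].str.extract(pat, expand=False))
```

Because assign does not modify df in place, keep the returned DataFrame (here result) or assign it back to df.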