Question

My CSV (or DataFrame) looks like this:

date          | URLs                                         | Count
-----------------------------------------------------------------------
17-mar-2014   | www.example.com/abcdef&=randstring           | 20
10-mar-2016   | www.example.com/xyzabc                       | 12
14-apr-2015   | www.example.com/abcdef                       | 11
12-mar-2016   | www.example.com/abcdef/randstring            | 30
15-mar-2016   | www.example.com/abcdef                       | 10
17-feb-2016   | www.example.com/xyzabc&=randstring           | 15
17-mar-2016   | www.example.com/abcdef&=someotherrandstring  | 12

I want to clean the 'URLs' column so that www.example.com/abcdef&=randstring or www.example.com/abcdef/randstring becomes www.example.com/abcdef, and likewise for every row.

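I tried the urlparse library, parsing each URL and recombining only urlparse(url).netloc with urlparse(url).path / query / params. However, this turned out to be inefficient, because every URL ends up with a completely different path/query/params.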

Is there a way to solve this with pandas? Any hints or suggestions are much appreciated.

Answer 1

I think this is mostly a regular-expression problem; you can use pandas' apply to transform the column.

import pandas as pd
import re

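def clear_url(origin_url):
    # Keep the host plus the leading run of letters in the path.
    # Dots are escaped so they match a literal '.' rather than any character.
    p = re.compile(r'(www\.example\.com/[a-zA-Z]*)')
    r = p.search(origin_url)
    if r:
        return r.group(1)
    else:
        # No match: leave the URL unchanged
        return origin_url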


d = [
    {'id':1, 'url':'www.example.com/abcdef&=randstring'},
    {'id':2, 'url':'www.example.com/abcdef'},
    {'id':3, 'url':'www.example.com/xyzabc&=randstring'}
]
df = pd.DataFrame(d)

print('origin_df')
print(df)

df['url'] = df['url'].apply(clear_url)
print('new_df')
print(df)

Output:

origin_df
   id                                 url
0   1  www.example.com/abcdef&=randstring
1   2              www.example.com/abcdef
2   3  www.example.com/xyzabc&=randstring
new_df
   id                     url
0   1  www.example.com/abcdef
1   2  www.example.com/abcdef
2   3  www.example.com/xyzabc
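
For comparison, the urlparse route mentioned in the question can be made to work too. The following is only a minimal sketch under some assumptions: it uses Python 3's urllib.parse, the helper name clean_url is hypothetical, and it assumes the part to keep is the first path segment up to any '&'. One detail worth noting is that urlparse only fills in netloc when the URL has a scheme or starts with '//', so scheme-less strings like the ones in the question need a '//' prefix first.

```python
from urllib.parse import urlparse


def clean_url(url):
    """Return host + first path segment, dropping anything after '&'.

    Hypothetical helper for illustration, not part of any library.
    """
    # Without a scheme or leading '//', urlparse puts the whole string in .path
    parsed = urlparse(url if '//' in url else '//' + url)
    # '/abcdef/randstring' -> ['', 'abcdef', 'randstring']
    segments = parsed.path.split('/')
    first_segment = segments[1] if len(segments) > 1 else ''
    # 'abcdef&=randstring' -> 'abcdef'
    first_segment = first_segment.split('&', 1)[0]
    return parsed.netloc + '/' + first_segment


for u in ['www.example.com/abcdef&=randstring',
          'www.example.com/abcdef/randstring',
          'www.example.com/xyzabc']:
    print(clean_url(u))
```

This could be passed to df['url'].apply in the same way as clear_url above. The regex version is shorter, but this one does not hard-code the host name.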

Answer 2

I think you can use a regex with extract. The pattern captures www, a run of letters (a-z, A-Z), .com, and then another run of letters following the /:

print(df.URLs.str.extract(r'(www\.[a-zA-Z]*\.com/[a-zA-Z]*)', expand=False))
0    www.example.com/abcdef
1    www.example.com/xyzabc
2    www.example.com/abcdef
3    www.example.com/abcdef
4    www.example.com/abcdef
5    www.example.com/xyzabc
6    www.example.com/abcdef
Name: URLs, dtype: object
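
To write the result back into the DataFrame, the extracted Series can be assigned to the column. A minimal sketch, assuming the column is named URLs as in the question: str.extract returns NaN for rows where the pattern does not match, so this version falls back to the original value in that case. The last row is a made-up value added only to show that fallback.

```python
import pandas as pd

df = pd.DataFrame({
    'URLs': [
        'www.example.com/abcdef&=randstring',
        'www.example.com/xyzabc',
        'www.example.com/abcdef/randstring',
        'not-a-matching-url',  # hypothetical row, only to show the fallback
    ],
    'Count': [20, 12, 30, 1],
})

# Dots escaped so they match a literal '.'
pattern = r'(www\.[a-zA-Z]*\.com/[a-zA-Z]*)'

# expand=False with a single capture group returns a Series
cleaned = df['URLs'].str.extract(pattern, expand=False)

# Rows without a match are NaN; keep their original value instead
df['URLs'] = cleaned.fillna(df['URLs'])
print(df)
```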
