Question

I have a data frame in which one column contains text data. I want to remove all URL links from the text. For example, the df column looks like this:

user_id      post_title
    1        #hello....world!!https://www.facebook.com
    2        https://www.google.com
    3        https://www.facebook.com

I tried the following, but I get the error "'str' object cannot be interpreted as an integer". How can I fix this?

def replaceURL(post_title):
    post_title = post_title.map(lambda x: re.sub('((www\.[^\s]+)|(https?://[^\s]+))','',str(x)))
    post_title = post_title.str.strip()
    post_title = post_title.map(lambda x: re.sub(r'#([^\s]+)', r'\1','',str(x)))
    return post_title

df['post_title'] = replaceURL(df['post_title'])
df['post_title_length'] = df['post_title'].str.len()
df

The output should have blank values in place of the URL links:

user_id      post_title
    1        #hello....world!!
    2        
    3

Answer 1

Use pandas str.extract:

df1['post_title'] = df1['post_title'].str.extract('(.*)http?')

    user_id post_title
0   1       #hello....world!!
1   2   
2   3
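One thing to watch with str.extract: a row that contains no URL does not match the pattern at all, so it comes back as NaN rather than the original text. Here is a minimal, self-contained sketch of the same idea with a fallback for that case. The fourth row, the slightly stricter pattern `https?://`, and the `extracted` variable are my own additions for illustration, not part of the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'post_title': [
        '#hello....world!!https://www.facebook.com',
        'https://www.google.com',
        'https://www.facebook.com',
        'no link here',  # hypothetical row without a URL
    ],
})

# expand=False returns a Series when the pattern has a single capture group.
# The group captures everything before the last 'http://' or 'https://'.
extracted = df1['post_title'].str.extract(r'(.*)https?://', expand=False)

# Rows with no URL do not match and become NaN, so fall back to the original text.
df1['post_title'] = extracted.fillna(df1['post_title'])
print(df1)
```

Without the fillna step, the last row would be lost as NaN, which may or may not be what you want.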

Note: if there may be text after the URL, I would use two capture groups and combine them. For example:

    user_id post_title
0   1       #hello....world!!https://www.facebook.com
1   2       https://www.google.com
2   3       https://www.facebook.com
3   4       https://facebook.com Hello world


df1['post_title'] = df1['post_title'].str.extract('(.*)http?.*.com?(.*)?').sum(1)

You get

    user_id post_title
0   1       #hello....world!!
1   2   
2   3   
3   4       Hello world

Edit: here is a new example df with both http and https links,

    user_id post_title
0   1   #hello....world!!https://www.facebook.com
1   2   https://www.google.com
2   3   https://www.facebook.com
3   4   https://facebook.com Hello world
4   5   #hello....world!!http://www.facebook.com


df1['post_title'].str.replace('http.*.com', '',regex = True)

Output

0    #hello....world!!
1                     
2                     
3          Hello world
4    #hello....world!!
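As for the error in the question itself: re.sub takes its arguments as (pattern, repl, string, count). The hashtag line passes four positional arguments, so '' is treated as the string and str(x) lands in the count slot, which has to be an integer. That is where "'str' object cannot be interpreted as an integer" comes from. Below is a sketch of the original function with that call fixed. The name `replace_url`, the `URL_PATTERN` constant, and the fourth sample row are my own choices, and the pattern stops at the next whitespace rather than at ".com", so it is not limited to .com addresses.

```python
import re
import pandas as pd

# Matches http(s) URLs and bare www. links, up to the next whitespace character.
URL_PATTERN = r'(?:https?://|www\.)\S+'

def replace_url(post_title: pd.Series) -> pd.Series:
    """Remove URLs, trim whitespace, and drop the '#' in front of hashtags."""
    post_title = post_title.astype(str).str.replace(URL_PATTERN, '', regex=True)
    post_title = post_title.str.strip()
    # Three arguments only: pattern, replacement, string.
    # The original call had an extra '' that pushed str(x) into the count slot.
    return post_title.map(lambda x: re.sub(r'#(\S+)', r'\1', x))

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'post_title': [
        '#hello....world!!https://www.facebook.com',
        'https://www.google.com',
        'https://www.facebook.com',
        'https://facebook.com Hello world',
    ],
})

df['post_title'] = replace_url(df['post_title'])
df['post_title_length'] = df['post_title'].str.len()
print(df)
```

Note that the hashtag step, kept here because it is in the question's code, strips the leading '#', so the first row becomes hello....world!! rather than #hello....world!!. If the '#' should stay, as in the desired output, drop that last re.sub line and return post_title after the strip.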
