Question

I am working on an NLP project, and I want to replace every URL in a text with its domain name to simplify my corpus.

An example of this could be:

Input: Ask questions here https://stackoverflow.com/questions/ask
Output: Ask questions here stackoverflow.com

At the moment I am finding the URLs with the following regular expression:

urls = re.findall(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', text)

Then I iterate over them to extract the domain name:

doms = [re.findall(r'^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)', url) for url in urls]

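Then I simply replace each URL with its domain.

This is not an optimal approach, and I am wondering whether someone has a better solution to this problem!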

Answer 1

You can use re.sub:

import re
s = 'Ask questions here https://stackoverflow.com/questions/ask, new stuff here https://stackoverflow.com/questions/, Final ask https://stackoverflow.com/questions/50565514/find-urls-in-text-and-replace-them-with-their-domain-name mail server here mail.inbox.com/whatever'
# Match a URL (with or without a scheme), then keep only its "<host>.com" part.
new_s = re.sub(r'https?://[\w.]+\.com[\w/\-]+|https?://[\w.]+\.com|[\w.]+\.com/[\w/\-]+', lambda x: re.findall(r'(?<=://)[\w.]+\.com|[\w.]+\.com', x.group())[0], s)

Output:

'Ask questions here stackoverflow.com, new stuff here stackoverflow.com, Final ask stackoverflow.com mail server here mail.inbox.com'
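The pattern above only recognizes .com hosts. As a rough sketch of the same re.sub idea that is not tied to one top-level domain, a single capturing group plus a backreference can stand in for the nested findall. The names URL_RE and replace_urls_with_domain are made up for this illustration, and the pattern assumes every URL carries an http:// or https:// scheme and contains no user:password@ part:

```python
import re

# Hypothetical helper, not part of the answer above.
# Assumption: URLs start with http:// or https:// and have no userinfo part.
# Group 1 captures the host (an optional leading "www." is dropped, as in the
# question's own regex); the rest of the URL is consumed up to whitespace or a comma.
URL_RE = re.compile(r'https?://(?:www\.)?([^\s/:?#,]+)[^\s,]*')

def replace_urls_with_domain(text):
    # r'\1' substitutes each matched URL with its captured host
    return URL_RE.sub(r'\1', text)

print(replace_urls_with_domain(
    'Ask questions here https://stackoverflow.com/questions/ask, '
    'docs at https://www.python.org/doc/'
))
# Ask questions here stackoverflow.com, docs at python.org
```

Scheme-less URLs such as mail.inbox.com/whatever are left untouched by this sketch.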

Answer 2

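You could also match a pattern that starts with http followed by one or more non-whitespace characters, http\S+, to capture each URL. Then parse the URL and return its hostname part: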

import re
from urllib.parse import urlparse

subject = "Ask questions here https://stackoverflow.com/questions/ask and here https://stackoverflow.com/questions/"
print(re.sub(r"http\S+", lambda match: urlparse(match.group()).hostname, subject))

Edit: If a URL in the string can start with either http or www, you could use (?:http|www\.)\S+ instead:

def checkLink(match):
    # urlparse only recognizes the host if the URL has a scheme or a leading '//'
    url = match.group(0)
    if not url.startswith('http'):
        url = '//' + url
    return urlparse(url).hostname
print(re.sub(r"(?:http|www\.)\S+", checkLink, subject))
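One thing to watch with \S+ is that it also swallows punctuation that directly follows a URL, such as the comma in "…/questions/ask, new stuff". A minimal sketch of one way to keep that punctuation follows. The names TRAILING and url_to_hostname are hypothetical, and the set of characters treated as sentence punctuation is an assumption that may need tuning for a given corpus:

```python
import re
from urllib.parse import urlparse

# Assumption: these characters, when they end a match, belong to the
# surrounding sentence rather than to the URL itself.
TRAILING = '.,;:!?)'

def url_to_hostname(match):
    raw = match.group(0)
    stripped = raw.rstrip(TRAILING)   # URL without trailing punctuation
    tail = raw[len(stripped):]        # the punctuation that was cut off
    # urlparse needs a scheme or a leading '//' to recognize the host
    candidate = stripped if stripped.startswith('http') else '//' + stripped
    host = urlparse(candidate).hostname
    # If no hostname could be parsed, leave the original text untouched
    return (host or stripped) + tail

text = 'See https://stackoverflow.com/questions/ask, or www.example.com/page.'
print(re.sub(r'(?:http|www\.)\S+', url_to_hostname, text))
# See stackoverflow.com, or www.example.com.
```

Falling back to the matched text also covers ordinary words that merely begin with "http", for which urlparse finds no hostname.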

Find URLs in text and replace them with their domain name
