Question

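I have been trying to extract the domain from a list of URLs, so that http://supremecosts.com/contact-us/ becomes http://supremecosts.com. I am looking for a clean way to do this that copes with the various gTLDs and ccTLDs.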

Answer 1

You can do this with a regular expression like this:

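import re

text = 'http://supremecosts.com/contact-us/'

# Match the scheme plus everything up to the first ':', '/' or newline.
# A raw string avoids invalid-escape warnings for the regex.
m = re.search(r'(https?://[^:/\n]+)', text)
if m:
    print(m.group(1))  # http://supremecosts.com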

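Since the question is about a list of URLs, here is a rough sketch of the same pattern applied to several inputs. The helper name `strip_path` and the second and third sample strings are made up for illustration.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
ORIGIN_RE = re.compile(r'(https?://[^:/\n]+)')

def strip_path(url):
    """Return scheme + host of url, or None if no http(s) URL is found."""
    m = ORIGIN_RE.search(url)
    return m.group(1) if m else None

urls = [
    'http://supremecosts.com/contact-us/',
    'https://example.co.uk/a/b?c=d',  # hypothetical sample
    'not a url',                      # hypothetical sample
]
print([strip_path(u) for u in urls])
# ['http://supremecosts.com', 'https://example.co.uk', None]
```

Note that the character class stops at ':', so a port such as ':8080' is dropped along with the path.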

Answer 2

This assumes you are using Python 3 and do not want to use a regex for the job.
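For the no-regex route in Python 3, one option is the standard library's `urllib.parse`. A minimal sketch follows; the helper name `origin_of` is made up for illustration, and this is not necessarily the code this answer originally contained.

```python
from urllib.parse import urlsplit

def origin_of(url):
    """Keep only the scheme and network location, e.g. 'http://host'."""
    parts = urlsplit(url)
    # netloc is everything between '//' and the next '/', '?' or '#',
    # so it still includes any port or user info present in the URL.
    return f'{parts.scheme}://{parts.netloc}'

print(origin_of('http://supremecosts.com/contact-us/'))
# http://supremecosts.com
```

Because this only splits on URL syntax, it does not need to know anything about which gTLD or ccTLD is in use.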

Answer 3

You can use a regular expression to extract the domain and subdomain of a URL.

/^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/\n]+)/im

I use this to extract the domain name from a URL. Check whether it works for you.
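The pattern above is written as a JavaScript literal. If you want it in Python, a rough transcription could look like this, with the /im flags becoming `re.IGNORECASE | re.MULTILINE`. The helper name `bare_domain` and the sample URLs other than the one from the question are made up.

```python
import re

# Optional scheme, optional 'user@', optional 'www.', then capture the host.
DOMAIN_RE = re.compile(
    r'^(?:https?://)?(?:[^@/\n]+@)?(?:www\.)?([^:/\n]+)',
    re.IGNORECASE | re.MULTILINE,
)

def bare_domain(url):
    """Return the host without scheme, user info, 'www.' or port."""
    m = DOMAIN_RE.search(url)
    return m.group(1) if m else None

print(bare_domain('http://supremecosts.com/contact-us/'))   # supremecosts.com
print(bare_domain('https://user@sub.example.org:8080/x'))   # sub.example.org
```

Unlike the first answer, this returns the host only, so you would have to put the scheme back on yourself to get http://supremecosts.com.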

A single regular expression can also parse a complete URL and break it down into protocol, domain name, path and query, like the one below.

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

After a match, the parts are available at these positions:

url: RegExp['$&']
protocol: RegExp.$2
domain name: RegExp.$3
path: RegExp.$4
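The RegExp.$N properties are JavaScript. In Python the same positions would be read with `m.group(N)`, and RegExp['$&'] corresponds to `m.group(0)`. A sketch, where the helper name `split_url` and the example.com URL are made up:

```python
import re

# Transcription of the pattern above; group numbers match RegExp.$N.
URL_RE = re.compile(
    r'^((http[s]?|ftp):/)?/?([^:/\s]+)((/\w+)*/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$'
)

def split_url(url):
    """Return the matched parts as a dict, or None if the pattern fails."""
    m = URL_RE.match(url)
    if m is None:
        return None
    return {
        'url': m.group(0),       # RegExp['$&']
        'protocol': m.group(2),  # RegExp.$2
        'domain': m.group(3),    # RegExp.$3
        'path': m.group(4),      # RegExp.$4
    }

parts = split_url('http://www.example.com/dir/sub/file.html?x=1')
print(parts['protocol'], parts['domain'], parts['path'])
# http www.example.com /dir/sub/
```

The pattern expects something after the host, so a bare http://supremecosts.com with no trailing path does not match and the helper returns None.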

Answer 4

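This may be a silly but effective way to do it:
Save the URL in a string and scan it from back to front. As soon as you hit a full stop, scrap everything more than 3 characters past it. I believe URLs have no full stops after the domain name. Correct me if I'm wrong.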
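A literal sketch of that idea, under the answer's own assumptions (a three-letter TLD and no full stop after the domain). The function name `cut_after_tld` and the `tld_len` parameter are made up for illustration.

```python
def cut_after_tld(url, tld_len=3):
    """Cut url right after the tld_len characters following the last '.'."""
    # rfind scans from the end, i.e. the "back to front" step.
    dot = url.rfind('.')
    if dot == -1:
        return url  # no full stop at all: nothing to cut
    # Keep the dot plus the next tld_len characters, drop the rest.
    return url[:dot + 1 + tld_len]

print(cut_after_tld('http://supremecosts.com/contact-us/'))
# http://supremecosts.com
```

To answer the "correct me if I'm wrong": a path such as /page.html does contain a full stop, and ccTLDs like .uk are two letters long, so either case breaks the assumption.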

Extract only the domain name from a URL, removing the path (Python)
