Question

I'm trying to work out the logic behind an operation.

I have a list of URLs, for example:

["https://example1.com", 
"example2.com",
"http://example3.com/subpage",
"http://example4.com/",
"http://example5.com/subpage"]

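I need to extract the first 3 results, but only those that are just a domain. If a URL has a subpage, I want to ignore it.

Any ideas on how to do this? My first thought is to go through the list, drop every entry that has a subpage, and then take the first 3.

But what is the best way to determine whether a URL is domain-only or a subpage?

Thanks a lot for your help!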

Answer 1

You can filter the list and then use list slicing:

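import re
d = ['https://example1.com', 'example2.com', 'http://example3.com/subpage', 'http://example4.com/', 'http://example5.com/subpage']
# Keep entries ending in ".xxx" or ".xxx/" (three lowercase letters), then slice the first 3.
# The pattern is a raw string so that "\." is not treated as a string escape.
new_d = [i for i in d if re.search(r'\.[a-z]{3}$|\.[a-z]{3}/$', i)][:3]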

Output:

['https://example1.com', 'example2.com', 'http://example4.com/']
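
To see which entries the pattern accepts, it can help to check each URL on its own. The sketch below reuses the same regular expression. The last two URLs are hypothetical additions (not in the original list) that show where a match on a three-letter ending can go wrong.

```python
import re

# Same pattern as in the answer, compiled once as a raw string.
PATTERN = re.compile(r'\.[a-z]{3}$|\.[a-z]{3}/$')

urls = [
    'https://example1.com',
    'example2.com',
    'http://example3.com/subpage',
    'http://example4.com/',
    'http://example5.com/subpage',
    'https://example6.io',            # hypothetical: two-letter ending
    'http://example7.com/index.php',  # hypothetical: subpage ending in ".php"
]

# Map each URL to whether the pattern matches it.
results = {url: bool(PATTERN.search(url)) for url in urls}

for url, matched in results.items():
    print(f'{url}: {matched}')
```

The two-letter ending is rejected even though it is a bare domain, and the ".php" subpage is accepted even though it is not, so the pattern is only safe when every domain in the list ends in three lowercase letters and no subpage does.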

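Edit: explanation of the regular expression:

\.: matches a literal "."

[a-z]{3}: matches three lowercase letters after the "."

$: anchors the expression at the end of the string.

|: alternation; the second branch, \.[a-z]{3}/$, is the same pattern but allows a trailing "/".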
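
As an alternative to matching on the ending, one could parse each URL and look at its path. This is a minimal sketch using the standard library's urllib.parse, not the method from the answer above. The helper name is_domain_only is hypothetical, and treating a URL with a query string or fragment as "not domain-only" is an assumption.

```python
from urllib.parse import urlparse

def is_domain_only(url):
    """Return True if the URL has a host and no path beyond an optional '/'."""
    # urlparse only recognises the host when a scheme or '//' is present,
    # so prepend '//' to bare entries such as 'example2.com'.
    if '://' not in url:
        url = '//' + url
    parts = urlparse(url)
    return (
        bool(parts.netloc)
        and parts.path in ('', '/')
        # Assumption: a query string or fragment also counts as "more than a domain".
        and not parts.query
        and not parts.fragment
    )

urls = [
    'https://example1.com',
    'example2.com',
    'http://example3.com/subpage',
    'http://example4.com/',
    'http://example5.com/subpage',
]

# Filter first, then slice, as in the answer above.
first_three = [u for u in urls if is_domain_only(u)][:3]
print(first_three)
```

For the list in the question this selects the same three URLs as the regular expression, and it does not depend on how many letters the domain ending has.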
