Question

I have a lot of URLs of this form:

http://www.example.com/some-text-to-get/jkl/another-text-to-get

I would like to get this:

["some-text-to-get", "another-text-to-get"]

I tried:

re.findall(".*([[a-z]*-[a-z]*]*).*", "http://www.example.com/some-text-to-get/jkl/another-text-to-get")

But it does not work. Any ideas?

Answer 1

You could capture the two parts in two capturing groups:

^https?://[^/]+/([^/]+).*/(.*)$

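That will match:

^ asserts the start of the string
https?:// matches http with an optional s, followed by ://
[^/]+/ matches one or more characters other than a forward slash (negated character class), followed by a forward slash
([^/]+) captures, in group 1, one or more characters other than a forward slash
.* matches any character zero or more times
/ matches a literal slash (this is the last slash in the string, because .* is greedy)
(.*)$ captures, in group 2, any character zero or more times, and $ asserts the end of the string

Your matches are in the first and second capturing groups.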
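
As a minimal sketch of how the pattern above might be applied in Python (the names `pattern`, `url`, `match`, and `parts` are illustrative and not part of the original answer):

```python
import re

# Pattern from the answer: group 1 is the first path segment, group 2 is
# everything after the last slash.
pattern = re.compile(r"^https?://[^/]+/([^/]+).*/(.*)$")
url = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"

match = pattern.match(url)
# match is None if the URL does not fit the pattern, so guard before using it.
parts = list(match.groups()) if match else []
print(parts)
```

The guard on `match` matters when many URLs are processed in a loop, since some of them may not have the expected shape.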

Or you could parse the URL, take the path, split it on / and pick the parts by index:

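from urllib.parse import urlparse  # Python 3; in Python 2 the module was named urlparse

o = urlparse('http://www.example.com/some-text-to-get/jkl/another-text-to-get')
# filter(None, ...) drops the empty string produced by the leading slash;
# list() is needed because filter() returns a lazy iterator in Python 3
parts = list(filter(None, o.path.split('/')))
print(parts[0])  # some-text-to-get
print(parts[2])  # another-text-to-get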

Or, if you want only the parts that contain a -, you could use:

# list() materializes the result, since filter() is lazy in Python 3
parts = list(filter(lambda x: '-' in x, o.path.split('/')))
print(parts)

Answer 2

You can use a lookbehind and a lookahead:

import re
s = 'http://www.example.com/some-text-to-get/jkl/another-text-to-get'
final_result = re.findall(r'(?<=\.\w{3}/)[a-z\-]+|[a-z\-]+(?=$)', s)  # raw string avoids invalid-escape warnings

Output:

['some-text-to-get', 'another-text-to-get']

Answer 3

Given:

>>> s
"http://www.example.com/some-text-to-get/jkl/another-text-to-get"

You can use this regex:

>>> re.findall(r"/([a-z-]+)(?:/|$)", s)
['some-text-to-get', 'another-text-to-get']

Of course, you can also do this with Python string methods and a list comprehension:

>>> [e for e in s.split('/') if '-' in e]
['some-text-to-get', 'another-text-to-get']

Answer 4

You can capture it with the following regex:

((?:[a-z]+-)+[a-z]+)

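[a-z]+ matches one or more lowercase letters
(?:[a-z]+-) is a non-capturing group: one or more lowercase letters followed by a hyphen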
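
The answer gives only the pattern, so here is a short sketch of one way it might be used from Python. The variable names `s` and `hyphenated` are assumptions for illustration:

```python
import re

s = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"

# The pattern has exactly one capturing group, so re.findall returns
# the text of that group for every match.
hyphenated = re.findall(r"((?:[a-z]+-)+[a-z]+)", s)
print(hyphenated)
```

Note that this pattern is not tied to the path portion of the URL, so a hostname containing a hyphen would be matched as well.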

regex: get part of the text from URL data
