Splitting a string in Python based on a regular expression pattern

Time: 2018-11-11 17:41:35

Tags: regex string list python-3.6

I have a bytes object containing URLs:

> body.decode("utf-8") 
> 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'

I need to split it into a list, with each URL as a separate element:

import re
pattern = '^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'

urls = re.compile(pattern).split(body.decode("utf-8"))

What I get is a list with a single element containing all the URLs stuck together:

['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n']

How can I split each URL into a separate element?

1 Answer:

Answer 0: (score: 1)

Try splitting it on \s+ (one or more whitespace characters) instead.
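For context on why the original call returned a single element, here is a minimal sketch using the question's input and pattern (the variable names `text`, `match`, and `parts` are only illustrative). re.split cuts the string where the pattern matches, so it needs a pattern for the separator, not for the URLs themselves; and this particular pattern does not match the multi-line text at all.

```python
import re

# Same input as in the question, already decoded to str.
text = ('https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\n'
        'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n')

# The question's pattern, written as a raw string (same regex).
pattern = r'^(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$'

# Without re.MULTILINE, ^ matches only at the start of the string and $ only
# at the very end (or just before a final newline). Since "." never crosses
# "\n", the pattern cannot span both lines, so there is no match.
match = re.search(pattern, text)
print(match)

# With no match, re.split has nowhere to cut and returns the input unchanged
# as a one-element list.
parts = re.split(pattern, text)
print(len(parts))
```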

Sample Python code that splits on \s+:

import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
# Raw string so that \s reaches the regex engine without an invalid-escape warning
urls = re.compile(r'\s+').split(s)
print(urls)

This outputs:

['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/', '']

Does this result look OK? If not, it can be processed further to fit your requirements.

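If you don't want the empty string ('') in the result list (it appears because the input ends with \r\n), you can use re.findall to find all the URLs in the string instead. Sample Python code for that: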

import re
s = 'https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\nhttps://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n'
# Lazily match from "http" up to (but not including) the next run of whitespace
urls = re.findall(r'http.*?(?=\s+)', s)
print(urls)

This gives the following output:

['https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/', 'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/']
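One caveat: the lookahead in the findall pattern requires whitespace after each URL, so a final URL with no trailing newline would be skipped. Below is a minimal sketch of two alternatives, assuming the URLs never contain whitespace and always start with http:// or https://. The pattern `https?://\S+` and the variable names are illustrative choices, not part of the original answer.

```python
import re

s = ('https://www.wired.com/story/car-news-roundup-tesla-model-3-sales/\r\n\r\n'
     'https://cleantechnica.com/2018/11/11/can-you-still-get-the-7500-tax-credit-on-a-tesla-model-3-maybe-its-complicated/\r\n')

# \S+ consumes everything up to the next whitespace character or the end of
# the string, so no trailing whitespace is required.
urls_findall = re.findall(r'https?://\S+', s)
print(urls_findall)

# str.split() with no arguments splits on runs of whitespace and drops
# empty strings, so no regex is needed if the input holds only URLs.
urls_split = s.split()
print(urls_split)

# Edge case: the same input without the trailing "\r\n".
stripped = s.rstrip()
lookahead_result = re.findall(r'http.*?(?=\s+)', stripped)  # misses the last URL
nonspace_result = re.findall(r'https?://\S+', stripped)     # still finds both
print(len(lookahead_result), len(nonspace_result))
```

The plain s.split() form is the simplest when every whitespace-separated token is known to be a URL; the findall form is the safer choice if other text may be mixed in.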