Regular expression (`regex`) matching

Question

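When I crawl a website to collect article URLs, I get all the <a> tags and read their href attributes. Some links in the resulting list do not point to articles but to categories or other pages on the same domain, so I need to do the following:

Create a pattern for the URLs and match every URL in the link list against it, so that I know whether a given URL is an article URL or not.

An example of the pattern:

Link: "http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html"

Matching pattern: http://www.cnbc.com/(*)/(*)/(*)/(*).html

So the idea is to replace every variable part of the link with (*).

The question is: how do I match a link against the pattern?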

Answer 1


You can use Python's `re` module to do this:

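import re

# Example url
url = 'http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html'
# Create a regex match pattern: escape the literal dots and
# let each group match exactly one path segment (no slashes)
pattern = r'http://www\.cnbc\.com/([^/]+)/([^/]+)/([^/]+)/([^/]+)\.html'
# Find match (m is None when the URL does not fit the pattern)
m = re.match(pattern, url)
# Get Groups
m.groups()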

('2016',
 '03',
 '13',
 'financial-times-china-rebuts-economy-doomsayers-on-debt-and')
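To apply the same idea to a whole list of links, the (*) template from the question could be converted into a regular expression automatically. The following is only a minimal sketch: the helper names `template_to_regex` and `filter_article_urls` are made up for illustration, the second URL in `links` is a hypothetical non-article link, and it assumes each (*) should match exactly one path segment.

```python
import re

def template_to_regex(template):
    """Compile a template such as 'http://host/(*)/(*).html' into a regex."""
    # Escape the literal pieces, then join them with a capturing group
    # that matches one path segment (one or more non-slash characters).
    parts = template.split('(*)')
    return re.compile('([^/]+)'.join(re.escape(p) for p in parts))

def filter_article_urls(urls, template):
    """Return only the URLs that match the whole template."""
    regex = template_to_regex(template)
    return [u for u in urls if regex.fullmatch(u)]

links = [
    'http://www.cnbc.com/2016/03/13/financial-times-china-rebuts-economy-doomsayers-on-debt-and.html',
    'http://www.cnbc.com/world-economy/',  # hypothetical category link
]
template = 'http://www.cnbc.com/(*)/(*)/(*)/(*).html'
articles = filter_article_urls(links, template)
print(articles)
```

Using `fullmatch` rather than `match` means a URL with extra text after `.html` is rejected; whether that is desirable depends on the site being crawled.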
