Question

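I'm working on a web scraper using Python 3.5 and the re module, and one of its functions needs to retrieve the URL of a YouTube channel. I use the following snippet, which contains a regex match, to do this: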

href = re.compile("(/user/|/channel/)(.+)")

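It should return something like /user/username or /channel/channelname. It does this successfully most of the time, but occasionally it grabs a URL that contains extra information, such as /user/username/videos?view=60, or something else after the username/ part.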

To work around this, I rewrote the code above as

href = re.compile("(/user/|/channel/)(?!(videos?view=60)(.+)")

along with other variations, without success. How can I rewrite my code so that it extracts URLs that do not contain videos?view=60 anywhere in the URL?

Answer 1

Use the following regex pattern, which stops capturing at the next slash:

import re

user_url = '/user/username/videos?view=60'
channel_url = '/channel/channelname/videos?view=60'

# [^/]+ matches everything up to (but not including) the next slash
pattern = re.compile(r'(/user/|/channel/)([^/]+)')

m = pattern.match(user_url)
print(m.group())    # /user/username

m = pattern.match(channel_url)
print(m.group())    # /channel/channelname
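One caveat: match() only matches at the start of the string, so an absolute URL would not match. Below is a minimal sketch of how the same pattern could be applied to a list of scraped hrefs with search() instead. The helper name extract_channel_path and the sample hrefs are hypothetical, made up for illustration.

```python
import re

# Same pattern as above; search() is used so absolute URLs match too.
pattern = re.compile(r'(/user/|/channel/)([^/]+)')


def extract_channel_path(href):
    """Return '/user/<name>' or '/channel/<name>', or None if neither is present."""
    m = pattern.search(href)
    return m.group() if m else None


# Hypothetical sample hrefs, not taken from a real page
hrefs = [
    '/user/username',
    '/user/username/videos?view=60',
    'https://www.youtube.com/channel/channelname/about',
    '/watch?v=abc',
]

paths = [extract_channel_path(h) for h in hrefs]
print(paths)
# ['/user/username', '/user/username', '/channel/channelname', None]
```

Note that [^/]+ only stops at a slash, so a query string attached directly to the name (for example /user/username?x=1) would still be captured. Using [^/?#]+ instead is one way to exclude it.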

Answer 2

I used this approach, and it seems to do what you want.

import re

user = '/user/username/videos?view=60'
channel = '/channel/channelname/videos?view=60'

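# \w+ followed by a slash: the match includes the trailing slash,
# so a URL with nothing after the name will not match
pattern = re.compile(r"(/user/|/channel/)[\w]+/")

user_match = re.search(pattern, user)

if user_match:
    print(user_match.group())      # /user/username/
else:
    print("Invalid pattern")

pattern_match = re.search(pattern, channel)

if pattern_match:
    print(pattern_match.group())   # /channel/channelname/
else:
    print("Invalid pattern")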
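
If the goal is instead to skip such URLs entirely, as the question's lookahead attempt suggests, a negative lookahead anchored at the start of the string is one option. This is only a sketch: the name clean_href and the sample hrefs are made up for illustration. Note that the ? in videos?view=60 has to be escaped, since an unescaped ? is a quantifier.

```python
import re

# Hypothetical pattern: reject any href that contains "videos?view=60"
# anywhere, otherwise match a /user/... or /channel/... path.
clean_href = re.compile(r"^(?!.*videos\?view=60)(/user/|/channel/)(.+)")

# Hypothetical sample hrefs
hrefs = [
    "/user/username",
    "/user/username/videos?view=60",
    "/channel/channelname",
]

kept = [h for h in hrefs if clean_href.match(h)]
print(kept)  # ['/user/username', '/channel/channelname']
```

This filters URLs out rather than trimming them, so the channel behind a rejected URL is lost unless it also appears in a clean form elsewhere on the page.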

Hope this helps!

Using a regular expression to find URLs that do not contain specific information
