I wrote the following regular expression to extract the links from a robots.txt file:
re.compile(r"/\S+(?:\/+)")
I get the following results:
/includes/
/modules/
/search/
/?q=user/password/
/?q=user/register/
/node/add/
/logout/
/?q=admin/
/themes/
/?q=node/add/
/admin/
/?q=comment/reply/
/misc/
//example.com/
//example.com/site/
/profiles/
//www.robotstxt.org/wc/
/?q=search/
/user/password/
/?q=logout/
/comment/reply/
/?q=filter/tips/
/?q=user/login/
/user/register/
/user/login/
/scripts/
/filter/tips/
//www.sxw.org.uk/computing/robots/
How can I exclude the links that start with two slashes, i.e.:
//www.sxw.org.uk/computing/robots/
//www.robotstxt.org/wc/
//example.com/
//example.com/site/
Any ideas?
Answer 0 (score: 1)
I suggest simply adding an if condition that skips anything starting with two slashes:
# line holds one extracted link
if not line.startswith('//'):
    # then do something here, e.g. keep the link
    pass
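To show how that check fits around the original pattern, here is a minimal sketch. The helper name extract_paths and the sample text are assumptions made up for illustration; only the regular expression and the startswith test come from the thread.

```python
import re

# The pattern from the question (the escaped slash is written plainly).
LINK_RE = re.compile(r"/\S+(?:/+)")

def extract_paths(text):
    """Return the matched links, skipping any that start with '//'."""
    links = []
    for line in text.splitlines():
        for link in LINK_RE.findall(line):
            # Protocol-relative matches such as //example.com/ are dropped here.
            if not link.startswith("//"):
                links.append(link)
    return links

# Hypothetical robots.txt-style input, for illustration only.
sample = (
    "Disallow: /includes/\n"
    "# see http://www.robotstxt.org/wc/\n"
    "Disallow: /?q=admin/\n"
)
print(extract_paths(sample))  # ['/includes/', '/?q=admin/']
```

The filter runs after matching, so the regular expression itself stays unchanged.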
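Answer 1 (score: 1)
Assuming each string to match sits on its own line, as in the sample, we can anchor the regex to the start of the line and use a negative lookahead: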
^(?!//)/\S+(?:\/+)
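Be sure to set the regex option that makes ^ match at the start of every line (in Python, re.MULTILINE or the inline (?m) flag).
My Python is rusty, but this should do it: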
import re

# subject is the text to search, one link per line
for match in re.finditer(r"(?m)^(?!//)/\S+(?:/+)", subject):
    print(match.group())  # matched text
    # match.start() is the start offset; match.end() is the end offset (exclusive)
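As a quick check of the anchored pattern, the sketch below runs it over a few lines copied from the question's output. The names PATTERN, subject, matches and unanchored are just illustrative; the last call is an added comparison showing why the ^ anchor matters, not something the answer itself ran.

```python
import re

# Anchored pattern: ^ must match at each line start, hence re.MULTILINE.
PATTERN = re.compile(r"^(?!//)/\S+(?:/+)", re.MULTILINE)

# A few lines taken from the question's output, one link per line.
subject = "\n".join([
    "/includes/",
    "//example.com/",
    "/?q=user/password/",
    "//www.robotstxt.org/wc/",
    "/user/login/",
])

matches = [m.group() for m in PATTERN.finditer(subject)]
print(matches)  # ['/includes/', '/?q=user/password/', '/user/login/']

# Without the anchor, the lookahead only shifts the match by one character.
unanchored = re.findall(r"(?!//)/\S+(?:/+)", "//example.com/")
print(unanchored)  # ['/example.com/']
```

The lookahead alone is not enough: without ^, the engine can start matching at the second slash, so the anchor is what actually rejects the double-slash lines.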