I want to replace href="..." with href="abc/...", except when ... starts with http:// or https://.
I've managed the first part, but I can't find a way to detect http:// and https://. Here is the code:
import re
line = '<a href="img/a.html"/>'
print(re.sub(r'href="([^<#][^"]*)"', r'href="abc/\1"', line))
# Correct output: <a href="abc/img/a.html"/>
line = '<a href="http://google.com"/>'
print(re.sub(r'href="([^<#][^"]*)"', r'href="abc/\1"', line))
# Wrong output: <a href="abc/http://google.com"/>
Answer 0 (score: 2)
>>> import re
>>> from bs4 import BeautifulSoup
>>> s = """<a href="img/a.html"/>
<a href="http://google.com"/>"""
>>> soup = BeautifulSoup(s)
>>> for i in soup.select('a'):
...     if re.match(r'(?!https?://)', i['href']):
...         i['href'] = 'abc/' + i['href']
...
>>> print(soup)
<html><body><a href="abc/img/a.html"></a>
<a href="http://google.com"></a></body></html>
OR
A regex isn't actually needed here:
>>> for i in soup.select('a'):
...     # parentheses matter: "not A or B" would still prefix https:// links
...     if not (i['href'].startswith('http://') or i['href'].startswith('https://')):
...         i['href'] = 'abc/' + i['href']
...
>>> print(soup)
<html><body><a href="abc/img/a.html"></a>
<a href="http://google.com"></a></body></html>
OR
>>> for i in soup.select('a'):
...     if not i['href'].startswith(('http://', 'https://')):
...         i['href'] = 'abc/' + i['href']
...
>>> soup
<html><body><a href="abc/img/a.html"></a>
<a href="http://google.com"></a></body></html>
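The tuple form of startswith can also be pulled out into a small helper that works on plain strings, independent of BeautifulSoup. This is only a minimal sketch: the name add_prefix and the default prefix 'abc/' are assumptions chosen for illustration, not part of any library.

```python
def add_prefix(href, prefix='abc/'):
    """Return href with prefix prepended, unless it is an http(s) URL.

    add_prefix and its default prefix are hypothetical names for this sketch.
    """
    # str.startswith accepts a tuple and is True if any entry matches
    if href.startswith(('http://', 'https://')):
        return href
    return prefix + href


print(add_prefix('img/a.html'))
print(add_prefix('http://google.com'))
print(add_prefix('https://google.com'))
```

Inside the loop above, this would be used as i['href'] = add_prefix(i['href']).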
Answer 1 (score: 0)
You can use lookarounds:
>>> line='<a href="img/a.html"/>'
>>> re.sub(r'(?<=href=")(?!https?)',r'abc/', line)
'<a href="abc/img/a.html"/>'
>>> line='<a href="http://google.com"/>'
>>> re.sub(r'(?<=href=")(?!https?)',r'abc/', line)
'<a href="http://google.com"/>'
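(?<=href=") is a positive lookbehind. It asserts that the current position is immediately preceded by href=".
(?!https?) is a negative lookahead. It asserts that the text right after href=" does not start with http or https.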
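One caveat with (?!https?): because it does not include ://, it also skips a relative path that merely begins with the letters http. Below is a minimal sketch of a stricter variant, assuming the intent is to exempt only http:// and https:// URLs. The helper name prefix_links and the sample markup are made up for illustration.

```python
import re

# Zero-width match right after href=" unless an http:// or https:// URL follows.
RELATIVE_HREF = re.compile(r'(?<=href=")(?!https?://)')


def prefix_links(html, prefix='abc/'):
    """Insert prefix at the start of every non-http(s) href value.

    prefix_links is a hypothetical helper for this sketch.
    """
    # A function replacement sidesteps backslash-escape handling in prefix.
    return RELATIVE_HREF.sub(lambda m: prefix, html)


html = ('<a href="img/a.html"/> '
        '<a href="https://google.com"/> '
        '<a href="httpdocs/b.html"/>')
print(prefix_links(html))
```

Here the third link, whose relative path starts with httpdocs, still gets the prefix, which the shorter (?!https?) pattern would not do.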
Answer 2 (score: -1)
This is for those who can move the task to an HTML parsing library (such as BeautifulSoup):
import bs4
# this adds some content to create a valid doc, we'll ignore it
# since we don't need it
element = bs4.BeautifulSoup('<a href="img/a.html"/>')
print(element)
element.a['href'] = 'abc/' + element.a['href']
# link has changed - print element tag
print(element.a)
# to get the string, simply convert the tag with str()
print(str(element.a))
# prints: <a href="abc/img/a.html"></a>