I am parsing HTML with BeautifulSoup. Given the following HTML:
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: https://www.w3schools.com</p>
</body>
</html>
I want to convert it to:
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: <a href="https://www.w3schools.com" target="_blank">https://www.w3schools.com</a></p>
</body>
</html>
The code I have written so far:
def detect_urls_and_update_target(self, root):  # root is the soup object
    for tag in root.find_all(True):
        if tag.name == 'a':
            if not tag.has_attr('target'):
                tag.attrs['target'] = '_blank'
        elif tag.string is not None:
            # self.url_regex detects URLs; the regex itself works
            for url in re.findall(self.url_regex, tag.string):
                new_tag = root.new_tag("a", href=url, target="_blank")
                new_tag.string = url
                tag.append(new_tag)
This adds the desired anchor tag, but I can't figure out how to remove the original URL text from the tag.
Answer 0 (score: 1)
You can rebuild the parent's content with BeautifulSoup as follows:
from bs4 import BeautifulSoup
import re

html = """<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: https://www.w3schools.com</p>
<p>Another link: https://stackoverflow.com/questions/50413693/detect-url-and-add-anchor-tags-using-beautifulsoup%22</p>
<div><div>some div</div>Hello world from https://www.google.com</div>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
re_url = re.compile(r'(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)')

for tag in soup.find_all(text=True):
    tags = []
    url = False
    for t in re_url.split(tag.string):
        if re_url.match(t):
            a = soup.new_tag("a", href=t, target='_blank')
            a.string = t
            tags.append(a)
            url = True
        else:
            tags.append(t)
    if url:
        for t in tags:
            tag.insert_before(t)
        tag.extract()

print(soup)
print()
This produces the following output:
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: <a href="https://www.w3schools.com" target="_blank">https://www.w3schools.com</a></p>
<p>Another link: <a href="https://stackoverflow.com/questions/50413693/detect-url-and-add-anchor-tags-using-beautifulsoup%22" target="_blank">https://stackoverflow.com/questions/50413693/detect-url-and-add-anchor-tags-using-beautifulsoup%22</a></p>
<div><div>some div</div>Hello world from <a href="https://www.google.com" target="_blank">https://www.google.com</a></div>
</body>
</html>
This works by taking every tag that contains text and splitting that text with the URL regex. Each resulting piece that is a URL is replaced in the list by a new anchor tag; pieces that are not URLs are kept as plain strings. If no URL was found, the tag is left untouched. Finally, each item in the updated list is inserted before the existing text node, and the original text node is then removed with extract().
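As a side note on the splitting step: re.split() only keeps the matched URLs in its result because the pattern is wrapped in a capturing group. A minimal standalone sketch, using a deliberately simplified pattern rather than the full re_url above:

import re

# The parentheses form a capturing group, so the URL matches are kept in the split result.
re_url = re.compile(r'(https?://\S+)')
pieces = re_url.split("An absolute URL: https://www.w3schools.com plus trailing text")
print(pieces)
# ['An absolute URL: ', 'https://www.w3schools.com', ' plus trailing text']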
To skip any URLs inside the DOCTYPE, the find_all() call can be changed as follows:
from bs4 import BeautifulSoup, Doctype
...
for tag in soup.find_all(string=lambda text: not isinstance(text, Doctype)):
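This filter matters when a document uses a doctype that itself contains a URL, for example the HTML 4.01 Transitional doctype, which references http://www.w3.org/TR/html4/loose.dtd; without the isinstance check that URL would also end up wrapped in an anchor inside the DOCTYPE declaration. A minimal sketch of the filter in isolation (the sample markup below is my own):

from bs4 import BeautifulSoup, Doctype

html = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
        '"http://www.w3.org/TR/html4/loose.dtd">'
        '<html><body><p>Visit https://www.w3schools.com</p></body></html>')
soup = BeautifulSoup(html, "html.parser")

# Doctype is a NavigableString subclass, so it would normally appear among the text nodes.
for text_node in soup.find_all(string=lambda text: not isinstance(text, Doctype)):
    print(repr(text_node))
# Only 'Visit https://www.w3schools.com' is printed; the DTD URL in the doctype is skipped.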
Answer 1 (score: 0)
You can use re.sub inside a decorator, using the supplied arguments to wrap any URLs found in the body of the selected tags:
import re

def format_hrefs(tags=['p'], target='_blank', a_class=''):
    def outer(f):
        def format_url(url):
            # Replace every URL in the text with a '{}' placeholder ...
            _start = re.sub(r'https*://www\.[\w\W]+\.\w{3}', '{}', url)
            # ... then fill the placeholders with anchor tags built from the matched URLs.
            return _start.format(*['<a href="{}" target="{}" class="{}">{}</a>'.format(i, target, a_class, i)
                                   for i in re.findall(r'https*://www\.\w+\.\w{3}', url)])
        def wrapper():
            content = f()
            # Swap the body of each requested tag for a placeholder and keep the original bodies.
            _format = re.sub('|'.join(r'(?<=\<' + i + r'\>)[\w\W]+(?=\</' + i + r'\>)' for i in tags), '{}', content)
            _text = re.findall('|'.join(r'(?<=\<' + i + r'\>)[\w\W]+(?=\</' + i + r'\>)' for i in tags), content)
            return _format.format(*[format_url(i) for i in _text])
        return wrapper
    return outer

@format_hrefs()
def get_html():
    content = """
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: https://www.w3schools.com</p>
</body>
</html>
"""
    return content

print(get_html())
Output:
<!DOCTYPE html>
<html>
<body>
<p>An absolute URL: <a href="https://www.w3schools.com" target="_blank" class="">https://www.w3schools.com</a></p>
</body>
</html>