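I have XML that contains many URL-encoded web links. I cannot use this XML until all of the web links in it have been decoded.
I have written the following code in Python: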
import re
from urllib.parse import unquote
from transliterate import translit, get_available_language_codes
myString = """><tr><td style="text-align: center;"><a href="https://somewebsite.com/s1600/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%259E%25D0%259F%2B%25D0%2592%25D0%25A0%%25D0%2590.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="470" data-original-width="820" height="366" src="https://somewebsite.com/AAAAAAAAzAM/nhpZKVGvEWAn-UNufwn1npX7aTucSWFUwCLcBGAs/s640/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%22%25D0%2598.%2B%25D0%25A1%25D0%2590%25D0%259C%25D0%25AB%25D0%2595%90.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">;<br /><a name='more'></a><br /><br /><div align="center"><script async="" src="//pagead2.googlesyndication.com/pagead/jshttps://somewebsite.com/-_7TnRcBGpRY/%2597%25D0%259D%25D0%2590%25D0%259A%25D0%25A3%2B%25D0%2597%25D0%259E%25D0%2594%25D0%2598%25D0%2590%25D0%259A%25D0%2590.jpg"""
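# findall returns a list of URL strings, so each one is decoded separately
b = re.findall(r"(?P<url>https?://[^\s]+)", myString)
c = [unquote(unquote(url)) for url in b]
d = [translit(url, 'ru', reversed=True) for url in c]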
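Now I can: 1. decode any single link on its own 2. build an array of decoded links
But I don't know how to replace all of the encoded (original) links in myString with the links I decoded.
I have found a way to get all of the decoded links, but I really don't know how to substitute the new links for the old ones in myString.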
Answer 0: (score: 0)
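You can use html.unescape to make the string easier to parse, then use BeautifulSoup4 (via pip install bs4) to loop over all of the tags, do whatever is needed to get the src / href / whatever attribute into the shape you want, and then convert the soup object back to a string.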
from html import unescape
from urllib.parse import unquote
from bs4 import BeautifulSoup
myString = """><tr><td style="text-align: center;"><a href="https://somewebsite.com/s1600/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%259E%25D0%259F%2B%25D0%2592%25D0%25A0%%25D0%2590.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="470" data-original-width="820" height="366" src="https://somewebsite.com/AAAAAAAAzAM/nhpZKVGvEWAn-UNufwn1npX7aTucSWFUwCLcBGAs/s640/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%22%25D0%2598.%2B%25D0%25A1%25D0%2590%25D0%259C%25D0%25AB%25D0%2595%90.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">;<br /><a name='more'></a><br /><br /><div align="center"><script async="" src="//pagead2.googlesyndication.com/pagead/jshttps://somewebsite.com/-_7TnRcBGpRY/%2597%25D0%259D%25D0%2590%25D0%259A%25D0%25A3%2B%25D0%2597%25D0%259E%25D0%2594%25D0%2598%25D0%2590%25D0%259A%25D0%2590.jpg"""
soup = BeautifulSoup(unescape(myString), 'html.parser')
# loop over all elements and update any src/href attributes
for tag in soup.find_all():
    for attr in tag.attrs.keys() & {'src', 'href'}:
        # do whatever else with tag[attr] here
        tag[attr] = unquote(unquote(tag[attr]))
output = str(soup)
This gives you:
'><tr><td style="text-align: center;"><a href="https://somewebsite.com/s1600/ГОРОСКОП+ВР%А.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="470" data-original-width="820" height="366" src=\'https://somewebsite.com/AAAAAAAAzAM/nhpZKVGvEWAn-UNufwn1npX7aTucSWFUwCLcBGAs/s640/ГОРОСК�"И.+САМЫЕ�.jpg\' width="640"/></a></td></tr><tr><td class="tr-caption" style="text-align: center;">;<br/><a name="more"></a><br/><br/><div align="center"><script async="" src="//pagead2.googlesyndication.com/pagead/jshttps://somewebsite.com/-_7TnRcBGpRY/%2597%25D0%259D%25D0%2590%25D0%259A%25D0%25A3%2B%25D0%2597%25D0%259E%25D0%2594%25D0%2598%25D0%2590%25D0%259A%25D0%2590.jpg</div></td></tr>'
Of course, your mileage will vary depending on how well the parser understands the input in the first place.
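If the markup is too broken for a parser to handle, a regex-only alternative might be to pass a function to re.sub so that each matched link is replaced in place with its decoded form. The following is a minimal sketch, not a definitive solution: the sample string is a shortened, hypothetical stand-in for myString, and URL_RE and decode_match are names made up for illustration. The pattern assumes a URL ends at the first whitespace, quote, or angle bracket.

```python
import re
from urllib.parse import unquote

# Hypothetical sample input, shortened from the string in the question.
sample = '<a href="https://somewebsite.com/s1600/%25D0%2593%25D0%259E.jpg">x</a>'

# Assumption: a URL runs until the next whitespace, quote, or angle bracket.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")

def decode_match(match):
    # The links are percent-encoded twice, so unquote twice.
    return unquote(unquote(match.group(0)))

# re.sub calls decode_match for every match and splices the result back in.
decoded = URL_RE.sub(decode_match, sample)
print(decoded)
```

Note that a decoded link can itself contain a quote character (one of the links in the question contains %22), which would break the surrounding attribute, so the result may need checking before the XML is used.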