How can I replace several URLs in a text with Python?

Asked: 2018-09-16 18:38:35

Tags: python regex python-3.x

I have XML that contains many URL-encoded web links. I can't use this XML until all the web links in it are decoded.

I have written the following code in Python:

import re
from urllib.parse import unquote
from transliterate import translit, get_available_language_codes

myString = """><tr><td style="text-align: center;"><a href="https://somewebsite.com/s1600/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%259E%25D0%259F%2B%25D0%2592%25D0%25A0%%25D0%2590.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="470" data-original-width="820" height="366" src="https://somewebsite.com/AAAAAAAAzAM/nhpZKVGvEWAn-UNufwn1npX7aTucSWFUwCLcBGAs/s640/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%22%25D0%2598.%2B%25D0%25A1%25D0%2590%25D0%259C%25D0%25AB%25D0%2595%90.jpg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">;<br /><a name='more'></a><br /><br /><div align="center"><script async="" src="//pagead2.googlesyndication.com/pagead/jshttps://somewebsite.com/-_7TnRcBGpRY/%2597%25D0%259D%25D0%2590%25D0%259A%25D0%25A3%2B%25D0%2597%25D0%259E%25D0%2594%25D0%2598%25D0%2590%25D0%259A%25D0%2590.jpg"""
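# Raw string so that \s is read as a regex escape; findall returns a list of URLs
b = re.findall(r"(?P<url>https?://[^\s]+)", myString)
# unquote expects a single string, so decode each link in turn (twice, as the links are double-encoded)
c = [unquote(unquote(url)) for url in b]
d = [translit(url, 'ru', reversed=True) for url in c]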

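Now I can: 1. decode any single link; 2. build an array of decoded links.

But I don't know how to replace all the encoded links (the original ones) in myString with the decoded ones.

I have found a way to get all the decoded links, but I really don't know how to replace the old links in myString with the new ones.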

1 answer:

Answer 0: (score: 0)

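You can use html.unescape to make the string easier to parse, then use BeautifulSoup4 (installed via pip install bs4) to loop over all the tags, do whatever is needed to get the src / href / whatever attributes into shape, and then convert the soup object back to a string.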

from html import unescape
from urllib.parse import unquote
from bs4 import BeautifulSoup

myString = """&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="https://somewebsite.com/s1600/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%259E%25D0%259F%2B%25D0%2592%25D0%25A0%%25D0%2590.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" data-original-height="470" data-original-width="820" height="366" src="https://somewebsite.com/AAAAAAAAzAM/nhpZKVGvEWAn-UNufwn1npX7aTucSWFUwCLcBGAs/s640/%25D0%2593%25D0%259E%25D0%25A0%25D0%259E%25D0%25A1%25D0%259A%25D0%22%25D0%2598.%2B%25D0%25A1%25D0%2590%25D0%259C%25D0%25AB%25D0%2595%90.jpg" width="640" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;;&lt;br /&gt;&lt;a name='more'&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;div align="center"&gt;&lt;script async="" src="//pagead2.googlesyndication.com/pagead/jshttps://somewebsite.com/-_7TnRcBGpRY/%2597%25D0%259D%25D0%2590%25D0%259A%25D0%25A3%2B%25D0%2597%25D0%259E%25D0%2594%25D0%2598%25D0%2590%25D0%259A%25D0%2590.jpg"""

soup = BeautifulSoup(unescape(myString), 'html.parser')
# loop over all elements and update any src/href attributes
for tag in soup.find_all():
    for attr in tag.attrs.keys() & {'src', 'href'}:
        # do whatever else with tag[attr] here
        tag[attr] = unquote(unquote(tag[attr]))

output = str(soup)

This gives you:

'&gt;<tr><td style="text-align: center;"><a href="https://somewebsite.com/s1600/ГОРОСКОП+ВР%А.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="470" data-original-width="820" height="366" src=\'https://somewebsite.com/AAAAAAAAzAM/nhpZKVGvEWAn-UNufwn1npX7aTucSWFUwCLcBGAs/s640/ГОРОСК�"И.+САМЫЕ�.jpg\' width="640"/></a></td></tr><tr><td class="tr-caption" style="text-align: center;">;<br/><a name="more"></a><br/><br/><div align="center">&lt;script async="" src="//pagead2.googlesyndication.com/pagead/jshttps://somewebsite.com/-_7TnRcBGpRY/%2597%25D0%259D%25D0%2590%25D0%259A%25D0%25A3%2B%25D0%2597%25D0%259E%25D0%2594%25D0%2598%25D0%2590%25D0%259A%25D0%2590.jpg</div></td></tr>'
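The two nested unquote calls matter because the links in the question appear to be percent-encoded twice: a sequence such as %25D0 first decodes to %D0, and only a second pass yields Cyrillic text. A minimal sketch of that, using a short fragment copied from the first link (the variable names are illustrative only):

```python
from urllib.parse import unquote

# Fragment copied from the first link in the question
encoded = "%25D0%2593%25D0%259E"

once = unquote(encoded)   # "%25" becomes "%", leaving a singly encoded string
twice = unquote(once)     # the remaining UTF-8 byte escapes become Cyrillic

print(once)    # %D0%93%D0%9E
print(twice)   # ГО

# A lone escape that is not valid UTF-8 decodes to U+FFFD, because
# unquote uses errors='replace' by default
stray = unquote("%90")
print(repr(stray))
```

The last line suggests where the � characters in the output come from: the sample links contain stray escapes such as %90 that are not valid UTF-8 on their own, so unquote substitutes the replacement character.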

Of course, your mileage will vary depending on how well the parser makes sense of the input to begin with.
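If the markup is too broken to parse reliably, one possible alternative is to stay with regular expressions and pass a callback to re.sub, so that each matched link is replaced in place by its decoded form. This is only a sketch: the pattern, the helper names decode_url and decode_urls, and the shortened sample string are assumptions made for illustration, not part of the original code.

```python
import re
from urllib.parse import unquote

# Assumption: a URL runs until whitespace, a quote, or an angle bracket
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")

def decode_url(match):
    """Return the matched URL with two layers of percent-encoding removed."""
    return unquote(unquote(match.group(0)))

def decode_urls(text):
    """Replace every URL found in text with its decoded form."""
    return URL_RE.sub(decode_url, text)

# Hypothetical, shortened sample based on the first link in the question
sample = '<a href="https://somewebsite.com/s1600/%25D0%2593%25D0%259E.jpg">x</a>'
print(decode_urls(sample))
```

Because re.sub calls decode_url once per match, there is no need to keep the list of old and new links in step. One caveat: a decoded link may contain characters such as a double quote (from %22), which would break the surrounding attribute, so the result may still need escaping before it is used as markup.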