Question

I have a string like this:

url = 'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'

I would like to convert it to:

converted_url = 'https://scholar.google.pl/citations?view_op=search_authors&hl=en&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10'

I tried this:

converted_url = url.decode('utf-8')

However, it raises this error:

AttributeError: 'str' object has no attribute 'decode'

Answer 1

decode is used to convert bytes to a string. Your URL is already a string, not bytes.
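A minimal sketch of that distinction, assuming Python 3 (the variable names here are only illustrative and are not from the question):

```python
# In Python 3, bytes objects have .decode() and str objects have .encode().
raw = b'view_op\\x3dsearch'   # bytes that contain a literal backslash
text = raw.decode('utf-8')    # bytes -> str

# A str has no .decode() method, which is why the question's call fails.
has_decode = hasattr(text, 'decode')
print(type(raw).__name__, type(text).__name__, has_decode)
```

Calling text.decode('utf-8') at this point would raise the same AttributeError shown in the question.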

You can use encode to convert this string to bytes, and then use decode with the unicode_escape codec to get the correct string.

(I use the r prefix to simulate text that has this problem; without the prefix, the url would not need to be converted.)

url = r'http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10'
print(url)

# Encode to bytes, then decode with unicode_escape to resolve the \xNN sequences.
url = url.encode('utf-8').decode('unicode_escape')
print(url)

Result:

http://scholar.google.pl/citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dlabel:security\x26after_author\x3drukAAOJ8__8J\x26astart\x3d10

http://scholar.google.pl/citations?view_op=search_authors&hl=pl&oe=Latin2&mauthors=label:security&after_author=rukAAOJ8__8J&astart=10

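BTW: first check print(url). You may already have the correct URL but be displaying it in a misleading way. When you evaluate an expression without print(), the Python shell shows the result as if by print(repr()), which displays some characters as escape codes to show which encoding is used in the text (utf-8, iso-8859-1, win-1250, latin-1, etc.).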
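The check described above can be sketched as follows, assuming Python 3. The shortened strings and the names plain, raw, and fixed are illustrative only:

```python
# Without the r prefix, Python resolves \x3d and \x26 while parsing the literal,
# so the string already contains '=' and '&'.
plain = 'view_op\x3dsearch_authors\x26hl\x3dpl'

# With the r prefix, the backslashes stay in the string as real characters.
raw = r'view_op\x3dsearch_authors\x26hl\x3dpl'

print(plain)       # the characters as they really are
print(raw)         # still shows the \x3d and \x26 sequences
print(repr(raw))   # repr() additionally doubles each backslash

# Only the raw variant needs the conversion from the answer.
fixed = raw.encode('utf-8').decode('unicode_escape')
print(fixed == plain)
```

If print(url) already shows = and & in your own program, the string is fine and no conversion is needed.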
