How to access a URL if it contains accented characters

Time: 2015-09-15 01:54:08

Tags: python utf-8 character-encoding urllib

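So I'm developing a script that automatically downloads and writes data from a web service that delivers its information in JSON format. The data is about Canadian political parties, so accented characters come up often.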

For example, to access the data for the candidates representing the "Bloc Québécois" party, I need to access this URL:

https://represent.opennorth.ca/candidates/house-of-commons/?limit=1000&party_name=Bloc%20Qu%C3%A9b%C3%A9cois

Unfortunately, the simple solution of replacing é with e doesn't work.

So my script looks like this:

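I know this has to do with UTF-8 encoding, but I'm having a hard time getting around it, and the other links I found here and on other sites didn't help.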

I tried adding .encode('utf-8') in the urlopen call, like this:

import urllib.request

#party_name_list = ["Conservative", "Liberal", "NDP", "Green%20Party", "Bloc%20Québécois", "Forces%20et%20Démocratie", "Libertarian", "Christian%20Heritage"]
party_name_list = ["Bloc%20Québécois"]

for party_name in party_name_list:
    with urllib.request.urlopen(r"https://represent.opennorth.ca/candidates/house-of-commons/?limit=1000&party_name={}".format(party_name)) as url:
        with open(r"F:\electoral_map\20150914\candidates\candidates_{0}.json".format(party_name), "wb+") as f:
            f.write(url.read())
    print("finished {0}".format(party_name))
print("all done")

But this only makes the file come back empty, because it now calls the URL:

https://represent.opennorth.ca/candidates/house-of-commons/?limit=1000&party_name=b'Bloc%20Qu\xc3\xa9b\xc3\xa9cois'

Can someone help me understand how to sort out this mess?

2 answers:

Answer 0: (score: 1)

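I solved it, but I don't think it's the most elegant solution, and honestly I don't fully understand it. Maybe someone can explain it better, but using urllib.parse.unquote_plus() helped me: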

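For illustration, here is a minimal sketch of how the functions in urllib.parse relate to the URL in the question. It is not necessarily the code this answer used: the helper name build_url and the constant BASE_URL are made up for the example, and it assumes the service expects UTF-8 percent-encoding, as the working URL in the question suggests.

```python
from urllib.parse import quote, unquote_plus

# Endpoint taken from the question; the constant name is hypothetical.
BASE_URL = "https://represent.opennorth.ca/candidates/house-of-commons/"


def build_url(party_name, limit=1000):
    """Hypothetical helper: percent-encode a readable party name into the query."""
    # quote() encodes the str as UTF-8 by default, then percent-encodes each
    # byte that is not allowed to appear literally in a URL.
    return "{}?limit={}&party_name={}".format(BASE_URL, limit, quote(party_name))


readable = "Bloc Québécois"
encoded = quote(readable)

print(encoded)                # percent-encoded form, safe to put in the URL
print(unquote_plus(encoded))  # decoded back to readable text, e.g. for a file name
print(build_url(readable))
```

Keeping the readable name in the list and encoding it only when the URL is built means the same string can be used unchanged for the output file name.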

Answer 1: (score: 1)

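You are mixing apples and oranges. The bytes used to represent a string like "Québécois" or " " depend on the character set and encoding. Modern websites will usually use UTF-8 in URLs, but there is no guarantee.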
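A small sketch of that distinction, using only the standard library. The variable names are chosen for the example. It also shows one likely reason for the b'...' text in the failing URL from the question: formatting a bytes object into a str inserts its printed representation, not its contents.

```python
text = "Québécois"

# The same text becomes different bytes under different encodings.
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")
print(as_utf8)    # two bytes per é
print(as_latin1)  # one byte per é

# Mixing bytes into a str template does not decode them: str.format()
# falls back to the bytes object's textual representation, b'...'.
broken = "party_name={}".format("Bloc%20Québécois".encode("utf-8"))
print(broken)
```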

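In UTF-8 (and basically every other modern encoding), a space is represented by the single byte 0x20, which is what you see URL-encoded as %20. The character é (U+00E9) is encoded as the byte sequence 0xC3 0xA9 (though note that it can equivalently be decomposed into 0x65 0xCC 0x81!), and applying URL encoding to those bytes yields %C3%A9.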
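The two steps described above (text to bytes, then bytes to percent-escapes) can be checked with a short sketch. The variable names are illustrative; quote() and its encoding parameter are part of urllib.parse.

```python
from urllib.parse import quote

e_acute = "\u00e9"  # é as the single code point U+00E9

# Step 1: encode the character to bytes.
utf8_bytes = e_acute.encode("utf-8")
print(utf8_bytes.hex())

# Step 2: percent-encode those bytes. quote() does both steps at once,
# using UTF-8 unless another encoding is requested.
print(quote(e_acute))
print(quote(" "))

# With a different encoding the escapes differ, which is why the
# server and the client have to agree on one.
print(quote(e_acute, encoding="latin-1"))
```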

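But anyway, as you discovered, urllib handles this nicely and transparently for you, so you don't really need to understand the above. I think the code you mention in your own answer is correct and idiomatic.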

Getting this right in the general case requires understanding at least the most common character encodings, as well as Unicode normalization.
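To make the normalization point concrete, here is a hedged sketch. The helper quote_nfc is a made-up name, and normalizing to NFC is an assumption about what a given server expects, not a rule; some services may compare against a different form.

```python
import unicodedata
from urllib.parse import quote

composed = "\u00e9"     # é as one code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# Both render as é, but they are different strings and produce
# different percent-encoded URLs.
print(composed == decomposed)
print(quote(composed))
print(quote(decomposed))


def quote_nfc(text):
    """Hypothetical helper: normalize to NFC before percent-encoding."""
    return quote(unicodedata.normalize("NFC", text))


# After normalization both spellings give the same result.
print(quote_nfc(composed) == quote_nfc(decomposed))
```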