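Last month I wrote a scraper for this Latin dictionary. It has finally finished running (the site gave me response times of 6 to 8 seconds per page). Unfortunately, I found that a large part of my data is badly corrupted, e.g. commandūcor ----> command\xc5\xabcor || commandūcāris ----> command\xc5\xabc\xc4\x81ris
I made the stupid mistake of calling str() on the raw data I got back from the request, like this: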
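import requests

# verify=False disables TLS certificate verification
r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)
out = str(r.content)  # the mistake: this yields the repr "b'...'" instead of decoded text
with open("test.html", 'w') as file:
    file.write(out)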
I would really appreciate it if someone could help me recover the broken text. Thanks in advance!
Answer 0 (score: 4)
Just use .decode with utf-8 (the default):
b'command\xc5\xabcor'.decode() # 'commandūcor'
b'command\xc5\xabc\xc4\x81ris'.decode() # 'commandūcāris'
You can read more about character encodings in Python.
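Answer 1 (score: 1)
r.content returns bytes. (In contrast, r.text returns a str. The requests module tries to guess the correct encoding from the HTTP headers and decodes the bytes for you using that encoding. In the future, that may be what you want to use.)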
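To avoid the problem in future runs, one option is to store the raw bytes untouched and decode only when reading them back. The following is a minimal sketch, not part of the original answer: save_page and load_page are hypothetical helper names, and the sample bytes stand in for r.content so that it runs without a network request.

```python
import os
import tempfile


def save_page(content: bytes, path: str) -> None:
    """Write the raw response body unchanged (hypothetical helper)."""
    with open(path, "wb") as f:  # binary mode: no str() call, no re-encoding
        f.write(content)


def load_page(path: str, encoding: str = "utf-8") -> str:
    """Read the saved bytes back and decode them explicitly (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read().decode(encoding)


# Stand-in for r.content; assumed to be UTF-8, as the page declares.
sample = b"command\xc5\xabcor"
path = os.path.join(tempfile.mkdtemp(), "test.html")
save_page(sample, path)
restored = load_page(path)  # the decoded text, with the macron intact
```

Keeping the bytes on disk means the choice of encoding can still be revisited later without scraping the site again.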
If r.content contains bytes such as b'command\xc5\xabcor', then str(r.content) returns a str that begins with the characters b' and ends with a literal '.
In [45]: str(b'command\xc5\xabcor')
Out[45]: "b'command\\xc5\\xabcor'"
You can recover the bytes with ast.literal_eval:
In [46]: ast.literal_eval(str(b'command\xc5\xabcor'))
Out[46]: b'command\xc5\xabcor'
You can then decode those bytes to a str. The URL you posted declares that its content is UTF-8 encoded:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Assuming all the data you downloaded uses the same encoding, you can restore the content to a str by calling the bytes.decode('utf-8') method:
In [47]: ast.literal_eval(str(b'command\xc5\xabcor')).decode('utf-8')
Out[47]: 'commandūcor'
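The two steps above (literal_eval, then decode) can be wrapped in a small function and applied to each corrupted record. This is only a sketch: repair is a hypothetical name, and it assumes every record is the complete output of str() on a bytes object and that the bytes are UTF-8.

```python
import ast


def repair(broken: str, encoding: str = "utf-8") -> str:
    """Undo str(bytes): parse the "b'...'" literal, then decode the bytes."""
    value = ast.literal_eval(broken)  # only evaluates literals, never arbitrary code
    if not isinstance(value, bytes):
        raise ValueError("input is not the repr of a bytes object")
    return value.decode(encoding)


# str(b'...') reproduces the corruption described in the question.
fixed = repair(str(b"command\xc5\xabc\xc4\x81ris"))
```

ast.literal_eval is used rather than eval because it accepts only Python literals, which matters when the input is scraped data.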
import ast
import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)

# Reproduce the mistake: write the repr of the bytes to disk.
out = str(r.content)
with open("test.html", 'w') as file:
    file.write(out)

# Repair: parse the repr back into bytes, then decode as UTF-8.
with open("test.html", 'r') as f_in, open("test-fixed.html", 'w', encoding='utf-8') as f_out:
    broken_text = f_in.read()
    content = ast.literal_eval(broken_text)
    assert content == r.content
    text = content.decode('utf-8')
    f_out.write(text)
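The examples in the question show fragments such as command\xc5\xabcor with no surrounding b'...', which ast.literal_eval cannot parse on their own. If your stored data looks like that, a possible alternative is to go through the unicode_escape codec. This is a sketch under the assumption that each fragment is pure ASCII with backslash escapes and that the underlying bytes are UTF-8; repair_fragment is a hypothetical name.

```python
def repair_fragment(fragment: str, encoding: str = "utf-8") -> str:
    """Turn literal escapes such as the four characters \\xc5 back into text."""
    raw = (
        fragment.encode("ascii")       # fails loudly if the fragment is not pure ASCII
        .decode("unicode_escape")      # "\\xc5" (four characters) -> chr(0xC5)
        .encode("latin-1")             # chr(0xC5) -> the single byte 0xC5
    )
    return raw.decode(encoding)


# A raw string keeps the backslashes literal, as they appear in the damaged data.
word = repair_fragment(r"command\xc5\xabc\xc4\x81ris")
```

The latin-1 round trip works because that codec maps code points 0 to 255 directly onto the byte values 0 to 255.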