Repairing corrupted text after the fact

Time: 2018-02-04 16:33:07

Tags: python string python-3.x encoding

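Last month I wrote a scraper for this Latin dictionary. It has finally finished running (the site gave me a response time of 6 to 8 seconds per page). Unfortunately, I discovered that a large part of my data is badly corrupted, e.g. commandūcor ----> command\xc5\xabcor || commandūcāris ----> command\xc5\xabc\xc4\x81ris

I made the silly mistake of calling str() on the raw data I got back from the request. Like this: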

import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)

out = str(r.content)  # the mistake: this stores the repr of the bytes ("b'...'"), not the decoded text

with open("test.html", 'w') as file:
    file.write(out)

I would really appreciate it if someone could help me recover the broken text. Thanks in advance!

2 Answers:

Answer 0: (score: 4)

Just use .decode() (utf-8 is the default):

b'command\xc5\xabcor'.decode()           # 'commandūcor'
b'command\xc5\xabc\xc4\x81ris'.decode()  # 'commandūcāris'

You can read more about character encodings in the Python documentation.
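Note that the calls above start from real bytes objects. If what you have is a fragment of text pulled out of the saved file, the backslashes are literal characters and there is nothing to call .decode() on yet. A minimal sketch of one way to handle that case, assuming the fragment is pure ASCII (which is what str() produces for bytes); unescape_fragment is a hypothetical helper name, not something from the question:

```python
def unescape_fragment(fragment: str) -> str:
    """Turn text with literal \\xNN escapes into the UTF-8 text it encodes."""
    # str(bytes) output is ASCII-only, so this encode cannot fail on such input.
    raw = fragment.encode("latin-1")
    # Interpret the escapes: each \xNN becomes one code point in the range 0-255.
    code_points = raw.decode("unicode_escape")
    # Map those code points back to single bytes, then decode them as UTF-8.
    return code_points.encode("latin-1").decode("utf-8")


broken = r"command\xc5\xabc\xc4\x81ris"  # backslashes are literal characters here
fixed = unescape_fragment(broken)        # 'commandūcāris'
```

This only makes sense for fragments; if the whole file is a b'...' literal, parsing the literal as a whole is simpler.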

Answer 1: (score: 1)

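r.content returns bytes. (By contrast, r.text returns a str: the requests module tries to guess the correct encoding from the HTTP headers and decodes the bytes for you using that encoding. In the future, that may be what you want to use.)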
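For future scraping runs, this suggests two safe ways to save a page: write the bytes unchanged in binary mode, or write the decoded text with an explicit encoding. A rough sketch, using a literal bytes value as a stand-in for r.content so it runs without network access; save_raw and save_text are hypothetical helper names:

```python
import os
import tempfile


def save_raw(content: bytes, path: str) -> None:
    # Binary mode: the bytes reach the disk exactly as they were received.
    with open(path, "wb") as f:
        f.write(content)


def save_text(text: str, path: str) -> None:
    # Text mode with an explicit encoding instead of the platform default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


content = b'command\xc5\xabcor'  # stand-in for r.content
with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.html")
    text_path = os.path.join(tmp, "text.html")
    save_raw(content, raw_path)
    save_text(content.decode("utf-8"), text_path)  # stand-in for r.text
    with open(raw_path, "rb") as f1, open(text_path, "rb") as f2:
        same = f1.read() == f2.read()  # both files hold the same UTF-8 bytes
```

Saving the raw bytes has the advantage that no encoding guess is involved at download time; the decoding decision can be made later.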

If r.content contains bytes such as b'command\xc5\xabcor', then str(r.content) returns a str that begins with the characters b' and ends with a literal '.

In [45]: str(b'command\xc5\xabcor')
Out[45]: "b'command\\xc5\\xabcor'"
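To make the difference concrete: with one argument, str() on a bytes object gives its repr, while passing an encoding as well actually decodes it. A small sketch of the two forms:

```python
data = b'command\xc5\xabcor'

as_repr = str(data)           # one-argument form: same as repr(data)
as_text = str(data, 'utf-8')  # with an encoding: same as data.decode('utf-8')

assert as_repr == repr(data)
assert as_text == data.decode('utf-8')

# 12 bytes, a 21-character repr, and 11 characters of decoded text
print(len(data), len(as_repr), len(as_text))  # 12 21 11
```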

You can recover the bytes using ast.literal_eval:

In [46]: ast.literal_eval(str(b'command\xc5\xabcor'))
Out[46]: b'command\xc5\xabcor'

You can then decode those bytes to a str. The URL you posted declares that its content is UTF-8 encoded:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Assuming all the data you downloaded uses the same encoding, you can recover the content as a str by calling the bytes.decode('utf-8') method:

In [47]: ast.literal_eval(str(b'command\xc5\xabcor')).decode('utf-8')
Out[47]: 'commandūcor'

Putting it together:

import ast
import requests

r = requests.get("https://www.dizionario-latino.com/dizionario-latino-flessione.php?lemma=COMMANDUCOR100", verify=False)

out = str(r.content)

with open("test.html", 'w') as file:
    file.write(out)

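# Repair: parse the saved "b'...'" text back into bytes, then decode it.
with open("test.html", 'r') as f_in, open("test-fixed.html", 'w', encoding='utf-8') as f_out:
    broken_text = f_in.read()
    content = ast.literal_eval(broken_text)  # str -> bytes
    assert content == r.content
    text = content.decode('utf-8')           # bytes -> str
    f_out.write(text)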
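Since you already have a whole scrape on disk, you will probably want to apply this to many files without downloading anything again. A sketch of how that might look, assuming every broken file contains exactly one str(bytes) literal and the pages are UTF-8; repair_file is a hypothetical helper name, and the temporary file only stands in for one of your saved pages:

```python
import ast
import os
import tempfile


def repair_file(src: str, dst: str, encoding: str = "utf-8") -> bool:
    """Write the repaired text to dst; return False if src is not a str(bytes) dump."""
    with open(src, "rb") as f:
        raw = f.read().strip()
    # str(bytes) always produces text of the form b'...' or b"...".
    if not raw.startswith((b"b'", b'b"')):
        return False
    try:
        content = ast.literal_eval(raw.decode("ascii"))
        if not isinstance(content, bytes):
            return False
        text = content.decode(encoding)
    except (ValueError, SyntaxError):
        # Not a valid literal, or not valid text in the assumed encoding.
        return False
    # newline="" keeps the page's own line endings untouched on every platform.
    with open(dst, "w", encoding=encoding, newline="") as f:
        f.write(text)
    return True


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.html")
    dst = os.path.join(tmp, "test-fixed.html")
    with open(src, "w") as f:
        f.write(str(b'command\xc5\xabcor'))  # reproduce the original mistake
    ok = repair_file(src, dst)
    with open(dst, encoding="utf-8") as f:
        repaired = f.read()  # 'commandūcor'
```

Returning False instead of raising lets a loop over a directory skip files that were saved correctly and report the ones that need a closer look.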