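Fixing Facebook JSON encoding in Node.js

Time: 2019-07-16 14:38:33

Tags: javascript node.js encoding character-encoding iconv

I am trying to decode the JSON you get from Facebook when you download your data. I am using Node JS. The data contains many strange unicode escapes that don't really make sense. Example: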

"messages": [
    {
      "sender_name": "Emily Chadwick",
      "timestamp_ms": 1480314292125,
      "content": "So sorry that was in my pocket \u00f0\u009f\u0098\u0082\u00f0\u009f\u0098\u0082\u00f0\u009f\u0098\u0082",
      "type": "Generic"
    }
]

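This should decode to So sorry that was in my pocket 😂😂😂. Using fs.readFileSync(filename, "utf8") instead gives me So sorry that was in my pocket ððð, which is mojibake.

This question mentions that the data has been mangled through latin1 encoding, and that you can encode as latin1 and then decode as utf8. I tried to do that with the following: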

import fs from 'fs';
import iconv from 'iconv-lite';
function readFileSync_fixed(filename) {
    var content = fs.readFileSync(filename, "binary");
    return iconv.decode(iconv.encode(content, "latin1"), "utf-8")
}
console.log(JSON.parse(readFileSync_fixed(filename)))

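But I still get the mojibake version. Can anyone point me in the right direction? I am not familiar with how iconv works in this respect.

2 answers:

Answer 0: (score: 1)

There is a very simple solution for this.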

First install the utf8 package:

npm i utf8

Then use it in your code.
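For illustration, the same kind of repair can be sketched without any extra package, using Node's built-in Buffer. This is an assumption about the general approach, not the code from this answer, and the helper name fixMojibake is hypothetical. It takes the string that JSON.parse() produces from the example in the question, turns each character back into a single byte, and decodes those bytes as UTF-8.

```javascript
// Hypothetical helper (not from the original answer): undo the latin1 mojibake.
function fixMojibake(text) {
    // Each character in the broken string stands for one byte (U+0000..U+00FF),
    // so encoding as latin1 recovers the original bytes; decoding those bytes
    // as UTF-8 then yields the intended characters.
    return Buffer.from(text, "latin1").toString("utf8");
}

// A shortened "content" value in the same form as it appears in the JSON file.
const raw = '"So sorry that was in my pocket \\u00f0\\u009f\\u0098\\u0082"';
const broken = JSON.parse(raw);   // still mojibake after parsing
const fixed = fixMojibake(broken);

console.log(fixed === "So sorry that was in my pocket \u{1F602}"); // true
```

Plain ASCII text passes through this round trip unchanged, so the helper is only risky for strings that legitimately contain non-ASCII characters and were not mangled in the first place.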

Answer 1: (score: 0)

Solved it, sort of... If there is a better way, please let me know.

So, here are the modified functions:

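import fs from 'fs';
import iconv from 'iconv-lite';

function readFacebookJson(filename) {
    // Read the file as UTF-8; the \uXXXX escapes are still plain ASCII here
    const content = fs.readFileSync(filename, "utf8");
    return JSON.parse(content);
}

function fixEncoding(string) {
    // Turn the mojibake characters back into bytes, then decode them as UTF-8
    return iconv.decode(iconv.encode(string, "latin1"), "utf8");
}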

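It is not readFileSync() that messes things up, but JSON.parse(). The \u00f0-style escapes are plain ASCII in the file, and only JSON.parse() turns each of them into a character. So we read the file as utf8 as usual, but the latin1 encode/decode then has to be applied to the strings that are now properties of the parsed JSON, not to the whole JSON file before parsing. I did it with map().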

const messages = readFacebookJson(filename).messages.map(message => {
    const toReturn = message;
    toReturn.sender_name = fixEncoding(toReturn.sender_name);
    // Not every message has a content property
    if (typeof message.content !== "undefined") {
        toReturn.content = fixEncoding(message.content);
    }
    return toReturn;
});

The catch here, of course, is that some properties may be missing. So make sure you know which properties contain what.
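One way to avoid tracking which properties exist is to walk the whole parsed structure and repair every string value in it. The following is a minimal sketch under the assumption that every string in the export is affected in the same way. fixAllStrings is a hypothetical helper name, and fixEncoding here uses Node's built-in Buffer instead of iconv-lite so that the example runs without extra packages.

```javascript
// Same latin1 -> utf8 round trip as fixEncoding in the answer, but with Node's
// built-in Buffer instead of iconv-lite (a swapped-in equivalent for this case).
function fixEncoding(text) {
    return Buffer.from(text, "latin1").toString("utf8");
}

// Hypothetical helper: recursively repair every string value in parsed JSON.
function fixAllStrings(value) {
    if (typeof value === "string") {
        return fixEncoding(value);
    }
    if (Array.isArray(value)) {
        return value.map(item => fixAllStrings(item));
    }
    if (value !== null && typeof value === "object") {
        const result = {};
        for (const [key, inner] of Object.entries(value)) {
            result[key] = fixAllStrings(inner);
        }
        return result;
    }
    // Numbers, booleans and null are returned unchanged.
    return value;
}

// Small inline sample shaped like the export in the question;
// the second message deliberately has no "content" property.
const sample = JSON.parse(
    '{"messages": [' +
    '{"sender_name": "Emily Chadwick", "timestamp_ms": 1480314292125,' +
    ' "content": "pocket \\u00f0\\u009f\\u0098\\u0082", "type": "Generic"},' +
    '{"sender_name": "Emily Chadwick", "timestamp_ms": 1480314292126, "type": "Generic"}' +
    ']}'
);
const repaired = fixAllStrings(sample);
console.log(repaired.messages[0].content);
```

Because the walk only touches values that are actually present, messages without a content property pass through untouched and no per-property checks are needed. Object keys are left as they are in this sketch.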