Here is a minimal example: I have a JSON file, xaa.json, with the following content (two rows from the Stack Overflow archive):
[
{"Id": 11, "Body": "<p>Given a specific <code>DateTime</code> value", "Title": "Calculate relative time in C#", "Comments": "There is the .net package https://github.com/NickStrupat/TimeAgo which pretty much does what is being asked."},
{"Id": 7888, "Body": "<p>You need to use an <a href=\\"http://en.cppreference.com/w/cpp/io/basic_ifstream\\" rel=\\"noreferrer\\"><code>ifstream</code></a> if you just want to read (use an <code>ofstream</code> to write, or an <code>fstream</code> for both).</p>

<p>To open a file in text mode, do the following:</p>

<pre><code>ifstream in(\\"filename.ext\\", ios_base::in); // the in flag is optional
</code></pre>

<p>To open a file in binary mode, you just need to add the \\"binary\\" flag.</p>

<pre><code>ifstream in2(\\"filename2.ext\\", ios_base::in | ios_base::binary ); 
</code></pre>

<p>Use the <a href=\\"http://en.cppreference.com/w/cpp/io/basic_istream/read\\" rel=\\"noreferrer\\"><code>ifstream.read()</code></a> function to read a block of characters (in binary or text mode). Use the <a href=\\"http://en.cppreference.com/w/cpp/string/basic_string/getline\\" rel=\\"noreferrer\\"><code>getline()</code></a> function (it's global) to read an entire line.</p>
", "Title": null, "Comments": "+1 for noting that the global getline() function is to be used instead of the member function."}
]
I want to load JSON files like this into a dask dataframe. I use:
so_posts_df = dd.read_json('./xaa.json', orient='columns').compute()
I get this error:
ValueError: Unexpected character found when decoding object value
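After inspecting the content, I found that the error is caused by the "\\" sequences. When I remove them (my editor, IntelliJ, then reports the file as clean, well-formed JSON) and run the same read_json call, the file is read into a dataframe and displayed correctly.
So I have two questions: (a) What values does the read_json parameter "errors" accept? (b) How should I preprocess the JSON file before reading it into a dask dataframe? The double quotes combined with the doubled escape characters seem to be causing the problem.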
[This may not be a real problem at all...] ...
Answer 0 (score: 1)
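This also fails with pandas.read_json. I suggest getting it to work with pandas first, and then trying the same workload with dask dataframe. You may also get better support if you ask this as a pandas question.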
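For question (b), one possible preprocessing step is sketched below using only the standard library. It assumes every \\" in the file is a double-escaped quote that should have been \", which may not hold for all rows (a value that legitimately ends in a backslash would be corrupted). The helper names repair_double_escaped_quotes, load_records and to_json_lines are made up for illustration. Passing strict=False to json.loads is what lets the decoder accept the literal newlines inside the "Body" strings.

```python
import json


def repair_double_escaped_quotes(raw_text):
    """Replace backslash-backslash-quote with backslash-quote.

    Assumption: every such sequence is an over-escaped quote.
    """
    return raw_text.replace('\\\\"', '\\"')


def load_records(raw_text):
    """Parse the repaired text into a list of dicts."""
    repaired = repair_double_escaped_quotes(raw_text)
    # strict=False permits control characters (such as raw newlines)
    # inside JSON strings, which the default decoder rejects.
    return json.loads(repaired, strict=False)


def to_json_lines(records):
    """Serialize one record per line; json.dumps escapes inner newlines."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


# Hypothetical usage with the file from the question:
# with open("xaa.json", encoding="utf-8") as src:
#     records = load_records(src.read())
# with open("xaa.clean.jsonl", "w", encoding="utf-8") as dst:
#     dst.write(to_json_lines(records))
```

The cleaned file holds one JSON object per line, so it should be readable with read_json(..., orient='records', lines=True) in pandas, and then in dask once the pandas call works.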