dask - read_json into a dataframe raises ValueError

Date: 2019-11-07 20:39:03

Tags: dask

Here is a minimal example: I have a JSON file, xaa.json, with the following content (two rows from the Stack Overflow archive):

[
  {"Id": 11, "Body": "<p>Given a specific <code>DateTime</code> value", "Title": "Calculate relative time in C#", "Comments": "There is the .net package https://github.com/NickStrupat/TimeAgo which pretty much does what is being asked."},
  {"Id": 7888, "Body": "<p>You need to use an <a href=\\"http://en.cppreference.com/w/cpp/io/basic_ifstream\\" rel=\\"noreferrer\\"><code>ifstream</code></a> if you just want to read (use an <code>ofstream</code> to write, or an <code>fstream</code> for both).</p>&#xA;&#xA;<p>To open a file in text mode, do the following:</p>&#xA;&#xA;<pre><code>ifstream in(\\"filename.ext\\", ios_base::in); // the in flag is optional&#xA;</code></pre>&#xA;&#xA;<p>To open a file in binary mode, you just need to add the \\"binary\\" flag.</p>&#xA;&#xA;<pre><code>ifstream in2(\\"filename2.ext\\", ios_base::in | ios_base::binary ); &#xA;</code></pre>&#xA;&#xA;<p>Use the <a href=\\"http://en.cppreference.com/w/cpp/io/basic_istream/read\\" rel=\\"noreferrer\\"><code>ifstream.read()</code></a> function to read a block of characters (in binary or text mode).  Use the <a href=\\"http://en.cppreference.com/w/cpp/string/basic_string/getline\\" rel=\\"noreferrer\\"><code>getline()</code></a> function (it's global) to read an entire line.</p>&#xA;", "Title": null, "Comments": "+1 for noting that the global getline() function is to be used instead of the member function."}
]

I want to load JSON files like this into a dask dataframe. I use:

import dask.dataframe as dd
so_posts_df = dd.read_json('./xaa.json', orient='columns').compute()

I get this error:

ValueError: Unexpected character found when decoding object value

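After looking at the content, I found that the error is caused by the "\\" sequences. When I remove them (my editor, IntelliJ, then reports the file as clean, well-formed JSON) and run the same read_json call, it reads the file into a dataframe and displays it correctly.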

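So I have 2 questions: (a) What value should the read_json parameter "errors" take? (b) How do I correctly preprocess the JSON file before reading it into a dask dataframe? The combination of double quotes and doubled escape characters seems to be causing the problem.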

[This may not be a dask problem at all...] ...

1 answer:

Answer 0: (score: 1)

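This also fails with pandas.read_json. I suggest first getting it to work with pandas, and only then trying the same workload with dask dataframe. You will probably get better support if you ask this as a pandas question.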
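
To narrow this down before involving pandas or dask, the same check can be made with Python's standard json module. The sketch below is only an illustration: the sample record and the helper names is_valid_json and fix_double_escaped_quotes are made up for this example, and the fix assumes that every \\" in the file was meant to be \" (an escaped quote). As far as I can tell from the dask documentation, the "errors" parameter controls how text decoding errors are handled (as in str.encode), so it would not be expected to help with malformed JSON like this.

```python
import json

# Hypothetical one-record sample in the same shape as xaa.json.
# Inside a JSON string, \\" is an escaped backslash followed by a bare
# quote, so the string ends early and parsing fails on the next character.
broken = r'[{"Id": 1, "Body": "<a href=\\"http://example.com\\">x</a>"}]'


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def fix_double_escaped_quotes(text):
    """Turn each \\" into \" (assumption: that is what the file intended).

    Caveat: a string that legitimately ends in an escaped backslash,
    such as "C:\\", would be damaged by this blanket replacement.
    """
    return text.replace('\\\\"', '\\"')


fixed = fix_double_escaped_quotes(broken)
records = json.loads(fixed)

print(is_valid_json(broken))  # the raw text does not parse
print(is_valid_json(fixed))   # the cleaned text does
print(records[0]["Body"])
```

If the cleaned text parses here, writing it back to a file and retrying pandas.read_json, and then dd.read_json, would be the natural next step.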