Question

I'm new to Spark. I have a DataFrame containing some analysis results. I converted that DataFrame to JSON so I can display it in a Flask app:

results = result.toJSON().collect()

A sample entry from my JSON output is shown below. I then tried to run a for loop to get specific results:

{"userId":"1","systemId":"30","title":"interest"}

for i in results:
    print i["userId"]

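This doesn't work at all. I get errors such as: Python (json): TypeError: expected string or buffer

I tried json.dumps and json.loads, but still nothing. I keep getting errors such as "string indices must be integers", as well as the error above.

Then I tried this: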

  print i[0]

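This gave me the first character of the JSON, "{", instead of the first row. I really don't know what to do. Can anyone tell me where I'm going wrong?

Thank you very much.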

Answer 1

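If the result of result.toJSON().collect() is a JSON-encoded string, then you can use json.loads() to convert it to a dict. The issue you're running into is that when you iterate over a dict with a for loop, you get the keys of the dict. In your for loop, you're treating the key as if it were a dict, when in fact it is just a string. Try this: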

import json

# toJSON() turns each row of the DataFrame into a JSON string;
# calling first() on the result fetches the first row.
results = json.loads(result.toJSON().first())

for key in results:
    print(results[key])

# To decode the entire DataFrame, iterate over the result
# of toJSON().

def print_rows(row):
    data = json.loads(row)
    for key in data:
        print("{key}:{value}".format(key=key, value=data[key]))


# Note: foreach runs on the executors, so on a cluster the
# printed output ends up in the executor logs.
results = result.toJSON()
results.foreach(print_rows)

EDIT: The problem is that collect() returns a list, not a dict. I've updated the code. Be sure to read the documentation.
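To make that concrete, here is a minimal sketch that reproduces the situation without Spark. It assumes collect() handed back a plain list of JSON strings, and the list below is just a stand-in built from the sample row in the question. Passing the whole list to json.loads() is what raises the "expected string or buffer" TypeError on Python 2, and indexing one of the strings with a key is what raises "string indices must be integers".

```python
import json

# Stand-in for result.toJSON().collect(): a plain list of JSON strings.
# The single row is the sample entry from the question.
results = ['{"userId":"1","systemId":"30","title":"interest"}']

first = results[0]
print(type(first).__name__)  # each element is a string, not a dict
print(first[0])              # indexing a string gives its first character

# Parse each string into a dict before looking up keys.
rows = [json.loads(row) for row in results]
for row in rows:
    print(row["userId"])
```

Each element has to go through json.loads() on its own; after that, row["userId"] behaves the way the original loop expected.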

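collect() returns a list that contains all of the elements in this RDD.

Note: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.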
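If the result might not be small, one option is to parse only as many rows as the page needs. The sketch below is an illustration, not part of the original code: parse_first_rows is a hypothetical helper, the second sample row is made up, and the plain iterator stands in for something like result.toJSON().toLocalIterator(), which returns rows to the driver incrementally instead of all at once.

```python
import json
from itertools import islice


def parse_first_rows(json_rows, limit):
    """Parse at most `limit` JSON strings from any iterable of rows."""
    return [json.loads(row) for row in islice(json_rows, limit)]


# Stand-in for result.toJSON().toLocalIterator(). The second row is
# invented purely for illustration.
sample = iter([
    '{"userId":"1","systemId":"30","title":"interest"}',
    '{"userId":"2","systemId":"31","title":"other"}',
])

rows = parse_first_rows(sample, 1)
print(rows[0]["title"])
```

With a real DataFrame the call would look something like parse_first_rows(result.toJSON().toLocalIterator(), 100). If a fixed number of rows is all you need, result.toJSON().take(100) is another way to avoid collecting everything.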

EDIT2: I can't emphasize this enough: always read the docs.

EDIT3: Look here.

Answer 2

This worked for me:

import json

df_json = df.toJSON()

for row in df_json.collect():
    # row is a JSON string
    print(row)

    # parse it into a dict
    line = json.loads(row)
    print(line[some_key])  # some_key: whichever field name you need

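Keep in mind that using .collect() is not advisable, because it pulls the whole distributed DataFrame onto the driver, which defeats the purpose of using DataFrames.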

Convert a DataFrame to JSON (in PySpark) and then select the desired fields