How to read an in-memory JSON string into a Spark DataFrame

Date: 2016-09-21 14:44:18

Tags: json scala apache-spark spark-dataframe

I'm trying to read an in-memory JSON string into a Spark DataFrame on the fly:

var someJSON : String = getJSONSomehow()
val someDF : DataFrame = magic.convert(someJSON)

I've spent a lot of time looking at the Spark API, and the best I could find is to write the JSON string out to a temporary file and then load it back with spark.read.json.

But this feels awkward/wonky and imposes the following constraints:

  1. It requires me to format my JSON as one object per line (per the documentation);
  2. It forces me to write the JSON to a temp file, which is slow and awkward; and
  3. It forces me to clean up temp files over time, which is cumbersome and feels "wrong" to me.

So I ask: is there a direct and more efficient way to convert a JSON string into a Spark DataFrame?
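The temp-file workaround described by the constraints above can be sketched roughly as follows. This is a hypothetical reconstruction, assuming a local SparkSession; the object name, temp-file prefix, and sample JSON are illustrative, not from the original post:

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch of the temp-file approach the question complains about:
// write the JSON string to a temporary file, then point spark.read.json at it.
object TempFileWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-tempfile-workaround")
      .master("local[*]")
      .getOrCreate()

    val someJSON = """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}"""

    // One JSON object per line, as older Spark versions require.
    val tmp = Files.createTempFile("spark-json-", ".json")
    Files.write(tmp, someJSON.getBytes("UTF-8"))

    val someDF: DataFrame = spark.read.json(tmp.toString)
    someDF.show()

    // The cleanup burden noted in constraint 3.
    Files.deleteIfExists(tmp)
    spark.stop()
  }
}
```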

1 Answer:

Answer 0 (score: 8)

From the Spark SQL guide:

val otherPeopleRDD = spark.sparkContext.makeRDD(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()

This creates the DataFrame from an intermediate RDD, built by passing in the String directly, so no temp file is needed.
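As a note on newer versions: since Spark 2.2, `spark.read.json` also accepts a `Dataset[String]` directly (the RDD overload was deprecated), so the intermediate RDD can be avoided too. A minimal sketch, assuming a local SparkSession; the object name and sample JSON are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object JsonStringToDf {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-string-to-df")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val someJSON = """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}"""

    // Spark 2.2+: read.json takes a Dataset[String], so the in-memory
    // string becomes a DataFrame without an RDD or a temp file.
    val someDF = spark.read.json(Seq(someJSON).toDS())
    someDF.show()

    spark.stop()
  }
}
```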