How do I join two data files in HDFS using Spark?

Date: 2016-11-23 23:45:58

Tags: apache-spark hdfs

I have two datasets that were partitioned with the same partitioner and stored in HDFS. These datasets are the output of two different Spark jobs that we have no control over. Now I want to join the two datasets to derive further information.
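
Conceptually, I want something like this (a sketch only; the paths, the Parquet format, and the id join key are placeholders):

// Sketch: paths, file format, and the "id" join key are assumptions
val ds1 = spark.read.parquet("hdfs:///jobs/job1/output")
val ds2 = spark.read.parquet("hdfs:///jobs/job2/output")

// Join on the common key; since both sides were written with the same
// partitioner, the hope is to avoid a full shuffle here
val joined = ds1.join(ds2, Seq("id"))
joined.show()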

1 answer:

Answer 0 (score: 0)

You can try creating two DataFrames and joining them with SQL. The code below parses both tab-separated files into DataFrames, registers them as temporary views, and joins them on the shared keys.

// For implicit conversions from RDDs to DataFrames
import spark.implicits._

// Schema shared by both files (tab-separated columns)
case class struc_dataset(ORDER_ID: String, CUSTOMER_ID: String, ITEMS: String)

// Read file1 and map each tab-separated line to a struc_dataset row
val File1DF = spark.sparkContext
   .textFile("temp/src/file1.txt")
   .map(_.split("\t"))
   .map(attributes => struc_dataset(attributes(0), attributes(1), attributes(2)))
   .toDF()

// Register as temp view - Dataset1
File1DF.createOrReplaceTempView("Dataset1")

// Read file2 the same way
val File2DF = spark.sparkContext
   .textFile("temp/src/file2.txt")
   .map(_.split("\t"))
   .map(attributes => struc_dataset(attributes(0), attributes(1), attributes(2)))
   .toDF()

// Register as temp view - Dataset2
File2DF.createOrReplaceTempView("Dataset2")

// SQL join on the shared keys to build the final DataFrame
val finalDF = spark.sql(
  """SELECT *
     FROM Dataset1 ds1
     JOIN Dataset2 ds2
       ON ds1.ORDER_ID = ds2.ORDER_ID
      AND ds1.CUSTOMER_ID = ds2.CUSTOMER_ID""")

finalDF.show()
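
Note that SELECT * keeps both copies of the join columns. If you want a single copy of each key, the equivalent DataFrame-API join takes a sequence of column names (a sketch reusing File1DF/File2DF from above; the ITEMS_2 rename is just to disambiguate the two ITEMS columns):

// Equivalent join via the DataFrame API; Seq(...) deduplicates the key columns
val finalDF2 = File1DF.join(
  File2DF.withColumnRenamed("ITEMS", "ITEMS_2"),  // avoid two columns named ITEMS
  Seq("ORDER_ID", "CUSTOMER_ID"))

finalDF2.show()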