我尝试使用spark将数据从Oracle数据库读入数据集,然后将数据集转换为javaRDD进行地图操作,我的代码只能存储数据集Spark官方文档显示:http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);
我们的数据从Oracle读取,如何定义Person.class,存储以将rdd转换为数据集?或者如何直接使用Java对数据集进行映射操作? 当我的代码是这样的我应该做什么?
import java.math.BigDecimal;
import java.util.Properties;
import java.util.Random;
import org.apache.commons.lang.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
public class FlatMapTest {
public static void main(String[] args) {
SparkConf conf = new SparkConf();
conf.set("spark.sql.warehouse.dir", "./spark-warehouse");
SparkSession spark = SparkSession.builder().master("local[3]").config(conf)
.appName("Java Spark SQL data sources example").getOrCreate();
jdbcDataSource(spark);
}
public static void jdbcDataSource(SparkSession spark) {
// 连接到数据库,获得DF对象,DF对象封装了数据库的表信息数据
Dataset<Row> jdbcDF = spark.read().format("jdbc")
.option("driver", "oracle.jdbc.driver.OracleDriver")
.option("url", "jdbc:oracle:thin:@192.168.101.207:1521:orcl")
.option("dbtable", "datamask")
.option("user", "scott")
.option("password", "tiger").load();
/*
* 创建临时表 datamask
*/
jdbcDF.createOrReplaceTempView("datamask");
Dataset<Row> sqlDF = spark.sql("select * from datamask");
JavaRDD<Object> resultRDD = sqlDF.toJavaRDD().map(
new Function<Row, Object>() {
public String call(Row row) throws Exception {
Random ran = new Random();
int r = ran.nextInt(9001) + 1000;
/*
* 将每个字段的后4位替换为随机数,替换规则可以自己设定
*/
String userName = row.getAs("USER_NAME");
userName = StringUtils.replace(userName,
StringUtils.right(userName, 4), "" + r);
String loginName = row.getAs("LOGIN_NAME");// LOGIN_NAME为数据库字段名,下同
loginName = StringUtils.replace(loginName,
StringUtils.right(loginName, 4), "" + r);
String countyCode = row.getAs("COUNTY_CODE");
countyCode = StringUtils.replace(countyCode,
StringUtils.right(countyCode, 4), "" + r);
String passwd = row.getAs("PASSWORD");
passwd = StringUtils.replace(passwd,
StringUtils.right(passwd, 4), "" + r);
String areaId = row.getAs("AREA_ID");
areaId = StringUtils.replace(areaId,
StringUtils.right(areaId, 4), "" + r);
String cityNo = row.getAs("CITY_NO");
cityNo = StringUtils.replace(cityNo,
StringUtils.right(cityNo, 4), "" + r);
String cardID = row.getAs("CARD_ID");
cardID = StringUtils.replace(cardID,
StringUtils.right(cardID, 4), "" + r);
String mobile = row.getAs("MOBILE");
mobile = StringUtils.replace(mobile,
StringUtils.right(mobile, 4), "" + r);
String email = row.getAs("EMAIL");
email = StringUtils.replace(email,
StringUtils.right(email, 4), "" + r);
BigDecimal big = row.getAs("QQ");
String qq = big.toString();
qq = StringUtils.replace(qq, StringUtils.right(qq, 4),
"" + r);
String addr = row.getAs("ADDR");
addr = StringUtils.replace(addr,
StringUtils.right(addr, 4), "" + r);
String birthday = row.getAs("BIRTHDAY");
birthday = StringUtils.replace(birthday,
StringUtils.right(birthday, 4), "" + r);
String birthday1 = row.getAs("BIRTHDAY1");
birthday1 = StringUtils.replace(birthday1,
StringUtils.right(birthday1, 4), "" + r);
String codeId = row.getAs("CODE_ID");
codeId = StringUtils.replace(codeId,
StringUtils.right(codeId, 4), "" + r);
String deptNo = row.getAs("DEPT_NO");
deptNo = StringUtils.replace(deptNo,
StringUtils.right(deptNo, 4), "" + r);
String newCode = row.getAs("NEW_CODE");
newCode = StringUtils.replace(newCode,
StringUtils.right(newCode, 4), "" + r);
String oldCode = row.getAs("OLD_CODE");
oldCode = StringUtils.replace(oldCode,
StringUtils.right(oldCode, 4), "" + r);
return loginName + "," + countyCode + "," + passwd
+ "," + areaId + "," + cityNo + "," + cardID
+ "," + mobile + "," + email + "," + qq + ","
+ addr + "," + birthday + "," + "," + birthday1
+ "," + codeId + "," + deptNo + "," + newCode
+ "," + oldCode;
}
});
Dataset<Row> peopleDF = spark.createDataFrame(resultRDD, Object.class);
String url2 = "jdbc:oracle:thin:@192.168.101.207:1521:orcl";
Properties connectionProperties2 = new Properties();
connectionProperties2.setProperty("user", "scott");// 设置用户名
connectionProperties2.setProperty("password", "tiger");// 设置密码
String table2 = "masked1";
peopleDF.write().mode(SaveMode.Append)
.jdbc(url2, table2, connectionProperties2);
}
}
答案 0 :(得分:0)
使用编码器/模式将输入数据帧转换为datset后,为了进行地图操作,您无需将数据集转换回RDD,您可以自己在数据集上执行地图,以下代码可能对您有用
Dataset<InputType> inputDataSet =someDataFrame.as(Encoders.bean(InputType.class))
Dataset<OutputType> outputDataSet=inputDataSet.map(new MapFunction<InputType, OutputType>() {
@Override
public OutputType call(InputType value) throws Exception {
// TODO Auto-generated method stub
val someOutPutVal:OutputType=null;
return someOutPutVal;
}
}, Encoders.bean(OutputType.class));
答案 1 :(得分:0)
如果你真的想要使用Person.class来跟踪转换,那么你需要定义一个pojo:创建一个名为Person的新java类,只有getters和setter,hashcode和equals,toString()方法用于字段您想要包含在Person对象中。就这么简单,那么你可以申请spark.createDataFrame(peopleRDD, Person.class);
或者:当您获得所有字段时,将其直接打包到一个Person中。
public String call(Row row) {}
...直接在这里创建一个Person对象:
public Person call(Row row) {}
那么你将直接拥有JavaRDD<Person> resultRDD
..