I'm trying to read a CSV file in Spark. I want to split each comma-separated line so that I end up with an RDD holding a two-dimensional array. I'm new to Spark.
This is what I tried:
public class SimpleApp
{
    public static void main(String[] args) throws Exception
    {
        String master = "local[2]";
        String csvInput = "/home/userName/Downloads/countrylist.csv";
        String csvOutput = "/home/userName/Downloads/countrylist";
        JavaSparkContext sc = new JavaSparkContext(master, "loadwholecsv",
                System.getenv("SPARK_HOME"), System.getenv("JARS"));

        JavaRDD<String> csvData = sc.textFile(csvInput, 1);
        JavaRDD<String> words = csvData.map(new Function<List<String>>() { // line 43
            @Override
            public List<String> call(String s) {
                return Arrays.asList(s.split("\\s*,\\s*"));
            }
        });

        words.saveAsTextFile(csvOutput);
    }
}
This should split the lines and return an ArrayList, but I'm not sure about that. I get this error:
SimpleApp.java:[43,58] wrong number of type arguments; required 2
Answer 0 (score: 8)
There are two small problems with the program. First, you probably want flatMap rather than map, since you are trying to return an RDD of words rather than an RDD of lists of words; flatMap flattens the result. Second, the function class also needs the input type it is called with. I replaced the JavaRDD<String> words line with the following:
JavaRDD<String> words = rdd.flatMap(
    new FlatMapFunction<String, String>() {
        public Iterable<String> call(String s) {
            return Arrays.asList(s.split("\\s*,\\s*"));
        }
    });
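Note that in Spark 2.x and later, FlatMapFunction.call returns an Iterator rather than an Iterable, and the anonymous class can be replaced with a Java 8 lambda. A minimal sketch under that assumption, reusing the csvData RDD from the question:

// Spark 2.x style: flatMap takes a function whose result is an Iterator<String>.
JavaRDD<String> words = csvData.flatMap(
        s -> Arrays.asList(s.split("\\s*,\\s*")).iterator());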
Answer 1 (score: 0)
This is what you should do...
// ====== Using flatMap (RDD of words) ==============
JavaRDD<String> csvData = spark.textFile(GlobalConstants.STR_INPUT_FILE_PATH, 1);
JavaRDD<String> counts = csvData.flatMap(new FlatMapFunction<String, String>() { // line 43
    @Override
    public Iterable<String> call(String s) {
        return Arrays.asList(s.split("\\s*,\\s*"));
    }
});

// ====== Using map (RDD of lists of words) ==============
JavaRDD<String> csvData = spark.textFile(GlobalConstants.STR_INPUT_FILE_PATH, 1);
JavaRDD<List<String>> counts = csvData.map(new Function<String, List<String>>() { // line 43
    @Override
    public List<String> call(String s) {
        return Arrays.asList(s.split("\\s*,\\s*"));
    }
});

// =====================================
counts.saveAsTextFile(GlobalConstants.STR_OUTPUT_FILE_PATH);
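If the goal is a two-dimensional structure on the driver, the RDD of lists produced by the map variant can simply be collected. A minimal sketch, assuming the map variant above and data small enough to fit in driver memory:

// collect() brings the RDD back to the driver as a List<List<String>>,
// i.e. rows of columns — effectively the "2D array" asked for.
List<List<String>> rows = counts.collect();
String firstCell = rows.get(0).get(0);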
Answer 2 (score: 0)
Here is a Java code sample based on the tutorial at https://opencredo.com/data-analytics-using-cassandra-and-spark/.
Scala code:
val includedStatuses = Set("COMPLETED", "REPAID")
val now = new Date()
sc.cassandraTable("cc", "cc_transactions")
  .select("customerid", "amount", "card", "status", "id")
  .where("id < minTimeuuid(?)", now)
  .filter(includedStatuses contains _.getString("status"))
  .keyBy(row => (row.getString("customerid"), row.getString("card")))
  .map { case (key, value) => (key, value.getInt("amount")) }
  .reduceByKey(_ + _)
  .map { case ((customerid, card), balance) => (customerid, card, balance, now) }
  .saveToCassandra("cc", "cc_balance", SomeColumns("customerid", "card", "balance", "updated_at"))
Java code:
SparkContextJavaFunctions functions = CassandraJavaUtil.javaFunctions(ProjectPropertie.context);
JavaRDD<Balance> balances = functions.cassandraTable(ProjectPropertie.KEY_SPACE, Transaction.TABLE_NAME)
        .select("customerid", "amount", "card", "status", "id")
        .where("id < minTimeuuid(?)", date)
        .filter(row -> row.getString("status").equals("COMPLETED"))
        .keyBy(row -> new Tuple2<>(row.getString("customerid"), row.getString("card")))
        .mapToPair(row -> new Tuple2<>(row._1, row._2.getInt("amount")))
        .reduceByKey((i1, i2) -> i1.intValue() + i2.intValue())
        .flatMap(new FlatMapFunction<Tuple2<Tuple2<String, String>, Integer>, Balance>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Iterator<Balance> call(Tuple2<Tuple2<String, String>, Integer> r) throws Exception {
                List<Balance> list = new ArrayList<Balance>();
                list.add(new Balance(r._1._1, r._1._2, r._2, reportDate));
                return list.iterator();
            }
        }).cache();
Here, ProjectPropertie.context is the SparkContext.
This is how to get the SparkContext (only one context should be used per JVM):
SparkConf conf = new SparkConf(true)
        .setAppName("App_name")
        .setMaster("local[2]")
        .set("spark.executor.memory", "1g")
        .set("spark.cassandra.connection.host", "127.0.0.1,172.17.0.2")
        .set("spark.cassandra.connection.port", "9042")
        .set("spark.cassandra.auth.username", "cassandra")
        .set("spark.cassandra.auth.password", "cassandra");
SparkContext context = new SparkContext(conf);
For the data source I'm using Cassandra, where 172.17.0.2 is the docker container running my Cassandra node and 127.0.0.1 is the host (local in this case).
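The ProjectPropertie class referenced above isn't shown in the answer. A minimal sketch of such a holder, assuming it only exposes the shared context and the constants used in the snippet (the field values here are guesses based on the Scala example, not the tutorial's actual code):

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;

// Hypothetical holder for the shared SparkContext and Cassandra constants.
// Only ProjectPropertie.context and ProjectPropertie.KEY_SPACE appear in the
// answer above; everything else is an assumption for illustration.
public final class ProjectPropertie {
    public static final String KEY_SPACE = "cc";

    public static final SparkContext context = new SparkContext(
            new SparkConf(true)
                    .setAppName("App_name")
                    .setMaster("local[2]")
                    .set("spark.cassandra.connection.host", "127.0.0.1,172.17.0.2")
                    .set("spark.cassandra.connection.port", "9042"));

    private ProjectPropertie() { }
}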