I am selecting values from a Cassandra table and storing them in a Dataset, like this:
Dataset<Row> query = spark.sql("select url,sourceip,destinationip from traffic_data");
List<Row> rows = query.collectAsList();

Now I have a POJO class GroupClass with the fields url, sourceip, and destinationip. Can I cast this List<Row> directly to a List<GroupClass>?
Answer 0 (score: 0)
Technically you can, but it will throw a ClassCastException at runtime. The best practice in this case is to use a copy constructor.
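One way to read that advice is to construct each GroupClass explicitly from the Row's fields rather than casting. A minimal Java sketch of this, assuming GroupClass has a GroupClass(String url, String sourceip, String destinationip) constructor (the constructor signature is an assumption, not shown in the original post):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import java.util.ArrayList;
import java.util.List;

// Sketch: build each GroupClass from the Row's fields instead of casting.
// Assumes a GroupClass(String, String, String) constructor.
Dataset<Row> query = spark.sql("select url,sourceip,destinationip from traffic_data");
List<GroupClass> groups = new ArrayList<>();
for (Row row : query.collectAsList()) {
    groups.add(new GroupClass(
            row.getString(0),   // url
            row.getString(1),   // sourceip
            row.getString(2))); // destinationip
}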
Answer 1 (score: 0)
I come from Scala, but I believe there is a similar way in Java. A possible solution is the following:
val query = spark.sql("select url,sourceip,destinationip from traffic_data").as[GroupClass]

Now query has type Dataset[GroupClass], so calling the collectAsList() method returns a List[GroupClass]:

val list = query.collectAsList();
Another solution (I think you would have to use streams to do the same thing in Java; see the Java sketch below) is to map the list of Row into GroupClass, like this:

import org.apache.spark.sql.Row
import scala.collection.JavaConverters._

val query = spark.sql("select url,sourceip,destinationip from traffic_data")
val list = query.collectAsList().asScala  // convert java.util.List to a Scala collection
val mappedList = list.map {
  case Row(url: String, sourceip: String, destinationip: String) =>
    GroupClass(url, sourceip, destinationip)
}
I assumed that all the attributes of GroupClass (url, sourceip, destinationip) are of type String, and that you have a constructor GroupClass(url: String, sourceip: String, destinationip: String), e.g. a case class.
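For the Java side, here is a minimal sketch of the streams variant mentioned above, again assuming the same three-argument GroupClass constructor (an assumption, since the original post does not show the class):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import java.util.List;
import java.util.stream.Collectors;

// Streams sketch: map each Row to a GroupClass.
// Assumes GroupClass(String, String, String); adjust to your actual class.
Dataset<Row> query = spark.sql("select url,sourceip,destinationip from traffic_data");
List<GroupClass> mappedList = query.collectAsList().stream()
        .map(row -> new GroupClass(
                row.getString(0),   // url
                row.getString(1),   // sourceip
                row.getString(2)))  // destinationip
        .collect(Collectors.toList());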
Hope that helps.
Answer 2 (score: 0)
You should use an Encoder:
Dataset<University> schools = context
    .read()
    .json("/schools.json")
    .as(Encoders.bean(University.class));
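Applied to the query from the question, this would look something like the following sketch (assuming GroupClass follows JavaBean conventions, i.e. a no-argument constructor plus getters and setters, which Encoders.bean requires):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import java.util.List;

// Sketch: decode the query result directly into GroupClass beans.
// Assumes GroupClass is a JavaBean (no-arg constructor, getters/setters).
Dataset<GroupClass> typed = spark.sql("select url,sourceip,destinationip from traffic_data")
        .as(Encoders.bean(GroupClass.class));
List<GroupClass> list = typed.collectAsList();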
More information can be found here: https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html or here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-Encoder.html