I am selecting values from a Cassandra table and storing them in a Dataset, like this:
Dataset<Row> query = spark.sql("select url,sourceip,destinationip from traffic_data");
List<Row> rows = query.collectAsList();

Now I have a POJO class GroupClass with the fields url, sourceip, and destinationip. Can I cast this List<Row> directly to a List<GroupClass>?
Answer 0 (score: 0)
Technically you can, but it will throw a ClassCastException at runtime. The best practice in this case is to use a copy constructor.
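One way to read that advice is to construct each GroupClass explicitly from the Row's fields rather than casting. A minimal Java sketch of this, assuming GroupClass has a GroupClass(String url, String sourceip, String destinationip) constructor (the constructor signature is an assumption, not shown in the original post):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import java.util.ArrayList;
import java.util.List;

// Sketch: build each GroupClass from the Row's fields instead of casting.
// Assumes a GroupClass(String, String, String) constructor.
Dataset<Row> query = spark.sql("select url,sourceip,destinationip from traffic_data");
List<GroupClass> groups = new ArrayList<>();
for (Row row : query.collectAsList()) {
    groups.add(new GroupClass(
            row.getString(0),   // url
            row.getString(1),   // sourceip
            row.getString(2))); // destinationip
}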
Answer 1 (score: 0)
I come from Scala, but I believe there is a similar way in Java. A possible solution is the following:
val query = spark.sql("select url,sourceip,destinationip from traffic_data").as[GroupClass]

Now query has type Dataset[GroupClass], so calling the collectAsList() method returns a List[GroupClass]:

val list = query.collectAsList();
Another solution (I think you would have to use streams to do the same thing in Java; see the Java sketch below) is to map the list of Row into GroupClass, like this:

import org.apache.spark.sql.Row
import scala.collection.JavaConverters._

val query = spark.sql("select url,sourceip,destinationip from traffic_data")
val list = query.collectAsList().asScala  // convert java.util.List to a Scala collection
val mappedList = list.map {
  case Row(url: String, sourceip: String, destinationip: String) =>
    GroupClass(url, sourceip, destinationip)
}
I assumed that all the attributes of GroupClass (url, sourceip, destinationip) are of type String, and that you have a constructor GroupClass(url: String, sourceip: String, destinationip: String), e.g. a case class.
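For the Java side, here is a minimal sketch of the streams variant mentioned above, again assuming the same three-argument GroupClass constructor (an assumption, since the original post does not show the class):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import java.util.List;
import java.util.stream.Collectors;

// Streams sketch: map each Row to a GroupClass.
// Assumes GroupClass(String, String, String); adjust to your actual class.
Dataset<Row> query = spark.sql("select url,sourceip,destinationip from traffic_data");
List<GroupClass> mappedList = query.collectAsList().stream()
        .map(row -> new GroupClass(
                row.getString(0),   // url
                row.getString(1),   // sourceip
                row.getString(2)))  // destinationip
        .collect(Collectors.toList());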
Hope that helps.
Answer 2 (score: 0)
You should use an Encoder:
Dataset<University> schools = context
    .read()
    .json("/schools.json")
    .as(Encoders.bean(University.class));
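Applied to the query from the question, this would look something like the following sketch (assuming GroupClass follows JavaBean conventions, i.e. a no-argument constructor plus getters and setters, which Encoders.bean requires):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import java.util.List;

// Sketch: decode the query result directly into GroupClass beans.
// Assumes GroupClass is a JavaBean (no-arg constructor, getters/setters).
Dataset<GroupClass> typed = spark.sql("select url,sourceip,destinationip from traffic_data")
        .as(Encoders.bean(GroupClass.class));
List<GroupClass> list = typed.collectAsList();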
More information can be found here: https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html or here: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-Encoder.html