Combining multiple datasets into a single dataset without using the unionAll function in Apache Spark SQL

Time: 2017-07-18 08:37:12

Tags: apache-spark apache-spark-sql

My datasets are as follows:

Dataset 1:

+----------+-------------------+---------+-----+------+
|      Time|            address|     Date|value|sample|
+----------+-------------------+---------+-----+------+
|8:00:00 AM| AAbbbbbbbbbbbbbbbb|12/9/2014|    1|     0|
|8:31:27 AM| AAbbbbbbbbbbbbbbbb|12/9/2014|    1|     0|
+----------+-------------------+---------+-----+------+

Dataset 2:

+-----------+-------------------+---------+------+-----+
|       Time|           Location|     Date|sample|value|
+-----------+-------------------+---------+------+-----+
| 8:45:00 AM| AAbbbbbbbbbbbbbbbb|12/9/2016|     5|    0|
| 9:15:00 AM| AAbbbbbbbbbbbbbbbb|12/9/2016|     5|    0|
+-----------+-------------------+---------+------+-----+

I am using the following unionAll() call to combine ds1 and ds2:

Dataset<Row> joined = dataset1.unionAll(dataset2).distinct();

Is there a better way to combine ds1 and ds2, given that the unionAll() function is deprecated in Spark 2.x?

1 Answer:

Answer 0 (score: 1)

You can use union() to merge the two dataframes/datasets:

df1.union(df2)

Note that in Spark 2.x, union() replaces the deprecated unionAll() and has the same UNION ALL semantics: it matches columns by position and does not remove duplicate rows by itself. Keep the .distinct() call from your original code if you need deduplication.
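For reference, here is a minimal Java sketch of that approach (assuming Spark 2.x, with dataset1 and dataset2 being the datasets from the question). Because union() resolves columns by position, and your two datasets list value and sample in different orders and name the second column differently, the sketch first renames and reorders Dataset 2's columns to match Dataset 1. Renaming Location to address is an illustrative assumption; adjust it to your real schema:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // union() matches columns by position, so align Dataset 2 with Dataset 1's
    // column order first. Renaming "Location" to "address" is an assumption
    // made for illustration.
    Dataset<Row> ds2Aligned = dataset2
            .withColumnRenamed("Location", "address")
            .select("Time", "address", "Date", "value", "sample");

    // union() keeps duplicates (UNION ALL semantics); distinct() removes them,
    // reproducing the behaviour of the deprecated unionAll(...).distinct().
    Dataset<Row> combined = dataset1
            .select("Time", "address", "Date", "value", "sample")
            .union(ds2Aligned)
            .distinct();

    combined.show();

On Spark 2.3 and later, unionByName() matches columns by name rather than position, which avoids the manual select(); the rename would still be needed here because the column names differ between the two datasets.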

Hope this helps!