Suppose I have a dataset that looks like this:
+--------------------+---------+------+--------------------+
|             transID|principal|subSeq|          subTransID|
+--------------------+---------+------+--------------------+
|2116e07b-14ea-476...|      bob|     4|ec463751-22ca-477...|
|3859a175-f16b-4fd...|      bob|     4|ec463751-22ca-477...|
|3859a175-f16b-4fd...|      bob|     7|2116e07b-14ea-476...|
+--------------------+---------+------+--------------------+
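I want to remove duplicate rows by grouping on the transID column and keeping the maximum subSeq in each group. However, instead of the max(subSeq) column, I want the result to show the subTransID column from the original dataset.
If I do this: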
dsJoin.groupBy("transID").agg(functions.max("subSeq")).show();
then I get:
+--------------------+-----------+
|             transID|max(subSeq)|
+--------------------+-----------+
|3859a175-f16b-4fd...|          7|
|2116e07b-14ea-476...|          4|
+--------------------+-----------+
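The duplicate row for transID 3859a175-f16b-4fd... with subSeq 4 was correctly dropped in favor of the row with the maximum value 7. But I want the subTransID column to appear in the resulting dataset!
I must be missing something obvious here.
I am doing this in Java. Thanks for any suggestions!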
Answer 0 (score: 0)
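In the agg expression, also use first to pick up the other fields. Note that first returns whichever value it encounters first in the group, which is not guaranteed to come from the row holding the maximum subSeq.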
Answer 1 (score: 0)
You should pack the relevant attributes into a struct, apply the aggregate function, and then unpack the struct again (Scala code below):
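import org.apache.spark.sql.functions.{max, struct}

// subSeq comes first in the struct, so it decides which row is the max
dsJoin.groupBy("transID")
  .agg(
    max(struct("subSeq", "subTransID")).as("max")
  )
  .select("transID", "max.*")
  .show()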
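Since the question asks for Java, the same call would presumably read `dsJoin.groupBy("transID").agg(functions.max(functions.struct("subSeq", "subTransID")).as("max")).select("transID", "max.*").show();` (untested here). The trick works because structs are compared field by field, so subSeq decides the maximum and subTransID simply rides along. The sketch below illustrates that ordering in plain Java without Spark. The class name `MaxStructSketch`, the `Row` type, and the short IDs `t1`, `t2`, `s1` are placeholders invented for the illustration, not values from the question.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

final class MaxStructSketch {

    /** One input row; the fields mirror the question's columns (hypothetical type). */
    static final class Row {
        final String transId;
        final int subSeq;
        final String subTransId;

        Row(String transId, int subSeq, String subTransId) {
            this.transId = transId;
            this.subSeq = subSeq;
            this.subTransId = subTransId;
        }
    }

    // Field-by-field ordering, like struct(subSeq, subTransID):
    // subSeq decides, subTransId only breaks ties.
    static final Comparator<Row> STRUCT_ORDER =
            Comparator.comparingInt((Row r) -> r.subSeq)
                      .thenComparing(r -> r.subTransId);

    /** Keeps, for every transId, the row that is largest under STRUCT_ORDER. */
    static Map<String, Row> maxPerTransId(List<Row> rows) {
        return rows.stream().collect(Collectors.toMap(
                r -> r.transId,                      // group key
                r -> r,                              // carry the whole row
                BinaryOperator.maxBy(STRUCT_ORDER),  // resolve duplicates
                TreeMap::new));                      // sorted keys for stable printing
    }

    /** Placeholder data with the same shape as the question's three rows. */
    static List<Row> sampleRows() {
        return Arrays.asList(
                new Row("t1", 4, "s1"),
                new Row("t2", 4, "s1"),
                new Row("t2", 7, "t1"));
    }

    public static void main(String[] args) {
        maxPerTransId(sampleRows()).forEach((id, r) ->
                System.out.println(id + " " + r.subSeq + " " + r.subTransId));
    }
}
```

For the placeholder data, the group `t2` keeps the row with subSeq 7 together with its subTransId `t1`, which is the behavior the question asks for. If several rows in a group share the maximum subSeq, the second struct field decides which one wins.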