I want to reshape data from 4x3 to 2x2 in PySpark without aggregation. My current output is as follows:
columns = ['FAULTY', 'value_HIGH', 'count']
vals = [
    (1, 0, 141),
    (0, 0, 140),
    (1, 1, 21),
    (0, 1, 12)
]

What I want is a contingency table where the second column becomes two new binary columns (value_HIGH_1 and value_HIGH_0) holding the values from the count column - meaning:

FAULTY | value_HIGH_1 | value_HIGH_0
     0 |           12 |          140
     1 |           21 |          141
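(For anyone who wants to reproduce this, a minimal sketch of how the input DataFrame could be built, assuming an active SparkSession named spark:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

columns = ['FAULTY', 'value_HIGH', 'count']
vals = [(1, 0, 141), (0, 0, 140), (1, 1, 21), (0, 1, 12)]

# the 4x3 input frame that both answers below start from
df = spark.createDataFrame(vals, columns)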
Answer 0 (score: 2):
You can use pivot with a dummy max aggregation (since each group has only one element):
import pyspark.sql.functions as F

# max() is a dummy aggregate here: each (FAULTY, value_HIGH) group has exactly one row
df.groupBy('FAULTY').pivot('value_HIGH').agg(F.max('count')).selectExpr(
    'FAULTY', '`1` as value_high_1', '`0` as value_high_0'
).show()
+------+------------+------------+
|FAULTY|value_high_1|value_high_0|
+------+------------+------------+
| 0| 12| 140|
| 1| 21| 141|
+------+------------+------------+
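Any aggregate that passes the single value through unchanged works as the placeholder. A sketch using F.first instead of F.max, also listing the pivot values explicitly so Spark can skip the extra pass that discovers them:

import pyspark.sql.functions as F

# first() is equivalent to max() here: every pivot cell holds exactly one value.
# Passing [0, 1] up front avoids a scan for the distinct values of value_HIGH.
df.groupBy('FAULTY').pivot('value_HIGH', [0, 1]).agg(F.first('count')).selectExpr(
    'FAULTY', '`1` as value_high_1', '`0` as value_high_0'
).show()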
Answer 1 (score: 1):
The natural way to do this is to use groupby and pivot, but if you want to avoid any aggregation you can use filter and join:
import pyspark.sql.functions as f

# keep the value_HIGH = 1 counts on the left, value_HIGH = 0 counts on the right,
# and line them up by FAULTY
df.where("value_HIGH = 1").select("FAULTY", f.col("count").alias("value_HIGH_1"))\
    .join(
        df.where("value_HIGH = 0").select("FAULTY", f.col("count").alias("value_HIGH_0")),
        on="FAULTY"
    )\
    .show()
#+------+------------+------------+
#|FAULTY|value_HIGH_1|value_HIGH_0|
#+------+------------+------------+
#| 0| 12| 140|
#| 1| 21| 141|
#+------+------------+------------+
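One caveat with the filter-and-join approach (my note, not part of the original answer): the inner join silently drops any FAULTY key that appears with only one of the two value_HIGH values. A sketch of keeping such rows with an outer join plus fillna:

import pyspark.sql.functions as f

high = df.where("value_HIGH = 1").select("FAULTY", f.col("count").alias("value_HIGH_1"))
low = df.where("value_HIGH = 0").select("FAULTY", f.col("count").alias("value_HIGH_0"))

# outer join keeps one-sided FAULTY keys; fillna(0) backfills the missing count
high.join(low, on="FAULTY", how="outer").fillna(0).show()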