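I have a dataframe with 4 columns, where the first column is the key and the fourth column is the value. Sometimes, however, a key can have variants; in that case the first column is empty, and the key and its variant are stored in the second and third columns respectively.
How can I transform my dataframe so that it has only 2 columns: key and value?
For example, suppose I'm building a dataframe of residences for my users and I receive the following table (reproduced by the code below):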
How can I convert it into a table with just two columns, name and town? The input table is built like this:
from pyspark.sql import Row
# sc and sqlContext are assumed to be the ones provided by the pyspark shell
# Each tuple is (single, partner1, partner2, town); unused name slots are empty strings
l = [("Joe", "", "", "London"),
     ("", "Alice", "Bob", "Paris"),
     ("Sarah", "", "", "New-York"),
     ("", "John", "Edmund", "Berlin")]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(single=x[0], partner1=x[1], partner2=x[2], town=x[3]))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.show()
+--------+--------+------+--------+
|partner1|partner2|single|    town|
+--------+--------+------+--------+
|        |        |   Joe|  London|
|   Alice|     Bob|      |   Paris|
|        |        | Sarah|New-York|
|    John|  Edmund|      |  Berlin|
+--------+--------+------+--------+
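Answer 0 (score: 1)
One approach I can think of is to join the names with concat_ws, split the joined string again with split, and then explode the result to get one row per name. Empty names are filtered out at the end.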
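from pyspark.sql import functions as func

# Join the three name columns, split them back into an array, then emit one row per name
schpeep = schemaPeople. \
    select('town', func.split(func.concat_ws('|', 'partner1', 'partner2', 'single'), r'\|').alias('people')). \
    withColumn('name', func.explode('people')). \
    drop('people'). \
    filter(func.col('name') != '')  # drop the empty strings left by unused columns
schpeep.show()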
+--------+------+
| town| name|
+--------+------+
| London| Joe|
| Paris| Alice|
| Paris| Bob|
|New-York| Sarah|
| Berlin| John|
| Berlin|Edmund|
+--------+------+
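To see what the concat_ws, split, explode chain does to each row, here is a minimal plain-Python analogue. It is not Spark code: the helper name flatten_row is made up for illustration, and the list operations only mimic the corresponding Spark functions on the sample data from the question.

```python
# Plain-Python analogue of concat_ws -> split -> explode -> filter (not Spark code).
# Tuples follow the question's layout: (single, partner1, partner2, town).
rows = [
    ("Joe", "", "", "London"),
    ("", "Alice", "Bob", "Paris"),
    ("Sarah", "", "", "New-York"),
    ("", "John", "Edmund", "Berlin"),
]


def flatten_row(single, partner1, partner2, town):
    """Hypothetical helper: return one (town, name) pair per non-empty name."""
    joined = "|".join([partner1, partner2, single])  # mimics concat_ws('|', ...)
    names = joined.split("|")                        # mimics split(..., r'\|')
    # mimics explode followed by the filter on empty names
    return [(town, name) for name in names if name != ""]


result = [pair for row in rows for pair in flatten_row(*row)]
for town, name in result:
    print(town, name)
```

The empty strings matter here: "||Joe" splits into two empty entries plus "Joe", which is why the final filter on empty names is needed.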
Answer 1 (score: 1)
You can simply use union:
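from pyspark.sql import functions as F

df = schemaPeople  # the dataframe from the question

# Select each name column separately as "name", drop empty names, then stack the results
result = df.select(
        F.col("partner1").alias("name"),
        F.col("town")
    ).where("name <> ''")\
    .union(
        df.select(
            F.col("partner2").alias("name"),
            F.col("town")
        ).where("name <> ''")
    )\
    .union(
        df.select(
            F.col("single").alias("name"),
            F.col("town")
        ).where("name <> ''")
    )
result.show()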
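The union approach produces the same name and town pairs but groups them by source column rather than by input row. The following plain-Python sketch mimics that on the sample data. It is not Spark code, the helper select_names is hypothetical, and Spark itself does not guarantee any particular row order without an explicit sort.

```python
# Plain-Python analogue of three select/where steps combined with union (not Spark code).
rows = [
    {"single": "Joe", "partner1": "", "partner2": "", "town": "London"},
    {"single": "", "partner1": "Alice", "partner2": "Bob", "town": "Paris"},
    {"single": "Sarah", "partner1": "", "partner2": "", "town": "New-York"},
    {"single": "", "partner1": "John", "partner2": "Edmund", "town": "Berlin"},
]


def select_names(records, column):
    """Hypothetical helper: mimics select(col.alias('name'), 'town').where("name <> ''")."""
    return [(r[column], r["town"]) for r in records if r[column] != ""]


# List concatenation keeps duplicates, as DataFrame.union does
result = (
    select_names(rows, "partner1")
    + select_names(rows, "partner2")
    + select_names(rows, "single")
)
for name, town in result:
    print(name, town)
```

Because each name column is scanned separately, this approach needs one select per column, whereas the explode approach handles all columns in a single pass.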