Splitting non-nested data in DataFrame columns

Time: 2019-12-02 13:51:04

Tags: python pyspark

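I have a DataFrame with 4 columns: the first column is the key and the fourth is the value. Sometimes, however, a key has variants; in that case the first column is empty, and the key and its variant are stored in the second and third columns respectively.

How can I convert my DataFrame so that it has only 2 columns: key and value?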

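For example, suppose I am building a DataFrame of where my users live, and I receive the table printed by `schemaPeople.show()` further down.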

How can I convert it into a DataFrame with one name and one town per row? The input is built as follows:

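from pyspark.sql import Row

# `sc` and `sqlContext` are the SparkContext and SQLContext provided by the pyspark shell.
l = [("Joe", "", "", "London"),
     ("", "Alice", "Bob", "Paris"),
     ("Sarah", "", "", "New-York"),
     ("", "John", "Edmund", "Berlin")]
rdd = sc.parallelize(l)
# Named Row fields are sorted alphabetically in Spark 2.x, which gives the column order shown below.
people = rdd.map(lambda x: Row(single=x[0], partner1=x[1], partner2=x[2], town=x[3]))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.show()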

+--------+--------+------+--------+
|partner1|partner2|single|    town|
+--------+--------+------+--------+
|        |        |   Joe|  London|
|   Alice|     Bob|      |   Paris|
|        |        | Sarah|New-York|
|    John|  Edmund|      |  Berlin|
+--------+--------+------+--------+

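2 Answers:

Answer 0: (score: 1)

One approach I can think of is to join the name columns with concat_ws, split the result back into an array with split, and then explode that array to get one row per person.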

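from pyspark.sql import functions as func

# Join the three name columns with '|', split the string back into an array,
# explode the array into one row per name, then drop the empty names.
schpeep = schemaPeople. \
    select('town', func.split(func.concat_ws('|', 'partner1', 'partner2', 'single'), r'\|').alias('people')). \
    withColumn('name', func.explode('people')). \
    drop('people'). \
    filter(func.col('name') != '')

schpeep.show()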

+--------+------+
|    town|  name|
+--------+------+
|  London|   Joe|
|   Paris| Alice|
|   Paris|   Bob|
|New-York| Sarah|
|  Berlin|  John|
|  Berlin|Edmund|
+--------+------+
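
The final filter is needed because concat_ws skips nulls but keeps empty strings, so the joined value for Joe is `||Joe` and splitting it produces two empty entries. A minimal plain-Python sketch of the same steps, which models the logic only and does not use Spark (the helper name `explode_names` is hypothetical):

```python
# Plain-Python model of concat_ws -> split -> explode -> filter.
# The rows mirror the question's data; no Spark is involved.
rows = [
    {"partner1": "", "partner2": "", "single": "Joe", "town": "London"},
    {"partner1": "Alice", "partner2": "Bob", "single": "", "town": "Paris"},
    {"partner1": "", "partner2": "", "single": "Sarah", "town": "New-York"},
    {"partner1": "John", "partner2": "Edmund", "single": "", "town": "Berlin"},
]


def explode_names(row):
    """Return (town, name) pairs for one input row (hypothetical helper)."""
    # Like concat_ws('|', ...): empty strings are kept, e.g. "||Joe".
    joined = "|".join([row["partner1"], row["partner2"], row["single"]])
    # Like split: "||Joe" becomes ["", "", "Joe"].
    people = joined.split("|")
    # Like explode followed by the filter on name != ''.
    return [(row["town"], name) for name in people if name != ""]


result = [pair for row in rows for pair in explode_names(row)]
for town, name in result:
    print(town, name)
```

Without the `name != ""` check, every single user would contribute two extra rows with an empty name.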

Answer 1: (score: 1)

You can simply use union:

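from pyspark.sql import functions as F

# df is the question's DataFrame.
df = schemaPeople

# One select per name column, each filtered for non-empty names, then stacked.
df.select(
    F.col("partner1").alias("name"),
    F.col("town")
).where("name <> ''")\
.union(
    df.select(
        F.col("partner2").alias("name"),
        F.col("town")
    ).where("name <> ''")
)\
.union(
    df.select(
        F.col("single").alias("name"),
        F.col("town")
    ).where("name <> ''")
)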
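
The union approach makes one pass per name column and stacks the results. The sketch below models that in plain Python rather than Spark, assuming the question's data; `name_columns` and `union_names` are hypothetical names used only for illustration.

```python
# Plain-Python model of the three-way union; no Spark is involved.
rows = [
    {"partner1": "", "partner2": "", "single": "Joe", "town": "London"},
    {"partner1": "Alice", "partner2": "Bob", "single": "", "town": "Paris"},
    {"partner1": "", "partner2": "", "single": "Sarah", "town": "New-York"},
    {"partner1": "John", "partner2": "Edmund", "single": "", "town": "Berlin"},
]

# Hypothetical list of the columns that may hold a name.
name_columns = ["partner1", "partner2", "single"]


def union_names(rows, name_columns):
    """Stack one filtered (name, town) selection per name column."""
    stacked = []
    for col in name_columns:
        # Equivalent of select(col as name, town).where("name <> ''").
        stacked.extend((row[col], row["town"]) for row in rows if row[col] != "")
    return stacked


result = union_names(rows, name_columns)
for name, town in result:
    print(name, town)
```

In this model the rows come out grouped by source column rather than by town, so a sort on town may be worth adding if the order matters.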