I have a DataFrame like this:
+-----+-------+------------+---+---+----+------+--------------------+
|CHROM| POS| ID|REF|ALT|QUAL|FILTER| INFO|
+-----+-------+------------+---+---+----+------+--------------------+
| 1|1014143| rs786201005| C| T| .| .|RS=786201005;RSPO...|
| 1|1014228| rs1921| G|A,C| .| .|RS=1921;RSPOS=101...|
| 1|1014316| rs672601345| C| CG| .| .|RS=672601345;RSPO...|
| 1|1014359| rs672601312| G| T| .| .|RS=672601312;RSPO...|
| 1|1020183| rs539283387| G| C| .| .|RS=539283387;RSPO...|
| 1|1020216| rs764659938| C| G| .| .|RS=764659938;RSPO...|
| 1|1020217| rs115173026| G| T| .| .|RS=115173026;RSPO...|
| 1|1020221|rs1057523287| C| T| .| .|RS=1057523287;RSP...|
| 1|1020239| rs201073369| G|A,C| .| .|RS=201073369;RSPO...|
| 1|1022188| rs115704555| A| G| .| .|RS=115704555;RSPO...|
+-----+-------+------------+---+---+----+------+--------------------+
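My INFO column contains multiple values separated by ';', each of the form 'column_name=value'. I would like to split the INFO column into multiple columns according to those values, like this: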
Pre_Col| Info | RS | RSPOS |dbSNPBuildID| SSR |...|
-------+--------------------+------------+-------+------------+-----+---+
... |RS=786201005;RSPO...| 786201005 |1012143| 144 | 0 |...|
... |RS=115173026;RSPO...| 115173026 |9043523| 123 | 2 |...|
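The INFO column can contain a varying set of keys. RS may be absent from some rows, and the same is true of the other keys. In that case, I want the RS value to be null. I am trying to do this with a map.
After receiving a suggestion, I edited my code and got the following result: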
+-----+-------+------------+---+---+----+------+--------------------+-----+
|CHROM| POS| ID|REF|ALT|QUAL|FILTER| INFO| kvs|
+-----+-------+------------+---+---+----+------+--------------------+-----+
| 1|1014143| rs786201005| C| T| .| .|RS=786201005;RSPO...|Map()|
| 1|1014228| rs1921| G|A,C| .| .|RS=1921;RSPOS=101...|Map()|
| 1|1014316| rs672601345| C| CG| .| .|RS=672601345;RSPO...|Map()|
| 1|1014359| rs672601312| G| T| .| .|RS=672601312;RSPO...|Map()|
| 1|1020183| rs539283387| G| C| .| .|RS=539283387;RSPO...|Map()|
| 1|1020216| rs764659938| C| G| .| .|RS=764659938;RSPO...|Map()|
| 1|1020217| rs115173026| G| T| .| .|RS=115173026;RSPO...|Map()|
| 1|1020221|rs1057523287| C| T| .| .|RS=1057523287;RSP...|Map()|
| 1|1020239| rs201073369| G|A,C| .| .|RS=201073369;RSPO...|Map()|
| 1|1022188| rs115704555| A| G| .| .|RS=115704555;RSPO...|Map()|
+-----+-------+------------+---+---+----+------+--------------------+-----+
My schema is:
root
|-- CHROM: string (nullable = true)
|-- POS: string (nullable = true)
|-- ID: string (nullable = true)
|-- REF: string (nullable = true)
|-- ALT: string (nullable = true)
|-- QUAL: string (nullable = true)
|-- FILTER: string (nullable = true)
|-- INFO: string (nullable = true)
|-- kvs: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Can I split these map values further into columns?
Any help would be appreciated.
Answer 0 (score: 1)
Adapting the answer from "PySpark converting a column of type 'map' to multiple columns in a dataframe":
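from pyspark.sql.functions import col, udf, explode

@udf("map<string,string>")
def to_map(s):
    # Returns None (null) when INFO is null or empty.
    if s:
        kvs = [x.split("=") for x in s.split(";")]
        # Keep only well-formed key=value pairs: test each pair (kv), not the whole list.
        return {kv[0]: kv[1] for kv in kvs if len(kv) == 2}

with_map = df.withColumn("kvs", to_map("INFO"))

# Collect the distinct keys that occur in any row.
keys = (with_map
    .select(explode("kvs"))
    .select("key")
    .distinct()
    .rdd.flatMap(lambda x: x)
    .collect())

# One column per key, alongside the original columns.
with_map.select(*["*"] + [col("kvs").getItem(k).alias(k) for k in keys])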
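The last two steps (collecting the distinct keys, then selecting one column per key) can be illustrated without Spark. The following is a minimal pure-Python sketch of that idea, not the Spark code itself; the sample maps and the helper name `expand_rows` are hypothetical. A key that is missing from a row yields `None`, which is the pure-Python analogue of the null the question asks for.

```python
def expand_rows(maps):
    """Turn a list of dicts into (sorted keys, rows), using None for missing keys."""
    # Union of all keys seen in any row (the analogue of explode + distinct).
    keys = sorted({k for m in maps if m for k in m})
    # One value per key for every row (the analogue of getItem per key).
    rows = [[(m or {}).get(k) for k in keys] for m in maps]
    return keys, rows

# Hypothetical parsed INFO maps; the second row has no RS entry.
maps = [
    {"RS": "786201005", "SSR": "0"},
    {"SSR": "2", "dbSNPBuildID": "123"},
]

keys, rows = expand_rows(maps)
print(keys)
for row in rows:
    print(row)
```

Sorting the keys is a choice made here only to get a stable column order; the Spark version returns keys in whatever order `collect()` produces.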
For older versions:
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

def to_map_(s):
    if s:
        kvs = [x.split("=") for x in s.split(";")]
        # Test each pair (kv), not the whole list.
        return {kv[0]: kv[1] for kv in kvs if len(kv) == 2}

to_map = udf(to_map_, MapType(StringType(), StringType()))
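The parsing logic inside the UDF can be checked in plain Python before wrapping it in `udf`. This is a minimal sketch under stated assumptions: the helper name `parse_info` and the sample INFO string are hypothetical. The filter has to test the length of each individual pair (`kv`). Testing the length of the whole list instead drops every entry unless the string happens to contain exactly two items, which would explain the empty `Map()` values shown in the question.

```python
def parse_info(s):
    """Split 'k1=v1;k2=v2' into a dict, skipping entries without exactly one '='."""
    if not s:
        return None
    pairs = [x.split("=") for x in s.split(";")]
    return {kv[0]: kv[1] for kv in pairs if len(kv) == 2}

# Hypothetical INFO string; 'ASP' is a bare flag with no '=' and is skipped.
sample = "RS=786201005;RSPOS=1014143;dbSNPBuildID=144;SSR=0;ASP"
parsed = parse_info(sample)
print(parsed)

# The faulty filter checks the list length (5 here), so nothing passes.
pairs = [x.split("=") for x in sample.split(";")]
faulty = {kv[0]: kv[1] for kv in pairs if len(pairs) == 2}
print(faulty)
```

Bare flags such as `ASP` are silently dropped by this approach; if they matter, they would need separate handling.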