Every day I receive files with ~2k columns, about 900 of which are "relationship" columns. For example:
data.id | name | AGE | data.rel.1 | data.rel.2 | data.rel.1.type | data.rel.2.type
12 | JOE | 25 | ASDF | QWER | order | order
23 | TIM | 20 | AAAA | SSSS | product | product
34 | BRAD | 32 | XXXX | null | order | null
11 | MATT | 23 | ASDF | QWER | agreement | agreement
I need to tidy the data and create an "id - rel - rel type" dataframe that contains only data.id, data.rel and data.rel.type:
data.id | data.rel | data.rel.type
12 | ASDF | order
12 | QWER | order
23 | AAAA | product
23 | SSSS | product
34 | XXXX | order
11 | ASDF | agreement
11 | QWER | agreement
The solution below seems to work for a single column, but I'm not sure how to fold the rel.type columns into the same logic:
import re
import pyspark.sql.functions as F

pattern = '/*rel/*'

def explode(row, pattern):
    for c in row.asDict():
        if re.search(pattern, c):
            yield (row['data_id'], row[c])

(df.rdd.flatMap(lambda r: explode(r, pattern))
   .toDF(['data_id', 'data_rel'])
   .filter(F.col('data_rel').isNotNull())
   .show())
Any ideas?
Answer 0 (score: 3)
Here is a solution:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(12, 'JOE', 25, 'ASDF', 'QWER', 'ZXCV'),
     (23, 'TIM', 20, 'AAAA', 'SSSS', 'DDDD'),
     (34, 'BRAD', 32, 'XXXX', None, None),
     (11, 'MATT', 23, 'ASDF', 'QWER', None)],
    ['data_id', 'name', 'AGE', 'data_rel_1', 'data_rel_2', 'data_rel_3']
)

# Create an array of the columns you want
cols = F.array(
    *[F.col(c).alias(c) for c in ['data_rel_1', 'data_rel_2', 'data_rel_3']]
)

df.withColumn(
    'data_rel', cols
).select(
    'data_id', F.explode('data_rel').alias('data_rel')
).filter(
    F.col('data_rel').isNotNull()
).show()
The result is:
+-------+--------+
|data_id|data_rel|
+-------+--------+
| 12| ASDF|
| 12| QWER|
| 12| ZXCV|
| 23| AAAA|
| 23| SSSS|
| 23| DDDD|
| 34| XXXX|
| 11| ASDF|
| 11| QWER|
+-------+--------+
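The question also needs the matching data.rel.type for each value, which the snippet above drops. A minimal sketch of one way to extend the same array/explode idea, assuming (hypothetically) a DataFrame `df` whose columns mirror the question's layout with underscores, i.e. data_id, data_rel_1, data_rel_2 plus data_rel_1_type, data_rel_2_type (this is not the example `df` built above):

import pyspark.sql.functions as F

# Hypothetical column names, mirroring the question's layout.
rel_cols = ['data_rel_1', 'data_rel_2']

# Pack each value together with its type in a struct, so the pair survives the explode.
pairs = F.array(*[
    F.struct(F.col(c).alias('data_rel'), F.col(c + '_type').alias('data_rel_type'))
    for c in rel_cols
])

(df.withColumn('pair', F.explode(pairs))
   .select('data_id', 'pair.data_rel', 'pair.data_rel_type')
   .filter(F.col('data_rel').isNotNull())
   .show())

Keeping value and type in one struct avoids exploding the two column groups separately and having to join them back on data_id.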
EDIT: another solution using the rdd and explode that can take the pattern as a parameter (this probably won't throw an exception with more cols):
import pyspark.sql.functions as F

# takes pattern, and explodes those cols which match pattern
def explode(row, pattern):
    import re
    for c in row.asDict():
        if re.search(pattern, c):
            yield (row['data_id'], row[c])

df = spark.createDataFrame(
    [(12, 'JOE', 25, 'ASDF', 'QWER', 'ZXCV'),
     (23, 'TIM', 20, 'AAAA', 'SSSS', 'DDDD'),
     (34, 'BRAD', 32, 'XXXX', None, None),
     (11, 'MATT', 23, 'ASDF', 'QWER', None)],
    ['data_id', 'name', 'AGE', 'data_rel_1', 'data_rel_2', 'data_rel_3']
)

pattern = '/*rel/*'

df.rdd.flatMap(
    lambda r: explode(r, pattern)
).toDF(
    ['data_id', 'data_rel']
).filter(
    F.col('data_rel').isNotNull()
).show()
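The same generator can also emit the type column. A hypothetical sketch (not part of the original answer), assuming the real data has a matching '<col>_type' column for every rel column, e.g. data_rel_1 / data_rel_1_type, as in the question:

import re
import pyspark.sql.functions as F

# Each matched value column yields (id, value, type); type columns themselves are skipped.
# Assumes a '<col>_type' column exists for each rel column in the real data.
def explode_with_type(row, pattern):
    d = row.asDict()
    for c in d:
        if re.search(pattern, c) and not c.endswith('_type'):
            yield (d['data_id'], d[c], d.get(c + '_type'))

pattern = 'rel'  # plain substring pattern for this sketch

(df.rdd.flatMap(lambda r: explode_with_type(r, pattern))
   .toDF(['data_id', 'data_rel', 'data_rel_type'])
   .filter(F.col('data_rel').isNotNull())
   .show())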
Answer 1 (score: 1)
I don't know Python well enough to answer in it, so this is written in Scala; you can try translating it to Python. First select data.id and data.rel.1 as df1, similarly data.id and data.rel.2 as df2, and data.id and data.rel.3 as df3.
Now you have 3 dataframes; union them and you get the desired output. (A rough PySpark translation is sketched after the results below.)
import org.apache.spark.sql.SparkSession

/**
 * Created by Ram Ghadiyaram
 */
object DFUnionExample {

  def main(args: Array[String]) {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("DFUnionExample")
      .getOrCreate()
    import sparkSession.implicits._

    val basedf = Seq((12, "JOE", 25, "ASDF", "QWER", "ZXCV"),
      (23, "TIM", 20, "AAAA", "SSSS", "DDDD"),
      (34, "BRAD", 32, "XXXX", null, null),
      (11, "MATT", 23, "ASDF", "QWER", null)
    ).toDF("data.id", "name", "AGE", "data.rel.one", "data.rel.two", "data.rel.three")
    basedf.show

    import org.apache.spark.sql.functions._
    val df1 = basedf.select(col("`data.id`"), col("`data.rel.one`"))
    val df2 = basedf.select(col("`data.id`"), col("`data.rel.two`"))
    val df3 = basedf.select(col("`data.id`"), col("`data.rel.three`"))

    df1.union(df2).union(df3)
      .select(col("`data.id`"), col("`data.rel.one`").as("data.rel"))
      .filter(col("`data.rel`").isNotNull)
      .sort(col("`data.id`")).show
  }
}
Result:
+-------+----+---+------------+------------+--------------+
|data.id|name|AGE|data.rel.one|data.rel.two|data.rel.three|
+-------+----+---+------------+------------+--------------+
| 12| JOE| 25| ASDF| QWER| ZXCV|
| 23| TIM| 20| AAAA| SSSS| DDDD|
| 34|BRAD| 32| XXXX| null| null|
| 11|MATT| 23| ASDF| QWER| null|
+-------+----+---+------------+------------+--------------+
+-------+--------+
|data.id|data.rel|
+-------+--------+
| 11| QWER|
| 11| ASDF|
| 12| ASDF|
| 12| QWER|
| 12| ZXCV|
| 23| AAAA|
| 23| DDDD|
| 23| SSSS|
| 34| XXXX|
+-------+--------+
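Since the answer invites a translation, here is a rough, untested PySpark sketch of the same union approach, reusing the sample data above; the dotted column names are escaped with backticks, and the helper names (parts, result) are illustrative only:

import pyspark.sql.functions as F

basedf = spark.createDataFrame(
    [(12, 'JOE', 25, 'ASDF', 'QWER', 'ZXCV'),
     (23, 'TIM', 20, 'AAAA', 'SSSS', 'DDDD'),
     (34, 'BRAD', 32, 'XXXX', None, None),
     (11, 'MATT', 23, 'ASDF', 'QWER', None)],
    ['data.id', 'name', 'AGE', 'data.rel.one', 'data.rel.two', 'data.rel.three']
)

# One two-column slice per rel column, each renamed to the common name 'data.rel'.
parts = [
    basedf.select(F.col('`data.id`'), F.col('`{}`'.format(c)).alias('data.rel'))
    for c in ['data.rel.one', 'data.rel.two', 'data.rel.three']
]

# Stack the slices and drop the null rels, as in the Scala version.
result = parts[0].union(parts[1]).union(parts[2])

(result.filter(F.col('`data.rel`').isNotNull())
       .sort(F.col('`data.id`'))
       .show())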