How to split a string column of a DataFrame into multiple columns in Spark Structured Streaming

Date: 2020-04-20 14:10:58

Tags: pyspark

Here is my current code:

from pyspark.sql import SparkSession

spark_session = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()

lines = spark_session\
    .readStream\
    .format("socket")\
    .option("host", "127.0.0.1")\
    .option("port", 9998)\
    .load()

The 'lines' DataFrame looks like this:
+-------------+
|    value    |
+-------------+
|     a,b,c   |
+-------------+

But I want it to look like this:
+---+---+---+
| a | b | c |
+---+---+---+

I tried using the `split()` method, but it didn't work. It only splits each string into a list inside a single column, not into multiple columns.

What should I do?

3 answers:

Answer 0 (score: 1):

If you have a varying number of delimiters rather than exactly 3 per row, you can use the following:

Input:

+-------+
|value  |
+-------+
|a,b,c  |
|d,e,f,g|
+-------+

Solution:

import pyspark.sql.functions as F

# Strip every non-comma character, take the length (= the comma count per row),
# then take the maximum across the DataFrame.
max_size = df.select(F.max(F.length(F.regexp_replace('value','[^,]','')))).first()[0]

# max_size commas means max_size + 1 fields; create one column per field.
out = df.select([F.split("value",',')[x].alias(f"Col{x+1}") for x in range(max_size+1)])

Output:

out.show()

+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|   a|   b|   c|null|
|   d|   e|   f|   g|
+----+----+----+----+
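The `regexp_replace`/`length` trick above is just counting delimiters per row. The same idea in plain Python (a stand-in sketch, not Spark code; `rows` is hypothetical sample data mirroring the input table):

```python
import re

rows = ["a,b,c", "d,e,f,g"]

# Strip every non-comma character; the remaining length is the comma count.
# This mirrors F.length(F.regexp_replace('value', '[^,]', '')).
max_size = max(len(re.sub(r"[^,]", "", row)) for row in rows)
print(max_size)  # 3 -> the widest row needs max_size + 1 = 4 columns

# Splitting and padding each row to max_size + 1 fields mimics the select(),
# with None standing in for Spark's null.
table = [(row.split(",") + [None] * (max_size + 1))[: max_size + 1] for row in rows]
print(table)  # [['a', 'b', 'c', None], ['d', 'e', 'f', 'g']]
```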

Answer 1 (score: 0):

Split the value column, then create new columns by accessing the array index, (or) the element_at() function (from spark-2.4), (or) the getItem() function.

from pyspark.sql.functions import *

lines.withColumn("tmp",split(col("value"),',')).\
withColumn("col1",col("tmp")[0]).\
withColumn("col2",col("tmp").getItem(1)).\
withColumn("col3",element_at(col("tmp"),3)).\
drop("tmp","value").\
show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|   a|   b|   c|
#+----+----+----+
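One detail worth noting in the answer above: `[...]` and `getItem()` are 0-based, while `element_at()` is 1-based, which is why `element_at(col("tmp"), 3)` fetches the third field. A plain-Python sketch of the equivalence (`arr` and the `element_at` helper are hypothetical stand-ins for the split array and `F.element_at`):

```python
arr = ["a", "b", "c"]

# getItem(i) and [i] use 0-based indexing ...
assert arr[0] == "a"   # col("tmp")[0]        -> col1
assert arr[1] == "b"   # col("tmp").getItem(1) -> col2

# ... while element_at(array, n) is 1-based (Spark >= 2.4).
def element_at(array, n):  # stand-in for pyspark.sql.functions.element_at
    return array[n - 1]

assert element_at(arr, 3) == "c"  # element_at(col("tmp"), 3) -> col3
print("indexing conventions verified")
```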

Answer 2 (score: 0):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark_session = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()

lines = spark_session\
    .readStream\
    .format("socket")\
    .option("host", "127.0.0.1")\
    .option("port", 9998)\
    .load()

split_col = f.split(lines['value'], ",")
df = lines.withColumn('col1', split_col.getItem(0))
df = df.withColumn('col2', split_col.getItem(1))
df = df.withColumn('col3', split_col.getItem(2))

# show() raises an error on a streaming DataFrame; write it to a sink
# instead, e.g. the console sink:
df.writeStream.format("console").start().awaitTermination()