PySpark - Split a string column and join some of its parts to form new columns

Time: 2018-05-07 21:00:58

Tags: apache-spark pyspark apache-spark-sql

I have a DataFrame in the following format:

id    text
1     Amy How are you today? Smile
2     Sam Not very well. Sad

I want to produce a new DataFrame in the following format:

id    Name    Content              Expression
1     Amy     How are you today?   Smile
2     Sam     Not very well.       Sad

To do this, I plan to split the text column first:

from pyspark.sql import functions as F

cols = F.split(df['text'], ' ')
df = df.withColumn('Name', cols.getItem(0))

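But how do I get Content and Expression? Can I use cols.getItem(-1) to get the last element of the text? And how can I join cols[1:-1] (the second element through the second-to-last element) to form the new column Content?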

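I checked the data, and it does not actually guarantee double quotes around the sentence. The only thing I can rely on is splitting on spaces.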

3 Answers:

Answer 0: (score: 2)

Given the input dataframe and its schema

+---+----------------------------+
|id |text                        |
+---+----------------------------+
|1  |Amy How are you today? Smile|
|2  |Sam Not very well. Sad      |
+---+----------------------------+
root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)

you can simply use the following udf function to meet your requirement

from pyspark.sql import functions as f
from pyspark.sql import types as t

@f.udf(t.StructType([t.StructField("Name", t.StringType(), True), t.StructField("Content", t.StringType(), True), t.StructField("Expression", t.StringType(), True)]))
def splitCols(array):
    return (array[0], ' '.join(array[1:len(array)-1]), array[len(array)-1])

df.withColumn('text', splitCols(f.split('text', ' ')))\
    .select(f.col('id'), f.col('text.*'))\
    .show(truncate=False)

which should give you

+---+----+------------------+----------+
|id |Name|Content           |Expression|
+---+----+------------------+----------+
|1  |Amy |How are you today?|Smile     |
|2  |Sam |Not very well.    |Sad       |
+---+----+------------------+----------+
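The slicing inside the udf can be checked outside Spark with plain Python. The sketch below is only an illustration: split_tokens is a hypothetical helper name, not part of the answer, and the handling of null or one-word text is an assumption about what one might want. As far as I can tell, the udf above would raise on a null text (it receives None) and would repeat the same word as both Name and Expression for a one-word text.

```python
from typing import List, Optional, Tuple

Parts = Tuple[Optional[str], Optional[str], Optional[str]]


def split_tokens(tokens: Optional[List[str]]) -> Parts:
    """Return (Name, Content, Expression) from space-separated tokens.

    Hypothetical helper: the edge-case handling below is an assumption,
    not something the original answer specifies.
    """
    if not tokens:
        # Null text (split yields null) or an empty list: nothing to extract.
        return (None, None, None)
    if len(tokens) == 1:
        # Only one word: treat it as the Name and leave the rest empty.
        return (tokens[0], None, None)
    # First token, everything in between joined back together, last token.
    return (tokens[0], " ".join(tokens[1:-1]), tokens[-1])


print(split_tokens("Amy How are you today? Smile".split(" ")))
print(split_tokens("Sam Not very well. Sad".split(" ")))
```

The body of splitCols could return split_tokens(array) to get the same three fields with these guards in place.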

Answer 1: (score: 1)

It is probably cleaner to have a UDF do this, but you can also solve it with Spark functions.

from pyspark.sql.functions import col, split, regexp_extract, reverse

df\
    .withColumn("Name", split(col("text"), " ").getItem(0))\
    .withColumn("Content", regexp_extract(col("text"), "[a-zA-Z0-9]+ (.*) [a-zA-Z0-9]+", 1)) \
    .withColumn("Expression", reverse(split(reverse(col("text")), " ").getItem(0))) \
    .show()


+---+--------------------+----+------------------+----------+
| id|                text|Name|           Content|Expression|
+---+--------------------+----+------------------+----------+
|  1|Amy How are you t...| Amy|How are you today?|     Smile|
|  2|Sam Not very well...| Sam|    Not very well.|       Sad|
+---+--------------------+----+------------------+----------+
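To see what the Content pattern captures, it can be tried with Python's re module, which should behave the same as Spark's Java regex engine for a pattern this simple. This is only a sketch: the third sample string and the anchored pattern are my own assumptions for illustration, not part of the answer. It suggests the original pattern truncates Content when the last word is not purely alphanumeric.

```python
import re

# Pattern from the answer: alphanumeric first and last words, greedy middle.
ORIGINAL = r"[a-zA-Z0-9]+ (.*) [a-zA-Z0-9]+"
# Hypothetical alternative: anchor both ends and accept any non-space word.
ANCHORED = r"^\S+ (.*) \S+$"


def content(pattern: str, text: str) -> str:
    """Mimic regexp_extract(text, pattern, 1): empty string when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else ""


samples = [
    "Amy How are you today? Smile",
    "Sam Not very well. Sad",
    "Amy How are you? :)",  # made-up row whose last word has no letters or digits
]

for text in samples:
    print(repr(content(ORIGINAL, text)), repr(content(ANCHORED, text)))
```

For the two rows from the question both patterns agree. For the made-up third row the original pattern backtracks and captures only part of the sentence, while the anchored one keeps everything between the first and last word.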

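Answer 2: (score: 0)

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Splits on double quotes, so this assumes the Content part of text is wrapped in double quotes.
spltText = udf(lambda data:[ele.strip() for ele in data.split('"')], ArrayType(StringType()))
spltDF = df.withColumn("spltData",spltText(df.text))
spltDF = (spltDF.withColumn('Name', spltDF.spltData.getItem(0))
                .withColumn('Content', spltDF.spltData.getItem(1))
                .withColumn('Expression', spltDF.spltData.getItem(2)))
spltDF.select('id','Name','Content','Expression').show()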

+---+----+------------------+----------+
| id|Name|           Content|Expression|
+---+----+------------------+----------+
|  1| Amy|How are you today?|   (Smile)|
|  2| Sam|    Not very well.|     (Sad)|
+---+----+------------------+----------+