I have a dataframe in the following format:
id text
1 Amy How are you today? Smile
2 Sam Not very well. Sad
I want to generate a new dataframe in the following format:
id Name Content Expression
1 Amy How are you today? Smile
2 Sam Not very well. Sad
To do this, I plan to split the text column first:
from pyspark.sql import functions as F
cols = F.split(df['text'], ' ')
df = df.withColumn('Name', cols.getItem(0))
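But how do I get Content and Expression? Can I use cols.getItem(-1) to get the last element of the text? And how do I join cols[1:-1] (the second element through the second-to-last element) to form the new column Content?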
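Update: I checked the data, and double quotes around the sentence are not actually guaranteed. The only thing I can rely on is splitting on spaces.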
Answer 0 (score: 2)
Given an input dataframe with the following contents and schema:
+---+----------------------------+
|id |text |
+---+----------------------------+
|1 |Amy How are you today? Smile|
|2 |Sam Not very well. Sad |
+---+----------------------------+
root
|-- id: long (nullable = true)
|-- text: string (nullable = true)
you can simply use the following udf function to get what you need:
from pyspark.sql import functions as f
from pyspark.sql import types as t
# Return the three parts as a struct so they can be expanded with 'text.*'
@f.udf(t.StructType([
    t.StructField("Name", t.StringType(), True),
    t.StructField("Content", t.StringType(), True),
    t.StructField("Expression", t.StringType(), True)]))
def splitCols(array):
    # first token, middle tokens re-joined with spaces, last token
    return (array[0], ' '.join(array[1:-1]), array[-1])
df.withColumn('text', splitCols(f.split('text', ' ')))\
.select(f.col('id'), f.col('text.*'))\
.show(truncate=False)
which should give you
+---+----+------------------+----------+
|id |Name|Content |Expression|
+---+----+------------------+----------+
|1 |Amy |How are you today?|Smile |
|2 |Sam |Not very well. |Sad |
+---+----+------------------+----------+
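The body of the UDF is plain Python, so its slicing can be checked without a Spark session. A minimal sketch: split_cols is a hypothetical standalone copy of the function above, and the guard for null or very short input is an added assumption, not part of the original answer.

```python
def split_cols(tokens):
    """Return (name, content, expression) from a list of space-separated tokens."""
    # Assumed guard: a null text column reaches a Python UDF as None, and a row
    # with fewer than two tokens has no separate expression.
    if not tokens or len(tokens) < 2:
        return (tokens[0] if tokens else None, None, None)
    # first token, middle tokens re-joined with spaces, last token
    return (tokens[0], " ".join(tokens[1:-1]), tokens[-1])


print(split_cols("Amy How are you today? Smile".split(" ")))
# ('Amy', 'How are you today?', 'Smile')
print(split_cols("Sam Not very well. Sad".split(" ")))
# ('Sam', 'Not very well.', 'Sad')
print(split_cols(None))
# (None, None, None)
```

Negative indexes work here because the UDF receives an ordinary Python list, so tokens[-1] and tokens[1:-1] behave as they do in any Python code.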
Answer 1 (score: 1)
Having a UDF do this may be cleaner, but you can also solve it with built-in Spark functions.
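from pyspark.sql.functions import col, regexp_extract, reverse, split
# Name: first token; Content: everything between the first and last word;
# Expression: reverse the string, take its first token, reverse it back
df\
    .withColumn("Name", split(col("text"), " ").getItem(0))\
    .withColumn("Content", regexp_extract(col("text"), "[a-zA-Z0-9]+ (.*) [a-zA-Z0-9]+", 1)) \
    .withColumn("Expression", reverse(split(reverse(col("text")), " ").getItem(0))) \
    .show()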
+---+--------------------+----+------------------+----------+
| id| text|Name| Content|Expression|
+---+--------------------+----+------------------+----------+
| 1|Amy How are you t...| Amy|How are you today?| Smile|
| 2|Sam Not very well...| Sam| Not very well.| Sad|
+---+--------------------+----+------------------+----------+
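On Spark 2.4 or later, the question about negative indexes can also be handled with built-in array functions: element_at accepts a negative index that counts from the end of the array, and slice together with array_join rebuilds the middle part. A minimal sketch, assuming every row has at least three space-separated tokens; SPLIT_EXPRS and split_text are hypothetical names, not part of the answers above.

```python
# Spark SQL expressions; element_at, slice and array_join need Spark 2.4 or later.
PARTS = "split(text, ' ')"

SPLIT_EXPRS = [
    "id",
    # element_at is 1-based, so index 1 is the first token
    f"element_at({PARTS}, 1) AS Name",
    # slice(array, start, length): start at the 2nd token, drop first and last
    f"array_join(slice({PARTS}, 2, size({PARTS}) - 2), ' ') AS Content",
    # a negative index counts from the end of the array
    f"element_at({PARTS}, -1) AS Expression",
]


def split_text(df):
    """Return id, Name, Content, Expression for a DataFrame with id and text columns."""
    return df.selectExpr(*SPLIT_EXPRS)
```

Calling split_text(df).show(truncate=False) on the dataframe from the question should produce the same four columns as the UDF version, without sending rows through Python.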
Answer 2 (score: 0)
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
# Assumes the sentence is wrapped in double quotes: split on '"' and trim each piece
spltText = udf(lambda data: [ele.strip() for ele in data.split('"')], ArrayType(StringType()))
spltDF = df.withColumn("spltData", spltText(df.text))
spltDF = (spltDF.withColumn('Name', spltDF.spltData.getItem(0))
          .withColumn('Content', spltDF.spltData.getItem(1))
          .withColumn('Expression', spltDF.spltData.getItem(2)))
spltDF.select('id', 'Name', 'Content', 'Expression').show()
+---+----+------------------+----------+
| id|Name| Content|Expression|
+---+----+------------------+----------+
| 1| Amy|How are you today?| (Smile)|
| 2| Sam| Not very well.| (Sad)|
+---+----+------------------+----------+
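This answer depends on the double quotes that the question says are not guaranteed. A quick check in plain Python shows what happens in both cases; split_on_quotes is a hypothetical standalone copy of the lambda, and the quoted sample string is an assumed input rather than data from the question.

```python
def split_on_quotes(data):
    """Split on double quotes and trim whitespace, as the lambda in the UDF does."""
    return [ele.strip() for ele in data.split('"')]


# Assumed input with the sentence wrapped in double quotes
quoted = split_on_quotes('Amy "How are you today?" Smile')
# The unquoted form shown in the question
unquoted = split_on_quotes('Amy How are you today? Smile')

print(quoted)    # ['Amy', 'How are you today?', 'Smile']
print(unquoted)  # ['Amy How are you today? Smile']
```

Without quotes the list has a single element, so there is nothing at positions 1 and 2 for Content and Expression, and the whole line ends up in Name.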