Question

我在PySpark中有一个数据表，其中包含两列，数据类型为'struc'。

请参见下面的示例数据框：

word_verb                   word_noun
{_1=cook, _2=VB}            {_1=chicken, _2=NN}
{_1=pack, _2=VBN}           {_1=lunch, _2=NN}
{_1=reconnected, _2=VBN}    {_1=wifi, _2=NN}

我想将两列连接在一起，以便对连接的动词和名词块进行频率计数。

我尝试了以下代码：

df = df.withColumn('word_chunk_final', F.concat(F.col('word_verb'), F.col('word_noun')))

但是出现以下错误：

AnalysisException: u"cannot resolve 'concat(`word_verb`, `word_noun`)' due to data type mismatch: input to function concat should have been string, binary or array, but it's [struct<_1:string,_2:string>, struct<_1:string,_2:string>]

我想要的输出表如下。串联的新字段的数据类型为字符串：

word_verb                   word_noun               word_chunk_final
{_1=cook, _2=VB}            {_1=chicken, _2=NN}     cook chicken
{_1=pack, _2=VBN}           {_1=lunch, _2=NN}       pack lunch
{_1=reconnected, _2=VBN}    {_1=wifi, _2=NN}        reconnected wifi

Answer 1

您的代码几乎在那里。

假设您的架构如下：

df.printSchema()
#root
# |-- word_verb: struct (nullable = true)
# |    |-- _1: string (nullable = true)
# |    |-- _2: string (nullable = true)
# |-- word_noun: struct (nullable = true)
# |    |-- _1: string (nullable = true)
# |    |-- _2: string (nullable = true)

您只需要访问每一列的_1字段的值：

import pyspark.sql.functions as F

df.withColumn(
    "word_chunk_final", 
    F.concat_ws(' ', F.col('word_verb')['_1'], F.col('word_noun')['_1'])
).show()
#+-----------------+------------+----------------+
#|        word_verb|   word_noun|word_chunk_final|
#+-----------------+------------+----------------+
#|        [cook,VB]|[chicken,NN]|    cook chicken|
#|       [pack,VBN]|  [lunch,NN]|      pack lunch|
#|[reconnected,VBN]|   [wifi,NN]|reconnected wifi|
#+-----------------+------------+----------------+

此外，您应该使用concat_ws（“用分隔符连接”）而不是concat来添加字符串以及它们之间的空格。类似于str.join在python中的工作方式。

PySpark：连接数据类型为“ Struc”的两列->错误：由于数据类型不匹配而无法解析

1 个答案: