Convert Row into List(String) in PySpark

Asked: 2018-01-19 12:22:33

Tags: apache-spark pyspark pyspark-sql

I have data in the form of a Row tuple:

Row(Sentence=u'When, for the first time I realized the meaning of death.')

I want to convert it to String format, like this:

(u'When, for the first time I realized the meaning of death.')

I tried something like this (assuming 'a' holds the data as a Row tuple):

b = sc.parallelize(a)
b = b.map(lambda line: tuple([str(x) for x in line]))
print(b.take(4))

But the result I get is this:

[('W', 'h', 'e', 'n', ',', ' ', 'f', 'o', 'r', ' ', 't', 'h', 'e', ' ', 'f', 'i', 'r', 's', 't', ' ', 't', 'i', 'm', 'e', ' ', 'I', ' ', 'r', 'e', 'a', 'l', 'i', 'z', 'e', 'd', ' ', 't', 'h', 'e', ' ', 'm', 'e', 'a', 'n', 'i', 'n', 'g', ' ', 'o', 'f', ' ', 'd', 'e', 'a', 't', 'h', '.')]

Does anyone know what I am doing wrong?

2 Answers:

Answer 0 (Score: 4)

The Row (why would you even...) should be:

a = Row(Sentence=u'When, for the first time I realized the meaning of death.')

b = sc.parallelize([a])

and flatten it with

b.map(lambda x: x.Sentence)

or

b.flatMap(lambda x: x)

although sc.parallelize(a) already gives you the format you need: because you pass an Iterable, Spark iterates over all the fields in the Row to create the RDD.
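
Putting the pieces together, here is a minimal sketch of this approach (it assumes an existing SparkContext named sc, as in the question; the variable names are illustrative):

from pyspark.sql import Row

a = Row(Sentence=u'When, for the first time I realized the meaning of death.')

# parallelize([a]) creates an RDD with a single Row element;
# parallelize(a) would instead iterate over the Row's fields
b = sc.parallelize([a])

# extract the single field from each Row
print(b.map(lambda x: x.Sentence).take(1))
# expected: [u'When, for the first time I realized the meaning of death.']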

Answer 1 (Score: 0)

Here is the code:

col = 'your_column_name'
# collect() returns a list of Row objects
val = df.select(col).collect()
# pull the column value out of each Row
val2 = [getattr(ele, col) for ele in val]
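
For context, a hedged usage example built around a one-row DataFrame (the SparkSession setup and the column name 'Sentence' are assumptions mirroring the question, not part of the original answer):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([Row(Sentence=u'When, for the first time I realized the meaning of death.')])

col = 'Sentence'
val = df.select(col).collect()             # list of Row objects
val2 = [getattr(ele, col) for ele in val]  # plain strings
print(val2)
# expected: [u'When, for the first time I realized the meaning of death.']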