If I have a DataFrame with the fields ['did', 'doc'], such as
data = sc.parallelize(['This is a test',
                       'This is also a test',
                       'These sentence are tests',
                       'This tests these sentences'])\
    .zipWithIndex()\
    .map(lambda x: (x[1], x[0]))\
    .toDF(['did', 'doc'])
data.show()
+---+--------------------+
|did|                 doc|
+---+--------------------+
|  0|      This is a test|
|  1| This is also a test|
|  2|These sentence ar...|
|  3|This tests these ...|
+---+--------------------+
I then apply some transformations to the documents, such as tokenizing them and extracting 2-grams:
from pyspark.ml.feature import NGram, Tokenizer, VectorAssembler
data = Tokenizer(inputCol='doc', outputCol='words').transform(data)
data = NGram(n=2, inputCol='words', outputCol='grams').transform(data)
data.show()
+---+--------------------+--------------------+--------------------+
|did|                 doc|               words|               grams|
+---+--------------------+--------------------+--------------------+
|  0|      This is a test| [this, is, a, test]|[this is, is a, a...|
|  1| This is also a test|[this, is, also, ...|[this is, is also...|
|  2|These sentence ar...|[these, sentence,...|[these sentence, ...|
|  3|This tests these ...|[this, tests, the...|[this tests, test...|
+---+--------------------+--------------------+--------------------+
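To make clear what the two columns contain for a single document, here is a rough plain-Python sketch of the same steps. The helpers `tokenize` and `ngrams` are hypothetical names used only for illustration. They approximate what Tokenizer (lowercase, split on whitespace) and NGram (join consecutive tokens with a space) do, and are not Spark's implementation.

```python
def tokenize(doc):
    # Lowercase the text and split on whitespace, roughly like Tokenizer.
    return doc.lower().split()


def ngrams(tokens, n=2):
    # Join every run of n consecutive tokens with a space, roughly like NGram.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = tokenize("This is a test")
grams = ngrams(words, n=2)

# The combined feature list I am after is simply the two lists concatenated.
features = words + grams
print(features)
```

So the goal is one array column per row holding the words followed by the 2-grams.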
Finally, I want to combine the 2-grams and the words into a single features column with VectorAssembler:
data = VectorAssembler(inputCols=['words', 'grams'],
                       outputCol='features').transform(data)
I then get the following error:
Py4JJavaError: An error occurred while calling o504.transform.
: java.lang.IllegalArgumentException: Data type ArrayType(StringType,true) is not supported.
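because VectorAssembler does not accept columns that hold lists of strings. To work around this, I can drop the DataFrame down to an RDD, map the RDD to the appropriate Rows, and convert it back to a DataFrame, like this: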
from pyspark.sql import Row
data = data.rdd.map(lambda x: Row(did=x['did'],
                                  features=x['words'] + x['grams']))\
    .toDF(['did', 'features'])
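This is not a problem for this small dataset, but it is very expensive for a large one.
Is there a more efficient way to achieve this than the approach above?
Answer 0 (score: 0)
You can use a udf to create the features column like this: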
import pyspark.sql.functions as f
import pyspark.sql.types as t
udf_add = f.udf(lambda x,y: x+y, t.ArrayType(t.StringType()))
data.withColumn('features', udf_add('words', 'grams')).select('features').collect()
[Row(features=['this', 'is', 'a', 'test', 'this is', 'is a', 'a test']),
Row(features=['this', 'is', 'also', 'a', 'test', 'this is', 'is also', 'also a', 'a test']),
Row(features=['these', 'sentence', 'are', 'tests', 'these sentence', 'sentence are', 'are tests']),
Row(features=['this', 'tests', 'these', 'sentences', 'this tests', 'tests these', 'these sentences'])]