Spark SQL query returns StringType instead of ArrayType?

Asked: 2016-11-09 14:41:07

Tags: apache-spark-sql spark-dataframe

When I try to apply my UDF inside a spark.sql query, the query returns one long string that merely looks like my array, instead of returning my cleaned words as an actual array. This then causes an error when I try to apply CountVectorizer. The error it raises is: 'requirement failed: Column cleanedWords must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.'

Here is my code:

from string import punctuation
from hebrew import stop_words
hebrew_stopwords = stop_words()

def removepuncandstopwords(listofwords):
    # Keep only non-stopwords, stripping leading/trailing
    # punctuation from each word that survives the filter.
    newlistofwords = []
    for word in listofwords:
        if word not in hebrew_stopwords:
            for punc in punctuation:
                word = word.strip(punc)
            newlistofwords.append(word)
    return newlistofwords

from pyspark.ml.feature import CountVectorizer, IDF, Tokenizer, Normalizer
from pyspark.sql.types import ArrayType, StringType

sqlctx.udf.register("removepuncandstopwords", removepuncandstopwords, ArrayType(StringType()))

sentenceData = spark.createDataFrame([
    (0, "Hello my friend; i am sam"),
    (1, "Hello, my name is sam")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.registerTempTable("wordsData")
wordsDataCleaned = spark.sql("select label, sentence, words, removepuncandstopwords(words) as cleanedWords from wordsData")



wordsDataCleaned[['cleanedWords']].rdd.take(2)[0]
Out[163]:
Row(cleanedWords='[hello, my, friend, i, am, sam]')
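Checking the schema makes the mismatch explicit; the output below is roughly what I would expect on Spark 2.x:

wordsDataCleaned.printSchema()

# root
#  |-- label: long (nullable = true)
#  |-- sentence: string (nullable = true)
#  |-- words: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- cleanedWords: string (nullable = true)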

How can I fix this?

1 Answer:

Answer 0 (score: 2):

I ran into this error as well. The way the documentation expects the data to be structured is:

cleanedWords=['hello', 'my', 'friend', 'is', 'sam']

Yours, however, seems to differ. So instead of this:

sentenceData = spark.createDataFrame([
    (0, "Hello my friend; i am sam"),
    (1, "Hello, my name is sam")
], ["label", "sentence"])

I believe it should be this (see the CountVectorizer sketch after the source note below):

documentDF = spark.createDataFrame([
    (0, "Hello my friend; i am sam".split(" ")),
    (1, "Hello, my name is sam".split(" "))
], ["label", "sentence"])

Source: I am going off the documentation, where they structure the code like this:

documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])

Link: https://spark.apache.org/docs/2.1.0/ml-features.html#word2vec
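To see why the array shape matters, here is a minimal sketch (my own, assuming Spark 2.x) that feeds the pre-split sentence column of documentDF straight into CountVectorizer; no Tokenizer step is needed because the column is already ArrayType(StringType()):

from pyspark.ml.feature import CountVectorizer

# "sentence" is already an array<string> column, which is exactly
# the input type CountVectorizer requires.
cv = CountVectorizer(inputCol="sentence", outputCol="features")
cvModel = cv.fit(documentDF)
cvModel.transform(documentDF).show(truncate=False)

Alternatively, if you want to keep the original Tokenizer pipeline, declaring the UDF through the DataFrame API with an explicit return type should also preserve the array type. Again a sketch rather than a tested fix, reusing removepuncandstopwords from the question:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

# Declaring the return type here keeps cleanedWords as array<string>
# instead of letting it come back as a string.
cleaned = udf(removepuncandstopwords, ArrayType(StringType()))
wordsDataCleaned = wordsData.withColumn("cleanedWords", cleaned(col("words")))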