When I try to apply my UDF during a spark.sql query, the query returns one long string that looks like my array instead of returning my cleaned words as an array. That then causes an error when I try to apply CountVectorizer. The error it raises is 'requirement failed: Column cleanedWords must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.'
Here is my code:
from string import punctuation
from hebrew import stop_words
hebrew_stopwords = stop_words()
def removepuncandstopwords(listofwords):
    newlistofwords = []
    for word in listofwords:
        if word not in hebrew_stopwords:
            for punc in punctuation:
                word = word.strip(punc)
            newlistofwords.append(word)
    return newlistofwords
from pyspark.ml.feature import CountVectorizer, IDF, Tokenizer, Normalizer
from pyspark.sql.types import ArrayType, StringType
sqlctx.udf.register("removepuncandstopwords", removepuncandstopwords, ArrayType(StringType()))
sentenceData = spark.createDataFrame([
    (0, "Hello my friend; i am sam"),
    (1, "Hello, my name is sam")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.registerTempTable("wordsData")
wordsDataCleaned = spark.sql("select label, sentence, words, removepuncandstopwords(words) as cleanedWords from wordsData")
wordsDataCleaned[['cleanedWords']].rdd.take(2)[0]
Out[163]:
Row(cleanedWords='[hello, my, friend, i, am, sam]')
How can I fix this?
Answer 0 (score: 2)
So I ran into this error as well. The data structure the documentation expects is:
cleanedWords=['hello', 'my', 'friend', 'is', 'sam']
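You can see the mismatch by printing the schema; a minimal check, assuming the wordsDataCleaned DataFrame from the question:
wordsDataCleaned.printSchema()
# cleanedWords shows up as string here, while CountVectorizer needs array<string>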
Yours, however, looks different. So instead of this:
sentenceData = spark.createDataFrame([
    (0, "Hello my friend; i am sam"),
    (1, "Hello, my name is sam")],
    ["label", "sentence"])
I believe it should be this:
documentDF = spark.createDataFrame([
    (0, "Hello my friend; i am sam".split(" ")),
    (1, "Hello, my name is sam".split(" "))],
    ["label", "sentence"])
Source: I'm going off the documentation, where they build the DataFrame like this:
documentDF = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "), ),
    ("I wish Java could use case classes".split(" "), ),
    ("Logistic regression models are neat".split(" "), )
], ["text"])
Link - https://spark.apache.org/docs/2.1.0/ml-features.html#word2vec
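If you'd rather keep the cleaning UDF, another pattern that avoids the string result is to wrap the function with pyspark.sql.functions.udf and an explicit ArrayType(StringType()) return type and apply it through the DataFrame API. A minimal sketch, assuming the wordsData DataFrame and removepuncandstopwords function from the question (the remove_udf name is just for the example):
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Declaring the return type keeps the result as array<string> rather than a plain string
remove_udf = udf(removepuncandstopwords, ArrayType(StringType()))
wordsDataCleaned = wordsData.withColumn("cleanedWords", remove_udf("words"))
wordsDataCleaned.printSchema()  # cleanedWords should now be array (containing string)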