Converting a list of lists to a Spark DataFrame

Asked: 2019-06-11 09:45:17

Tags: apache-spark pyspark apache-spark-sql

My transaction data `data` looks like this:

 [ ["a","e","l"],["f","a","e","m","n"], ...]

Each sublist represents a single transaction, and there is no header row. I am trying to run the FPGrowth algorithm with pyspark.

Here is what I tried:

```python
from pyspark.ml.fpm import FPGrowth
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

df = spark.createDataFrame(data, ["items"])
print("1.Here")
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.9, minConfidence=0.9)

model = fpGrowth.fit(df)

# Display frequent itemsets.
model.freqItemsets.show()

# Display generated association rules.
model.associationRules.show()

# transform examines the input items against all the association rules and
# summarizes the consequents as the prediction.
model.transform(df).show()
```

The error I get is:

```
IllegalArgumentException:
'requirement failed: The input column must be array, but got string.'
```

1 Answer:

Answer 0 (score: 0)

The DataFrame is being interpreted incorrectly: with a plain list of lists, `createDataFrame` treats each element of an inner list as a separate string column, so the `items` column comes out as string rather than array&lt;string&gt;. Wrap each transaction in a one-element tuple so the whole list lands in a single column, i.e. use data in the following format:

data = [ (["a","e","l"],),(["f","a","e","m","n"],) ]