Overall problem: build a schema from a CSV file and apply it to a data file. I have an RDD with a single column, and I want to turn it into one string. I use the code below to do that; it works fine in the PySpark interactive shell but fails in a Spark job.
schema = metadata.map(lambda l: l).reduce(lambda l, m: l+ "," + m)
The output should look like 'id,name,age'. But when I run the job, I get this error:
Exception: It appears that you are attempting to broadcast an RDD or reference
an RDD from an action or transformation. RDD transformations and actions can
only be invoked by the driver, not inside of other transformations; for
example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the
values transformation and count action cannot be performed inside of the
rdd1.map transformation. For more information, see SPARK-5063.
The full Spark job submitted:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
# create configuration for this job
conf = SparkConf().setAppName('DFtest')
# create spark context for the job
sc = SparkContext(conf=conf)
# create a sqlContext for Data Frame operations
sqlContext = SQLContext(sc)
metadata_input = "file:/home/mapr/metadata/loans_metadata.csv"
data_input = "/user/bedrock/data/loans_incr/loans-2016-02-24-06-00-00-2016-02-25-06-00-00-864ca30f-097f-4234-87bc-7f1a7d57aa7e.csv"
metadata = sc.textFile(metadata_input)
header = metadata.filter(lambda l: "Technical Name" in l)
metadata = metadata.filter(lambda l: l != header)
metadata = metadata.map(lambda l: l.split(",")[0])
schema = metadata.map(lambda l: l).reduce(lambda l, m: l+ "," + m)
fields = [StructField(field_name, StringType(), True) for field_name in schema.split(",")]
finalSchema = StructType(fields)
data = sc.textFile(data_input)
df = sqlContext.createDataFrame(data, finalSchema)
df.show()
sc.stop()
I looked at other posts about this error but could not see how this is a nested map. I gather it can be solved with a broadcast, but I don't know how. Please advise.
Answer 0 (score: 2)
The problem is here:
header = metadata.filter(lambda l: "Technical Name" in l)
`header` is an RDD, not a local object. Try pulling it down to the driver first:
header = metadata.filter(lambda l: "Technical Name" in l).first()
That said, simply using:
metadata.filter(lambda l: "Technical Name" not in l)
should have the same effect, if you only expect "Technical Name" to occur once.
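As a sanity check, here is a local, Spark-free simulation of the fixed logic. Plain Python lists stand in for the RDD, and the metadata rows below are invented for illustration; `next(...)` plays the role of `.first()`, which returns a plain string to the driver instead of an RDD:

```python
# Hypothetical metadata file contents: first CSV field is the column name.
metadata = [
    "Technical Name,Description",  # header row
    "id,loan identifier",
    "name,borrower name",
    "age,borrower age",
]

# Fixed approach: materialize the header as a local string first
# (the analogue of .filter(...).first() in Spark), so the later
# comparison is string-to-string, not string-to-RDD.
header = next(l for l in metadata if "Technical Name" in l)
rows = [l for l in metadata if l != header]

# Take the first CSV field of each row and fold into "id,name,age",
# mirroring the map + reduce in the question.
schema = ",".join(l.split(",")[0] for l in rows)
print(schema)  # id,name,age
```

With `header` as a local string, the filter comparison works as intended; in the original job, `l != header` compared a string against an RDD object inside a transformation, which is exactly what SPARK-5063 forbids.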