I wrote this sample program to save an RDD to an Avro file.
I am using CDH 5.4 and Spark 1.3.
I wrote this avsc file and then generated the code for the User class.

{
  "namespace": "com.abhi",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "firstname", "type": "string"},
    {"name": "lastname", "type": "string"}
  ]
}

The code for User was generated with:

java -jar ~/Downloads/avro-tools-1.7.7.jar compile schema User.avsc .
Then I wrote my example:
package com.abhi

import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkConf
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroKeyOutputFormat, AvroJob, AvroKeyInputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext

object MySpark {
  def main(args: Array[String]): Unit = {
    val sf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("MySpark")
    val sc = new SparkContext(sf)

    val user1 = new User()
    user1.setFirstname("Test1")
    user1.setLastname("Test2")

    val user2 = new User("Test3", "Test4")

    // Construct via builder
    val user3 = User.newBuilder()
      .setFirstname("Test5")
      .setLastname("Test6")
      .build()

    val list = Array(user1, user2, user3)
    val userRdd = sc.parallelize(list)

    val job: Job = Job.getInstance()
    AvroJob.setOutputKeySchema(job, user1.getSchema)

    val output = "/user/cloudera/users.avro"
    userRdd.map(row => (new AvroKey(row), NullWritable.get()))
      .saveAsNewAPIHadoopFile(
        output,
        classOf[AvroKey[User]],
        classOf[NullWritable],
        classOf[AvroKeyOutputFormat[User]],
        job.getConfiguration)
  }
}
I have two concerns about this code.

Some of the imports come from the old mapreduce API, and I wonder why Spark code needs them:
import org.apache.hadoop.mapreduce.Job
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroKeyOutputFormat, AvroJob, AvroKeyInputFormat}
And when I submit the code to the Hadoop cluster, it throws an exception. It does create an empty directory in HDFS called /user/cloudera/users.avro.

15/11/01 08:20:42 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/11/01 08:20:42 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/01 08:20:42 INFO spark.SparkContext: Starting job: saveAsNewAPIHadoopFile at MySpark.scala:52
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Got job 1 (saveAsNewAPIHadoopFile at MySpark.scala:52) with 2 output partitions (allowLocal=false)
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Final stage: Stage 1(saveAsNewAPIHadoopFile at MySpark.scala:52)
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Missing parents: List()
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[2] at map at MySpark.scala:51), which has no missing parents
15/11/01 08:20:42 INFO storage.MemoryStore: ensureFreeSpace(66904) called with curMem=301745, maxMem=280248975
15/11/01 08:20:42 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 65.3 KB, free 266.9 MB)
15/11/01 08:20:42 INFO storage.MemoryStore: ensureFreeSpace(23066) called with curMem=368649, maxMem=280248975
15/11/01 08:20:42 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 22.5 KB, free 266.9 MB)
15/11/01 08:20:42 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:34630 (size: 22.5 KB, free: 267.2 MB)
15/11/01 08:20:42 INFO storage.BlockManagerMaster: Updated info of block broadcast_2_piece0
15/11/01 08:20:42 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[2] at map at MySpark.scala:51)
15/11/01 08:20:42 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/11/01 08:20:42 ERROR scheduler.TaskSetManager: Failed to serialize task 1, not attempting to retry it.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:150)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:58)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:39)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:149)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:464)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:232)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:227)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$6.apply(TaskSchedulerImpl.scala:296)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$6.apply(TaskSchedulerImpl.scala:294)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
Answer 0 (score: 0)
The problem is that Spark cannot serialize your User class; try setting up Kryo serialization and registering your User class there.
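A minimal sketch of what that could look like, assuming the suggestion refers to Spark's Kryo serializer (Kryo is the Spark serializer that lets you register classes; the User class comes from the question, everything else is standard SparkConf configuration):

val sf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("MySpark")
  // Use Kryo instead of Java serialization for RDD data
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Register the generated Avro class so Kryo can handle User records
  .registerKryoClasses(Array(classOf[User]))
val sc = new SparkContext(sf)

This only changes how the SparkConf from the question is built; whether it resolves the task-serialization failure shown in the stack trace would need to be verified on the cluster.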