I want to read local files in Spark Streaming, but I'm running into a problem: I never receive any data. I start the Spark Streaming application first and then move a new file into the monitored data directory, but nothing happens, and the program reports no errors. I have tried both local and cluster mode on Linux without success. However, it does work when a new file arrives in the HDFS data directory. Can anyone help me out? Thanks! My code follows.
import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SparkStreamingTest {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf();
        conf.setMaster("local[2]")
            //.setMaster("spark://192.168.170.135:7077")
            .setAppName("testSparkStreaming");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Monitor the local directory for newly arriving files.
        //JavaDStream<String> lines = jssc.textFileStream("file:///home/sparkStreaming/data");
        JavaDStream<String> lines = jssc.textFileStream("e:\\testFile");
        //JavaDStream<String> lines = jssc.textFileStream("hdfs://192.168.170.135:9000/sparkStreaming/data");

        // Split each line into words.
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });

        lines.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
I start the program and then move a new file into the data directory; the log below is what I get. No data ever arrives. Does Spark Streaming's textFileStream method not support reading local files?
14:21:55.001 [JobGenerator] DEBUG o.a.s.s.dstream.FileInputDStream - Time 1502086915000 ms is valid
14:21:55.001 [JobGenerator] DEBUG o.a.s.s.dstream.FileInputDStream - Getting new files for time 1502086915000, ignoring files older than 1502086855000
14:21:55.002 [JobGenerator] DEBUG o.a.s.s.dstream.FileInputDStream - file:/e:/testFile/test2.txt ignored as mod time 1501816119160 <= ignore time 1502086855000
14:21:55.002 [JobGenerator] INFO o.a.s.s.dstream.FileInputDStream - Finding new files took 1 ms
14:21:55.002 [JobGenerator] DEBUG o.a.s.s.dstream.FileInputDStream - # cached file times = 1
14:21:55.002 [JobGenerator] INFO o.a.s.s.dstream.FileInputDStream - New files at time 1502086915000 ms:
14:28:40.009 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - +++ Cleaning closure <function1> (org.apache.spark.streaming.StreamingContext$$anonfun$textFileStream$1$$anonfun$apply$2) +++
14:28:40.010 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - + declared fields: 1
14:28:40.010 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - public static final long org.apache.spark.streaming.StreamingContext$$anonfun$textFileStream$1$$anonfun$apply$2.serialVersionUID
14:28:40.010 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - + declared methods: 2
14:28:40.010 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - public final java.lang.Object org.apache.spark.streaming.StreamingContext$$anonfun$textFileStream$1$$anonfun$apply$2.apply(java.lang.Object)
14:28:40.010 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - public final java.lang.String org.apache.spark.streaming.StreamingContext$$anonfun$textFileStream$1$$anonfun$apply$2.apply(scala.Tuple2)
14:28:40.010 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - + inner classes: 0
14:28:40.010 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - + outer classes: 0
14:28:40.010 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - + outer objects: 0
14:28:40.011 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - + populating accessed fields because this is the starting closure
14:28:40.012 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - + fields accessed by starting closure: 0
14:28:40.012 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - + there are no enclosing objects!
14:28:40.012 [JobGenerator] DEBUG org.apache.spark.util.ClosureCleaner - +++ closure <function1> (org.apache.spark.streaming.StreamingContext$$anonfun$textFileStream$1$$anonfun$apply$2) is now cleaned +++
14:28:40.013 [JobGenerator] DEBUG o.a.spark.streaming.DStreamGraph - Generated 1 jobs for time 1502087320000 ms
-------------------------------------------
Time: 1502087320000 ms
-------------------------------------------
14:28:40.014 [JobGenerator] INFO o.a.s.s.scheduler.JobScheduler - Added jobs for time 1502087320000 ms
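Update: judging from the log line "file:/e:/testFile/test2.txt ignored as mod time 1501816119160 <= ignore time 1502086855000", the file seems to be skipped because its modification time predates the stream's ignore threshold; textFileStream only picks up files whose modification time falls inside the current batch window, and moving a file typically preserves its old mtime. Below is a minimal sketch of the workaround I am trying, refreshing the mtime before the file lands in the watched directory; the staging path and file name are just placeholders:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;

public class MoveWithFreshMtime {
    public static void main(String[] args) throws IOException {
        Path src = Paths.get("e:\\staging\\test2.txt");   // hypothetical staging location
        Path dst = Paths.get("e:\\testFile\\test2.txt");  // the monitored directory

        // A plain move keeps the original modification time, which is older
        // than the stream's ignore threshold, so refresh it first.
        Files.setLastModifiedTime(src, FileTime.fromMillis(System.currentTimeMillis()));
        Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
    }
}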