
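I am using Java and Spark to extract data from Common Crawl. I would like to know whether there is a way to parallelize regular-expression matching with Apache Spark. For example, I have the following code: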

for (S3ObjectSummary s : summary) {
    // Only load objects whose key contains "robotstxt"
    if (s.getKey().matches(".*robotstxt.*")) {
        rdd = sc.textFile("s3n://" + accessKey + ":" + secretKey
                + "@commoncrawl/" + s.getKey());
    }
}

In the Apache Spark documentation, I saw that it can be used to do the following:

JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = textFile
 .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
 .mapToPair(word -> new Tuple2<>(word, 1))
 .reduceByKey((a, b) -> a + b);
counts.saveAsTextFile("hdfs://...");

I only need to check whether each line of the text file matches a regular expression. For example:

if (line.matches(regex))
   ....

I would then like to write all of the matching lines to a text file.

Is there a way to parallelize this pattern matching with Spark?
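
To make the per-line check concrete, here is a minimal sketch in plain Java (no Spark dependency). The class and method names (RegexLineFilter, wholeLineMatcher, matchingLines) are hypothetical and exist only for illustration. It precompiles the pattern once rather than calling line.matches(regex) for every line, and it keeps the whole-line semantics of String.matches. The trailing comment shows how the same check would presumably be handed to filter on a JavaRDD<String>; that part is an untested assumption, and inputPath and outputPath are placeholders.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexLineFilter {

    // Compile the regex once; String.matches(regex) recompiles it on every call.
    static Predicate<String> wholeLineMatcher(String regex) {
        Pattern pattern = Pattern.compile(regex);
        // Matcher.matches() requires the entire line to match, like String.matches.
        return line -> pattern.matcher(line).matches();
    }

    // Keep only the lines that match the regex, preserving their order.
    static List<String> matchingLines(List<String> lines, String regex) {
        return lines.stream()
                .filter(wholeLineMatcher(regex))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "User-agent: *",
                "Disallow: /private",
                "Sitemap: http://example.com/sitemap.xml");
        matchingLines(lines, "Disallow:.*").forEach(System.out::println);
    }
}

// Assumed Spark counterpart (not run here; needs Spark on the classpath,
// sc is a JavaSparkContext, inputPath and outputPath are placeholders):
//   sc.textFile(inputPath)
//     .filter(line -> line.matches(regex))
//     .saveAsTextFile(outputPath);
```

In the Spark version, filter would run on each partition independently, so the matching itself would be distributed without any explicit loop over lines.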

0 answers: