Question

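I am searching through some data files (~20 GB). I want to find specific terms in that data and record the offset of each match. Is there a way to have Spark tell me the offset of the chunk of data I am operating on?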

import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

import java.util.regex.*;

public class Grep {
        public static void main( String args[] ) {
            SparkConf        conf       = new SparkConf().setMaster( "spark://ourip:7077" );
            JavaSparkContext jsc        = new JavaSparkContext( conf );
            JavaRDD<String>  data       = jsc.textFile( "hdfs://ourip/test/testdata.txt" ); // load the data from HDFS
            JavaRDD<String>  filterData = data.filter( new Function<String, Boolean>() {
                    // I'd like to do something here to get the offset in the original file of the string "babe ruth"
                    public Boolean call( String s ) { return s.toLowerCase().contains( "babe ruth" ); } // case insens matching

            });

            long matches = filterData.count();  // count() is the action that actually executes the filter

            System.out.println( "Lines with search terms: " + matches );
        } //  end main
} // end class Grep

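I would like to do something inside the filter operation to compute the offset of "babe ruth" in the original file. I can get the offset of "babe ruth" within the current line, but what process or function tells me the offset of the line within the file?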

Answer 1

In Spark you can use the common Hadoop input formats. To read byte offsets from a file, use the Hadoop class TextInputFormat (org.apache.hadoop.mapreduce.lib.input). It is already bundled with Spark.

It reads the file as pairs of key (byte offset) and value (line of text):

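An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.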

In Spark, you can use it by calling newAPIHadoopFile():

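import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import scala.Tuple2;

SparkConf conf = new SparkConf().setMaster("");
JavaSparkContext jsc = new JavaSparkContext(conf);

// read the content of the file using Hadoop format
JavaPairRDD<LongWritable, Text> data = jsc.newAPIHadoopFile(
        "file_path", // input path
        TextInputFormat.class, // used input format class
        LongWritable.class, // class of the key (byte offset of the line)
        Text.class, // class of the value (text of the line)
        new Configuration());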

JavaRDD<String> mapped = data.map(new Function<Tuple2<LongWritable, Text>, String>() {
    @Override
    public String call(Tuple2<LongWritable, Text> tuple) throws Exception {
        // each record arrives as a tuple (offset, text)
        long pos = tuple._1().get(); // extract offset
        String line = tuple._2().toString(); // extract text

        return pos + " " + line;
    }
});
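
The key above is the byte offset at which the line starts, so the position of a match in the file is that key plus the position of the match inside the line. The following is a minimal sketch of that arithmetic in plain Java, without Spark. The class name MatchOffsets and the method matchOffsets are hypothetical, and the sketch assumes the file is UTF-8 encoded (so a character index from the String has to be converted to a byte count). The same logic could be called from inside the map() function with tuple._1().get() and tuple._2().toString().

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark or Hadoop.
class MatchOffsets {

    /**
     * Returns the absolute byte offset of every case-insensitive occurrence
     * of term in line, given the byte offset at which the line starts.
     * Assumes the file is UTF-8 encoded.
     */
    public static List<Long> matchOffsets(long lineStart, String line, String term) {
        List<Long> offsets = new ArrayList<>();
        if (term.isEmpty()) {
            return offsets;
        }
        for (int i = 0; i + term.length() <= line.length(); i++) {
            // regionMatches with ignoreCase=true avoids building a lowercased copy
            if (line.regionMatches(true, i, term, 0, term.length())) {
                // convert the character index into a byte count for the prefix
                long prefixBytes = line.substring(0, i).getBytes(StandardCharsets.UTF_8).length;
                offsets.add(lineStart + prefixBytes);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // line assumed to start at byte 100 of the file
        System.out.println(matchOffsets(100L, "George and Babe Ruth", "babe ruth")); // prints [111]
    }
}
```

Converting the prefix to bytes for every hit is wasteful on long lines with many matches; for pure ASCII data the character index can be added directly.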

Answer 2

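You can use the wholeTextFiles(String path, int minPartitions) method of JavaSparkContext, which returns a JavaPairRDD<String,String> where the key is the file name and the value is a string holding the entire contents of the file (so each record in this RDD represents one file). From there, just run map() and call indexOf(String searchString) on each value. That returns the index of the first occurrence of the search string in each file.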
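
Since indexOf(String) reports only the first occurrence, marking every match needs a loop over indexOf(String, int). Below is a small sketch in plain Java of what the function passed to map() might do with each file's contents; the class name AllOccurrences and the method indexesOf are hypothetical. Note that the indexes are character positions in the String, and that a Java String cannot hold more than Integer.MAX_VALUE characters, so this only suits files well below that size.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark.
class AllOccurrences {

    /** Returns the character index of every occurrence of term in content. */
    public static List<Integer> indexesOf(String content, String term) {
        List<Integer> hits = new ArrayList<>();
        if (term.isEmpty()) {
            return hits;
        }
        int from = content.indexOf(term);
        while (from >= 0) {
            hits.add(from);
            // resume one character later so overlapping matches are found too
            from = content.indexOf(term, from + 1);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(indexesOf("babe ruth and babe ruth", "babe ruth")); // prints [0, 14]
    }
}
```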

(Edit:)

So, the offset can be found in a distributed fashion for a single file (per the use case described in the comments below). Here is an example in Scala.

val searchString = *search string*
val rdd1 = sc.textFile(*input file*, *num partitions*)

// Zip RDD lines with their indices
val zrdd1 = rdd1.zipWithIndex()

// Find the first RDD line that contains the string in question
val firstFind = zrdd1.filter { case (line, index) => line.contains(searchString) }.first()

// Grab all lines before the line containing the search string and sum up all of their lengths (and then add the inline offset)
val filterLines = zrdd1.filter { case (line, index) => index < firstFind._2 }
// Sum as Long: the total length of a ~20 GB file does not fit in an Int
val offset = filterLines.map { case (line, index) => line.length.toLong }.reduce(_ + _) + firstFind._1.indexOf(searchString)

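Note that you also need to add the newline characters yourself, since they are not counted (the input format uses newlines as the delimiter between records). The number of newlines is simply the number of lines before the line containing the search string, so this is easy to add.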
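
To make the newline correction concrete, here is a small local sketch in plain Java (no Spark) of the same arithmetic: the lengths of the preceding lines, plus one newline per preceding line, plus the position inside the matching line. The class name NewlineAwareOffset and the method firstOffset are hypothetical, and the sketch assumes every line ended with a single "\n" (not "\r\n") and that offsets are counted in characters rather than bytes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical local illustration of the offset arithmetic; not distributed.
class NewlineAwareOffset {

    /**
     * Returns the character offset of the first occurrence of term in the
     * file made of the given lines, or -1 if no line contains it.
     * Assumes each line was terminated by exactly one '\n'.
     */
    public static long firstOffset(List<String> lines, String term) {
        long consumed = 0L; // characters in all earlier lines, newlines included
        for (String line : lines) {
            int inLine = line.indexOf(term);
            if (inLine >= 0) {
                return consumed + inLine;
            }
            consumed += line.length() + 1L; // +1 for the stripped newline
        }
        return -1L;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("ty cobb", "hank aaron", "the great babe ruth");
        System.out.println(firstOffset(lines, "babe ruth")); // prints 29
    }
}
```

In the RDD version, the equivalent correction is to add the index of the matching line (firstFind._2) to the summed line lengths.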

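Unfortunately I am not fully familiar with the Java API and it is not easy for me to test, so I am not sure whether the code below works, but here it is anyway (also, I am using Java 1.7; 1.8 would compress a lot of this code with lambda expressions):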

final String searchString = *search string*;
JavaRDD<String> data = jsc.textFile("hdfs://ourip/test/testdata.txt");

// Zip RDD lines with their indices (zipWithIndex() returns a JavaPairRDD)
JavaPairRDD<String, Long> zrdd1 = data.zipWithIndex();

// Find the first line that contains the search string
final Tuple2<String, Long> firstFind = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
      public Boolean call(Tuple2<String, Long> input) { return input._1().contains(searchString); }
  }).first();

// Keep only the lines before the one containing the search string
JavaPairRDD<String, Long> filterLines = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
      public Boolean call(Tuple2<String, Long> input) { return input._2() < firstFind._2(); }
  });

// Sum their lengths as Long (newlines not included), then add the in-line offset
Long offset = filterLines.map(new Function<Tuple2<String, Long>, Long>() {
      public Long call(Tuple2<String, Long> input) { return (long) input._1().length(); }
  }).reduce(new Function2<Long, Long, Long>() {
      public Long call(Long a, Long b) { return a + b; }
  }) + firstFind._1().indexOf(searchString);

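This only works when your input is a single file (otherwise zipWithIndex() does not guarantee offsets within a file), but the approach works for any number of RDD partitions, so feel free to split the file into as many chunks as you like.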

How to determine the offset in Apache Spark?
