Question

I have a set of files in a directory, and I want to read a specific subset of them as a single RDD. For example:

2000.txt
2001.txt
2002.txt
2003.txt
2004.txt
2005.txt
2006.txt
2007.txt
2008.txt
2009.txt
2010.txt
2011.txt
2012.txt

I want to read a specific range of these files, for example:

range = 4
from = 2004

then read the files 2004.txt, 2005.txt, 2006.txt, 2007.txt
as one RDD (data)

How can I do this in Spark with Scala?

Answer 1

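Because Spark's textFile is built on Hadoop's FileInputFormat, its single path argument can be a comma-separated list of files or directories, and each entry may contain wildcards. It does not take varargs, so the paths have to be joined into one string. So this should work (untested):

// sc is an existing SparkContext; the per-year paths are joined with commas into one String
def datedRange(fromYear: Int, years: Int) =
  sc.textFile(Seq.tabulate(years)(x => fromYear + x).map(y => s"/path/to/dir/$y.txt").mkString(","))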
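The path-building step can be pulled out and checked on its own, without a cluster. The following is only a sketch: the object YearPaths and both helper names are hypothetical, and it assumes the files are named <year>.txt as in the question. As far as I know, textFile takes one path string in which several paths are separated by commas, and Hadoop glob patterns support {a,b,c} alternation, so either string below should select the same files.

```scala
object YearPaths {
  // Hypothetical helper: one full path per year, joined with commas.
  // Assumes files are named <year>.txt inside `dir`.
  def commaSeparated(dir: String, fromYear: Int, years: Int): String =
    (fromYear until fromYear + years).map(y => s"$dir/$y.txt").mkString(",")

  // Hypothetical helper: the same range as a single glob using {a,b,c} alternation.
  def glob(dir: String, fromYear: Int, years: Int): String =
    (fromYear until fromYear + years).mkString(s"$dir/{", ",", "}.txt")
}
```

With an existing SparkContext named sc (an assumption, as in the answer), the call would then be sc.textFile(YearPaths.commaSeparated("/path/to/dir", 2004, 4)), which yields one RDD of lines covering 2004.txt through 2007.txt.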