Spark recursively reading files with the same name from all subfolders

Time: 2016-03-17 14:25:25

Tags: scala apache-spark hdinsight

I have a process that pushes a batch of data to Blob storage every hour, creating a folder structure like the following in my storage container:

/year=16/Month=03/Day=17/Hour=16/mydata.csv
/year=16/Month=03/Day=17/Hour=17/mydata.csv

and so on.

I want to access all of these mydata.csv files from my Spark context and process them, so I figured I need to set the configuration below so that a recursive search can be used:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
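For reference, a minimal sketch of how that flag could be combined with a plain parent-directory path instead of wildcards; the account, container, and folder names are the ones from this question, and the snippet is an illustration rather than a verified HDInsight setup:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
// With recursion enabled, pointing textFile at a parent folder picks up every
// file beneath it (each Hour=*/mydata.csv under the chosen month).
val allLines = sc.textFile("wasb://mycontainer@mystorage.blob.core.windows.net/year=16/Month=03/")
allLines.take(5).foreach(println)  // sanity check: prints a few raw CSV lines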

But when I run the following command to see how many files I'm getting, it returns a very large number, as shown below:

val csvFile2 = sc.textFile("wasb://mycontainer@mystorage.blob.core.windows.net/*/*/*/mydata.csv")
csvFile2.count
res41: Long = 106715282

Ideally it should return 24 * 16 = 384. I also verified on the container that it holds only 384 files, but for some reason it returns 106715282.
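One way to cross-check the file count, assuming the same wasb path pattern works for globbing, is to list the matching paths with the Hadoop FileSystem API rather than counting RDD records; this is a sketch, not something verified on HDInsight:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val pattern = "wasb://mycontainer@mystorage.blob.core.windows.net/*/*/*/mydata.csv"
val fs = FileSystem.get(new URI(pattern), sc.hadoopConfiguration)
// globStatus expands the wildcards and returns one FileStatus per matching path.
val matches = fs.globStatus(new Path(pattern))
println(matches.length)  // expected to be 384 if 384 mydata.csv files match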

Can someone help me understand where I'm going wrong?

Regards, Kiran

1 Answer:

Answer 0 (score: 0)

SparkContext has two similar methods: textFile and wholeTextFiles.

textFile loads each line of each file as a record in the RDD, so count() returns the total number of lines across all files (which in most cases, including yours, will be a very large number).

wholeTextFiles loads each whole file as a single record in the RDD, so count() returns the total number of files (384 in your case).
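A minimal sketch of the two calls side by side, reusing the path pattern from the question (the paths and expected counts are the asker's, not verified here):

val byLine = sc.textFile("wasb://mycontainer@mystorage.blob.core.windows.net/*/*/*/mydata.csv")
byLine.count   // total number of lines across all matched files (the huge number)

val byFile = sc.wholeTextFiles("wasb://mycontainer@mystorage.blob.core.windows.net/*/*/*/mydata.csv")
byFile.count   // total number of matched files (384 expected here)
// Each record of byFile is a (path, fullContents) pair, so per-file processing is straightforward:
byFile.keys.take(5).foreach(println)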