I have a process that pushes a batch of data to Blob storage every hour, creating the following folder structure in my storage container:
/year=16/Month=03/Day=17/Hour=16/mydata.csv
/year=16/Month=03/Day=17/Hour=17/mydata.csv
and so on.
From my Spark context I want to access all of the mydata.csv files and process them. I figured I need to set the following so that a recursive search of the directories can be used:
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
But when I execute the following command to check how many files I picked up, it gives me a really large number, as shown below:
val csvFile2 = sc.textFile("wasb://mycontainer@mystorage.blob.core.windows.net/*/*/*/mydata.csv")
csvFile2.count
res41: Long = 106715282

Ideally it should return 24 * 16 = 384 (I also verified on the container that it holds only 384 files), but for some reason it returns 106715282.
Can someone help me understand where I went wrong?
Regards,
Kiran
Answer 0 (score: 0)
SparkContext has two similar methods: textFile and wholeTextFiles.

textFile loads each line of every file as a record in the RDD, so count() returns the total number of lines across all files (which in most cases, as in yours, is a very large number).

wholeTextFiles loads each whole file as a single record in the RDD, so count() returns the total number of files (384 in your case).
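As a rough sketch of the difference (assuming the same container path from your question and an existing SparkContext named sc):

// textFile: each record is one line of text, so count() is the total line count of all matched files.
val lines = sc.textFile("wasb://mycontainer@mystorage.blob.core.windows.net/*/*/*/mydata.csv")
println(lines.count())   // a very large number, e.g. 106715282

// wholeTextFiles: each record is a (path, fileContents) pair, so count() is the number of matched files.
val files = sc.wholeTextFiles("wasb://mycontainer@mystorage.blob.core.windows.net/*/*/*/mydata.csv")
println(files.count())   // the number of files, 384 in your case

Keep in mind that wholeTextFiles reads each file in full as a single string per record, so it suits counting files or processing many small files, not line-by-line processing of large ones.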