I want to combine three data sources to produce some output. The file sizes are:
File1.json: 378 MB
File2.json: 72 KB
File3.json: 500 KB
@extractFile1 = EXTRACT columnList FROM PATH "path/File1.json" USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
@extractFile2 = EXTRACT columnList FROM PATH "path/File2.json" USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
@extractFile3 = EXTRACT columnList FROM PATH "path/File3.json" USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
@result =
    SELECT f1.column, f2.column, f1.column, f3.column
    FROM @extractFile3 AS f3
    INNER JOIN (
        SELECT f3new.column,
               f3new.column AS somename
        FROM @extractFile1 AS f1
        INNER JOIN @extractFile3 AS f3new ON f1.column == f3new.column
        GROUP BY f3new.column
    ) AS first
    ON f3.column == somename
    INNER JOIN @extractFile1 AS f1 ON f3.column == f1.column
    INNER JOIN @extractFile2 AS f2 ON f1.column == f3.column;
After running this, the merge operation in the job graph shows Writes: 195 GB and is still climbing. It has been running on a single vertex for 70 minutes.
Does anyone know how a merge operation in the execution plan can even write that much data?
Answer 0 (score: 0)
Have you tried turning on the InputFileGrouping preview feature? It dramatically improved performance for me when processing hundreds of small JSON files in ADLA.
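For reference, U-SQL preview features are enabled with a SET statement at the top of the script. A minimal sketch, assuming the flag is spelled `InputFileGrouping` as named in the answer (the exact flag string may vary by ADLA runtime version):

```sql
// Enable the preview feature that groups many small input files into
// fewer extract vertices, reducing per-file scheduling overhead.
SET @@FeaturePreviews = "InputFileGrouping:on";

// The rest of the script (EXTRACT / SELECT / OUTPUT) follows unchanged.
@extractFile2 =
    EXTRACT columnList
    FROM "path/File2.json"
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor();
```

Note this only helps the extract stage; if the merge is writing 195 GB because duplicate join keys are fanning out rows, the join logic itself would also need attention (e.g. deduplicating keys before joining).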