incremental_mock = spark.createDataFrame(...) # dataframe having the same schema as `incremental`
incremental_mock.createOrReplaceTempView('incremental_mock')  # `registerTempTable` is deprecated since Spark 2.0
# `withMockTable` is the sort of functionality I'm searching for:
# it should make `latest` use `incremental_mock` instead of `incremental`
latest = spark.table('latest').withMockTable('incremental', 'incremental_mock')
result = latest.first()
expected = spark.sql("""
SELECT * FROM incremental_mock
ORDER BY time DESC
LIMIT 1
""").first()
assert result == expected
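This is my query. I'm using SQL Server 13; it returns 6752 rows, 44 of which are duplicates. I've done what I can to avoid showing duplicate entries, but I'm going about it the wrong way, so I'm looking for some helpful hints :-) One of the biggest problems is that all fields are required, so I can't drop "AEC.aec_workstation.geometry", which causes trouble with SELECT DISTINCT.
Answer 0 (score: 1)
Find a PK value from the first table for a row that comes back duplicated, and start with the following query: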
SELECT
COUNT(1)
FROM
AEC.gwd_people
WHERE
AEC.gwd_people.PrimaryKeyColumn = 'SomeValue'
Now start adding the joins one at a time, checking the result of COUNT(1) after each one:
SELECT
COUNT(1)
FROM
AEC.gwd_people
LEFT OUTER JOIN AEC.view_iam_r_unitp_building ON AEC.view_iam_r_unitp_building.IDUNITPROD = AEC.gwd_people.cod_sector
WHERE
AEC.gwd_people.PrimaryKeyColumn = 'SomeValue'
Then...
SELECT
COUNT(1)
FROM
AEC.gwd_people
LEFT OUTER JOIN AEC.view_iam_r_unitp_building ON AEC.view_iam_r_unitp_building.IDUNITPROD = AEC.gwd_people.cod_sector
LEFT OUTER JOIN AEC.aec_r_workstation_people ON AEC.gwd_people.cod_people = AEC.aec_r_workstation_people.cod_people
WHERE
AEC.gwd_people.PrimaryKeyColumn = 'SomeValue'
...until you see the row count suddenly increase where you don't expect it to. You most likely have:
...or a combination of these.
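The join-by-join check above can be scripted. This is only a minimal sketch: it uses Python's built-in sqlite3 with tiny made-up tables (`people`, `building`, `workstation_people`) as simplified stand-ins for the AEC tables, and `count_after_each_join` is a hypothetical helper, not anything from the original query.

```python
import sqlite3

# Hypothetical miniature tables; names and columns are simplified stand-ins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (cod_people INTEGER PRIMARY KEY, cod_sector INTEGER);
    CREATE TABLE building (idunitprod INTEGER, name TEXT);
    CREATE TABLE workstation_people (cod_people INTEGER, cod_workstation INTEGER);
    INSERT INTO people VALUES (1, 10);
    INSERT INTO building VALUES (10, 'A');
    INSERT INTO workstation_people VALUES (1, 100), (1, 101);
""")

# Each entry is one more LEFT OUTER JOIN appended to the base query, in order.
joins = [
    "LEFT OUTER JOIN building ON building.idunitprod = people.cod_sector",
    "LEFT OUTER JOIN workstation_people "
    "ON workstation_people.cod_people = people.cod_people",
]

def count_after_each_join(conn, joins, pk_value):
    """Return COUNT(1) for the base query and after each additional join."""
    counts = []
    for n in range(len(joins) + 1):
        sql = (
            "SELECT COUNT(1) FROM people "
            + " ".join(joins[:n])
            + " WHERE people.cod_people = ?"
        )
        counts.append(conn.execute(sql, (pk_value,)).fetchone()[0])
    return counts

counts = count_after_each_join(conn, joins, 1)
print(counts)  # [1, 1, 2]
```

The first position where the count goes up points at the join that multiplies rows; here it is the second join, because the person has two workstation rows.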
Answer 1 (score: 0)
Your table design makes it hard to understand how the tables relate to one another. This is how it looks to me:
gwd_department {1:n} gwd_people
gwd_people {m:n} aec_workstation
gwd_people {m:n} view_iam_r_unitp_building
gwd_people {?:n} gwd_cost_center
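So for one person who is linked to 3 aec_workstation rows and 4 view_iam_r_unitp_building rows, you will produce 3 x 4 = 12 result rows. Is there no other relationship between aec_workstation and view_iam_r_unitp_building? If not, why are you combining them in one query?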
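To see the 3 x 4 = 12 effect in isolation, here is a rough sketch using Python's sqlite3. The two link tables (`person_workstation`, `person_building`) and their columns are invented for the illustration and are not the real AEC schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical link tables: one person, 3 workstations, 4 buildings.
    CREATE TABLE person_workstation (cod_people INTEGER, cod_workstation INTEGER);
    CREATE TABLE person_building (cod_people INTEGER, cod_building INTEGER);
    INSERT INTO person_workstation VALUES (1, 101), (1, 102), (1, 103);
    INSERT INTO person_building VALUES (1, 201), (1, 202), (1, 203), (1, 204);
""")

# Joining both unrelated m:n relations through the person pairs every
# workstation row with every building row.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM person_workstation AS w
    JOIN person_building AS b ON b.cod_people = w.cod_people
    WHERE w.cod_people = 1
""").fetchone()[0]

# Querying each relation on its own keeps the counts at 3 and 4.
workstation_rows = conn.execute(
    "SELECT COUNT(*) FROM person_workstation WHERE cod_people = 1"
).fetchone()[0]
building_rows = conn.execute(
    "SELECT COUNT(*) FROM person_building WHERE cod_people = 1"
).fetchone()[0]

print(joined_rows, workstation_rows, building_rows)  # 12 3 4
```

Nothing in the join condition ties a workstation to a building, so the result is a per-person cross product of the two sets.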
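I don't know whether cod_cdc is supposed to be short for cod_cost_center or something else. If that is also an m:n relationship, you are doing the same thing again with gwd_cost_center in relation to aec_workstation and view_iam_r_unitp_building.
That said: either add the missing criteria, or ask yourself what you actually want to select.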