Question

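I have two tables that are joined together. The first table has transaction-level detail, which causes the key I use to join to the second table to be duplicated. When I left join the second table, the measure "company_spend" is heavily inflated.

I need a way to keep only a single value for the duplicated data. My idea was to run DISTINCT on just those columns, but as far as I can see BigQuery does not support DISTINCT on only some of the columns rather than all of them.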

 SELECT UPPER(cwnextt.Current_Contract_Number)         AS Current_Contract_Number,
       UPPER(cwnextt.Replacement_Contract_Number)     AS Replacement_Contract_Number,
       UPPER(cwnextt.Current_Contract_Name)           AS Current_Contract_Name,
       UPPER(cwnextt.Supplier_Top_Parent_Entity_Code) AS Supplier_Top_Parent_Entity_Code,
       UPPER(cwnextt.Supplier_Top_Parent_Name)        AS Supplier_Top_Parent_Name,
       UPPER(cwnextt.company_Entity_Code)             AS company_Entity_Code,
       UPPER(cwnextt.Facility_Name)                   AS Facility_Name,
       smart.company_Spend                            AS companySpend
  FROM `test_etl_field.contracts_with_member_entity_codes_test_view_2` cwnextt 
  --this table is what causes the tables joined below to duplicate,
  --but I need all of its data as well, in its current format.
LEFT JOIN `test.trans_analysis` tsa 
    ON TRIM(UPPER(cwnextt.company_entity_code)) = TRIM(UPPER(tsa.company_entity_code)) 
       AND TRIM(UPPER(cwnextt.Supplier_Top_Parent_Entity_Code)) = TRIM(UPPER(tsa.manufacturer_top_parent_entity_code)) 
       AND TRIM(UPPER(cwnextt.Current_Contract_Name)) = TRIM(UPPER(tsa.contract_category)) 
       AND cwnextt.spend_period_yyyyqmm = tsa.spend_period_yyyyqmm 
       --the table joined next (smart) contains "company_spend", which is now duplicated
LEFT JOIN `test_etl_field.ecr_smart_data` smart
    ON smart.company_entity_code = cwnextt.company_entity_code
       AND (smart.contract_number = cwnextt.current_contract_number
            OR smart.contract_number = cwnextt.replacement_contract_number)
       AND smart.month_key = cwnextt.spend_period_yyyyqmm

If there is a way to keep company_spend from being duplicated by the second left join, that is what I am after.

Answer 1

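I'm not sure I understand all the details of the question, but here is what the BigQuery documentation says:

SELECT DISTINCT

A SELECT DISTINCT statement discards duplicate rows and returns only the remaining rows.

You cannot apply DISTINCT to specific columns only, because that would not make sense. Suppose you have 4 columns and call DISTINCT on 3 of them: what should SQL do with the last column?
You have to tell SQL which value to keep for the remaining column, and GROUP BY is the right tool for that.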
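To make the point above concrete, here is a minimal sketch in BigQuery Standard SQL. The inline table `rows_with_duplicates` and its values are made up for illustration; they are not taken from the question. The three grouping columns play the role of the "DISTINCT" columns, and `ANY_VALUE` is one possible way to tell SQL what to do with the fourth.

```sql
-- Hypothetical sample data: the first two rows agree on the three key columns.
WITH rows_with_duplicates AS (
  SELECT 'A' AS contract_number, 'E1' AS entity_code, 201901 AS month_key, 100 AS company_spend
  UNION ALL
  SELECT 'A', 'E1', 201901, 100
  UNION ALL
  SELECT 'B', 'E1', 201901, 250
)
SELECT
  contract_number,
  entity_code,
  month_key,
  -- ANY_VALUE picks an arbitrary value from each group; swap in MAX, MIN,
  -- SUM or AVG if the choice of value matters.
  ANY_VALUE(company_spend) AS company_spend
FROM rows_with_duplicates
GROUP BY contract_number, entity_code, month_key;
```

This returns one row per distinct (contract_number, entity_code, month_key) combination, so the two identical 'A' rows collapse into one. `ANY_VALUE` is only safe when every row in a group carries the same value, as in this sample.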

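So, if you want to:

Remove columns that are duplicated: just adjust your SELECT to return only the columns you need.
Remove rows that have the same values in specific columns: I suggest using GROUP BY on the target columns and applying whichever aggregate you want (first, average, sum, or anything else) to the rest.
Remove a value from one row when another row has the same value: you probably don't want to do this. Rows should keep their values, and once a value is removed you cannot get it back. Besides, the same question arises: which row do you want to keep?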
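The GROUP BY option could look roughly like the sketch below when applied to the tables from the question. This is an assumption-laden illustration, not a verified fix: it keeps only a few of the original columns, leaves out the join to `test.trans_analysis` for brevity, and uses `MAX` on the assumption that every duplicated row in a group carries the same `company_Spend`. The alias `spend_period` is introduced here for readability.

```sql
SELECT
  UPPER(cwnextt.Current_Contract_Number) AS Current_Contract_Number,
  UPPER(cwnextt.company_Entity_Code)     AS company_Entity_Code,
  cwnextt.spend_period_yyyyqmm           AS spend_period,
  -- Assumption: all rows in a group repeat the same spend, so MAX returns
  -- that single value instead of adding up the duplicates.
  MAX(smart.company_Spend)               AS companySpend
FROM `test_etl_field.contracts_with_member_entity_codes_test_view_2` AS cwnextt
LEFT JOIN `test_etl_field.ecr_smart_data` AS smart
    ON smart.company_entity_code = cwnextt.company_entity_code
       AND (smart.contract_number = cwnextt.current_contract_number
            OR smart.contract_number = cwnextt.replacement_contract_number)
       AND smart.month_key = cwnextt.spend_period_yyyyqmm
-- Ordinals avoid ambiguity between the select-list aliases and the
-- underlying column names of the same name.
GROUP BY 1, 2, 3;
```

Note that grouping collapses the transaction-level rows of the first table, so this only fits if one row per contract, entity, and period is acceptable in the output. If a contract can match on both its current and its replacement number with different spend values, `MAX` would silently keep the larger one, and a different aggregate may be more appropriate.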

Hope this helps! If you need a more specific answer, feel free to clarify your question.

Answer 2

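Although I could not solve this in SQL, I used Tableau with a FIXED LOD expression to aggregate the data coming through the duplicates, so that end users see an accurate output. Not ideal, but the SQL route did not make sense.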

BigQuery - remove duplicates on some columns, but not all
