I'm trying to run ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) in a BigQuery script, and I keep hitting a "Resources exceeded" error.
The table is 219.96 GB and has 1,611,220,127 rows.
Here is the script:
With cte as (
SELECT
Source,
ROW_NUMBER() OVER (PARTITION BY FirstName, LastName, MiddleName, Address, Address2, City, State, Zip ORDER BY Attom_ID DESC) as rnk
,FirstName, LastName, MiddleName, Gender, Age, DOB, Address, Address2, City, State, Zip, Zip4, TimeZone, Income, HomeValue, Networth, MaritalStatus, IsRenter, HasChildren, CreditRating, Investor, LinesOfCredit, InvestorRealEstate, Traveler, Pets, MailResponder, Charitable, PolicalDonations, PoliticalParty, Attom_ID, GEOID, Score, Score1, Score2, Score3, Score4, Score5, Latitude, Longitude
from `db.ds.tblA`
) select * from cte where rnk = 1
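Although this is a single table, it is the result of a join: all the columns before Attom_ID come from one table, while Attom_ID, GEOID, and the rest come from a second table. I believe the join produced something close to a Cartesian product in the result set.
The table contains many duplicates, and I'm trying to deduplicate it. I'm wary of using MAX(Attom_ID) with GROUP BY, because I want to be sure the GEOID and Score values I get belong to that same Attom_ID. I don't want them mixed up across rows.
The problem is that this particular query exceeds resources, so I'd like to know whether I have any other options here. Thanks!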
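Answer 0 (score: 2)
The query below is equivalent to your original one in terms of its result (apart from the rnk helper column) and usually resolves the "Resources exceeded" problem: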
#standardSQL
SELECT r.* FROM (
SELECT
ARRAY_AGG(STRUCT(Source, FirstName, LastName, MiddleName, Gender, Age, DOB, Address, Address2, City, State, Zip, Zip4, TimeZone, Income, HomeValue, Networth, MaritalStatus, IsRenter, HasChildren, CreditRating, Investor, LinesOfCredit, InvestorRealEstate, Traveler, Pets, MailResponder, Charitable, PolicalDonations, PoliticalParty, Attom_ID, GEOID, Score, Score1, Score2, Score3, Score4, Score5, Latitude, Longitude) ORDER BY Attom_ID DESC LIMIT 1)[OFFSET(0)] AS r
FROM `db.ds.tblA`
GROUP BY FirstName, LastName, MiddleName, Address, Address2, City, State, Zip
)
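On the concern about GEOID and Score getting mixed up: the aggregate picks one whole row per group, so every field in the output comes from the same input row. Here is a minimal sketch on made-up data to illustrate this. The inline `tblA`, its values, and the reduced column list are assumptions for illustration only, not your real table.

```sql
#standardSQL
-- Hypothetical toy data: the two 'Ann Lee' rows share the same grouping key.
WITH tblA AS (
  SELECT 'Ann' AS FirstName, 'Lee' AS LastName, '1 Main St' AS Address,
         101 AS Attom_ID, 'G-101' AS GEOID, 0.4 AS Score UNION ALL
  SELECT 'Ann', 'Lee', '1 Main St', 205, 'G-205', 0.9 UNION ALL
  SELECT 'Bob', 'Ray', '9 Oak Ave', 150, 'G-150', 0.7
)
SELECT r.*
FROM (
  -- Aggregating the whole row `t` keeps Attom_ID, GEOID and Score together;
  -- ORDER BY ... DESC LIMIT 1 keeps only the row with the highest Attom_ID.
  SELECT ARRAY_AGG(t ORDER BY t.Attom_ID DESC LIMIT 1)[OFFSET(0)] AS r
  FROM tblA AS t
  GROUP BY t.FirstName, t.LastName, t.Address
)
```

For the 'Ann Lee' group this should keep the row with Attom_ID 205, with GEOID 'G-205' and Score 0.9 taken from that same row. That differs from MAX(Attom_ID), MAX(GEOID), MAX(Score) under GROUP BY, where each maximum is computed independently and can come from different rows.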