Question

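An engineer gave me some data and wants to know whether I can summarize it so that the number of rows is reduced, ideally to one row per sample. Below is a GitHub link to some of the data; you can see it is full of redundancies and nulls. I tried a simple group by, but that did not work. I am now creating multiple views, each of which adds one column. When creating each view, I require that every column is not null so I can try to reduce the null rows. However, because of the redundancies, I still get repeated sample IDs and column values. Here is what I did for one of the views: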

create or replace view leslie_6 as
select distinct s.SMPL_ID,s.EFFECTIVENESS_PNS_SCORE_,
s.PNS_BLANK,s.pns_nacl,
s.settable_solids,
t.perc_pass_10m
from leslie_5 s right join leslietable t on s.SMPL_ID = t.smpl_id
where s.EFFECTIVENESS_PNS_SCORE_ is not null and
s.PNS_BLANK is not null and 
s.pns_nacl is not null and
s.settable_solids is not null and 
t.perc_pass_10m is not null
group by s.SMPL_ID,s.EFFECTIVENESS_PNS_SCORE_,s.PNS_BLANK,
s.pns_nacl,s.settable_solids,t.perc_pass_10m
order by s.SMPL_ID

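Another problem is that the sample ID is the closest thing I have to a key. I was hoping I could break the data down well enough for the sample ID to serve as the key, but that has not really worked. The table has over 7500 rows and is a mess.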

Here is a sample of the data that I posted to GitHub. If you scroll to the bottom of the data, you get a horizontal scroll bar.

https://github.com/thomasawolff/verification_text_data/blob/master/Lydia%20query%20deicers%2020161005_sample.csv

Answer 1

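Here is the query I ended up using to solve this. Using max() on all the columns worked very well. I then subtracted the test IDs that had more than one value entered per column, and what remained was correct. I did not think max() would work on non-numeric values. It does: max() ignores nulls, so when a column has only one non-null entry per group, that entry is what comes back.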

But I could use something like:

select max(to_number(regexp_substr(...)))

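I may do that in case this needs to be done again. I had to focus on removing the double entries because they were messing up the max() output for those test IDs.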
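To show why converting to a number before taking max() can matter, here is a minimal sketch using Python's built-in sqlite3 module rather than Oracle. The table `readings` and its values are made up for illustration, and SQLite's `CAST(... AS REAL)` stands in for Oracle's `to_number`. The only point is that max() on text compares values as strings.

```python
import sqlite3

# Hypothetical table: numeric readings stored as text, plus one NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (smpl_id INTEGER, ph TEXT)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, "9"), (1, "10"), (1, None)],
)

# MAX() on text compares character by character, so "9" beats "10".
text_max = conn.execute("SELECT MAX(ph) FROM readings").fetchone()[0]

# Converting to a number first gives the numeric maximum; the NULL is skipped.
num_max = conn.execute(
    "SELECT MAX(CAST(ph AS REAL)) FROM readings"
).fetchone()[0]

print(text_max, num_max)
```

If the columns are already stored as numbers, the conversion is unnecessary and plain max() gives the numeric result.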

---*** Takes the max or the value of the only entry in a column
create or replace view maxdataset as
select t.smpl_id,
t.matl_cd,t.geog_area_t,
max(t.assay) as assay,
max(t.effectiveness_pns_score_) as effectiveness_pns_score,
max(t.pns_blank) as pns_blank,
max(t.pns_nacl) as pns_nacl,
max(t.settable_solids) as settable_solids,
max(t.perc_pass_10m) as perc_pass_10m,
max(t.ph) as ph,
max(t.as_) as as_,
max(t.ba) as ba,
max(t.cd) as cd,
max(t.cr) as cr,
max(t.cu) as cu,
max(t.pb) as pb,
max(t.hg) as hg,
max(t.se) as se,
max(t.zn) as zn,
max(t.cn) as cn,
max(t.p) as p,
max(t.s) as s,
max(t.sulfate) as sulfate,
max(t.phosphate) as phosphate,
max(t.k) as k,
max(t.ca) as ca,
max(t.mg) as mg,
max(t.nitrite) as nitrite,
max(t.nitrate) as nitrate,
max(t.chloride) as chloride
from leslietable t
group by t.smpl_id,t.matl_cd,t.geog_area_t
;
---*** Selects just the test IDs having more than one distinct entry in some column
create or replace view doubleEffect_short as
select t.smpl_id
from LESLIETABLE t inner join LESLIETABLE s
on t.smpl_id = s.smpl_id where 
(t.effectiveness_pns_score_ <> s.effectiveness_pns_score_) or 
(t.pns_blank <> s.pns_blank) or (t.pns_nacl <> s.pns_nacl) or
(t.settable_solids <> s.settable_solids) or
(t.perc_pass_10m <> s.perc_pass_10m) or
(t.ph <> s.ph) or (t.as_ <> s.as_) or
(t.ba <> s.ba) or
(t.cd <> s.cd) or (t.cr <> s.cr) or
(t.cu <> s.cu) or (t.pb <> s.pb) or
(t.hg <> s.hg) or (t.se <> s.se) or
(t.zn <> s.zn) or (t.cn <> s.cn) or
(t.p <> s.p) or (t.sulfate <> s.sulfate) or
(t.phosphate <> s.phosphate) or
(t.k <> s.k) or (t.ca <> s.ca) or
(t.mg <> s.mg) or (t.nitrite <> s.nitrite) or
(t.nitrate <> s.nitrate) or (t.chloride <> s.chloride)
group by t.smpl_id
order by t.smpl_id
;
---*** Outputs all max data set rows for test IDs having more than one value per column
create or replace view final_data_sifter as
select  t.smpl_id,t.matl_cd,t.geog_area_t,t.assay,t.effectiveness_pns_score,
t.pns_blank,t.pns_nacl,t.settable_solids,t.perc_pass_10m,t.ph,t.as_,t.ba,t.cd,
t.cr,t.cu,t.pb,t.hg,t.se,t.zn,t.cn,t.p,t.s,t.sulfate,t.phosphate,t.k,t.ca,t.mg,
t.nitrite,t.nitrate,t.chloride 
from maxdataset t join doubleeffect_short s
on t.smpl_id = s.smpl_id
;
---*** Subtracts rows with multiple values per column from max data set
create or replace view finalDataset_incomplete as
select * from maxdataset t
minus
select * from final_data_sifter
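
The three steps above (collapse with max(), find the IDs with conflicting entries, subtract them) can be tried on a toy table. This is only a sketch under assumptions: it uses Python's built-in sqlite3 module instead of Oracle, the table `samples` with columns `ph` and `cu` and all values are hypothetical, and SQLite's `EXCEPT` stands in for Oracle's `MINUS`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (smpl_id INTEGER, ph REAL, cu REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        (1, 7.1, None),  # sample 1: each value sits on its own row
        (1, None, 0.3),
        (2, 6.8, None),  # sample 2: two conflicting ph entries
        (2, 6.9, 0.5),
    ],
)

# Step 1: collapse to one row per sample; MAX() skips NULLs.
conn.execute(
    "CREATE VIEW maxdata AS "
    "SELECT smpl_id, MAX(ph) AS ph, MAX(cu) AS cu "
    "FROM samples GROUP BY smpl_id"
)

# Step 2: IDs where a self-join finds two different non-NULL values.
conn.execute(
    "CREATE VIEW conflicts AS "
    "SELECT DISTINCT t.smpl_id FROM samples t "
    "JOIN samples s ON t.smpl_id = s.smpl_id "
    "WHERE t.ph <> s.ph OR t.cu <> s.cu"
)

collapsed = conn.execute("SELECT * FROM maxdata ORDER BY smpl_id").fetchall()
conflicting = [row[0] for row in conn.execute("SELECT smpl_id FROM conflicts")]

# Step 3: subtract the collapsed rows that belong to conflicting IDs.
clean = conn.execute(
    "SELECT * FROM maxdata "
    "EXCEPT SELECT m.* FROM maxdata m "
    "JOIN conflicts c ON m.smpl_id = c.smpl_id "
    "ORDER BY 1"
).fetchall()

print(collapsed, conflicting, clean)
```

Comparing a value with NULL using `<>` never evaluates to true, so a sample whose values are merely spread across rows (sample 1 here) is not flagged as a conflict; only genuinely different entries (sample 2) are.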
