Question

I have a table:

company   invest_type  date  round
----------------------------------
A         regular      2011  
A         regular      2011  
A         regular      2012  
A         special      2010  abcd
A         special      2010  abcd

B         regular      2011  
B         regular      2011  
B         regular      2012  
B         special      2010  cdcd
B         special      2010  zzzz

C         regular      2012  
C         regular      2012  
C         special      2010  
C         special      2010

I want to display the rows like this:

company  dates
A        2010,2011,2011,2012
B        2010,2010,2011,2011,2012
C        2010,2012,2012

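In other words, the dates of special investments are deduplicated (normally by the specified round), but the dates of regular investments are not.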

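I have tried `GROUP_CONCAT(DISTINCT date, invest_type)`, but it doesn't come close. Basically I want the distinct values of `date`, except when `round` is null, in which case I want to keep the duplicate values. If a round exists, deduplicate based on the round; if not, assume all special investments belong to the same round and deduplicate them.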

Answer 1

Use a subquery that replaces the empty round on regular investments with a counter, so that those rows become unique, then use `SELECT DISTINCT` to deduplicate everything else. Then apply `GROUP_CONCAT` to the result.
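
The counter idea described in this answer could be sketched roughly as follows. This is only an illustration, not the answerer's original query: it assumes MySQL 8.0 or later (for `ROW_NUMBER()`), reuses the table name `my_table` from the other answer, and introduces a hypothetical column alias `round_key`. On older MySQL versions a user-variable counter would have to stand in for `ROW_NUMBER()`.

```sql
SELECT company,
       GROUP_CONCAT(`date` ORDER BY `date` SEPARATOR ',') AS dates
FROM (
  -- DISTINCT is applied after the window function is evaluated,
  -- so every regular row already carries its own unique key here.
  SELECT DISTINCT
         company,
         invest_type,
         `date`,
         CASE
           -- regular rows: unique counter, so none of them is removed
           WHEN invest_type = 'regular'
             THEN CAST(ROW_NUMBER() OVER (ORDER BY company, `date`) AS CHAR)
           -- special rows: keep the round, so equal rounds collapse
           ELSE `round`
         END AS round_key   -- hypothetical alias, not in the original
  FROM my_table
) AS dedup
GROUP BY company;
```

Keeping `invest_type` in the select list prevents a counter value from accidentally matching a special row's round name. Special rows with no round at all are treated as duplicates of each other by `DISTINCT`, which matches the "assume the same round" rule in the question.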

Answer 2

You can do the deduplication in an inline view and the group_concat() in the outer query, like this:

select
  company,
  group_concat(`date` order by `date` ASC separator ',') as dates
from (
  select distinct company, `date`
  from my_table
  where invest_type = 'special'
  union all
  select company, `date`
  from my_table
  where invest_type != 'special'
) dedup
group by company
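
Note that the query above deduplicates special rows on (company, date) only, so for the sample data company B's two 2010 rows with different rounds would collapse into a single 2010. If deduplication by round is wanted, as the question describes, one possible variant is to carry `round` through the inline view. This is a sketch under that assumption, not part of the original answer:

```sql
SELECT company,
       GROUP_CONCAT(`date` ORDER BY `date` ASC SEPARATOR ',') AS dates
FROM (
  -- special rows: one row per distinct (company, date, round)
  SELECT DISTINCT company, `date`, `round`
  FROM my_table
  WHERE invest_type = 'special'
  UNION ALL
  -- all other rows: kept as-is, duplicates included
  SELECT company, `date`, `round`
  FROM my_table
  WHERE invest_type != 'special'
) AS dedup
GROUP BY company;
```

Both branches of the `UNION ALL` must select the same number of columns, which is why `round` also appears in the second branch even though it plays no role there.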

