Check the word count in a string and remove low-count words - Hive

Time: 2017-09-25 19:59:56

Tags: sql hive

Suppose I have the following table:

date_part             string_word                          id
2017-08-08       India America Advance Apartments           1
2017-08-08       Apartments Planner Headlines               1
2017-08-08       India America Headlines Gucci              1
2017-08-08       Images Same Thing Africa                   2
2017-08-08       Images                                     2
2017-08-07       India America Advance Apartments           2
2017-08-07       Apartments Planner Headlines               3
2017-08-07       India America Headlines Gucci              3
2017-08-07       Images Same Thing Africa                   3
2017-08-07       Images                                     4

Now I want to find the word count for each day and remove the low-count words. To find the word count, I wrote the following query:

SELECT date_part, word, COUNT(*) as total_word_count
FROM table_name LATERAL VIEW explode(split(string_word, ' ')) lTable as word 
where date_part > '2017-08-05'
GROUP BY date_part, word

This gives the following:

date_part       word        total_word_count
2017-08-08      India            2
2017-08-08      America          2
2017-08-08      Advance          1
2017-08-08      Apartments       2
2017-08-08      Planner          1
2017-08-08      Headlines        2
2017-08-08      Gucci            1
2017-08-08      Images           2
2017-08-08      Same             1
2017-08-08      Thing            1
2017-08-08      Africa           1
2017-08-07      India            2
2017-08-07      America          2
2017-08-07      Advance          1
2017-08-07      Apartments       2
2017-08-07      Planner          1
2017-08-07      Headlines        2
2017-08-07      Gucci            1
2017-08-07      Images           2
2017-08-07      Same             1
2017-08-07      Thing            1
2017-08-07      Africa           1

Now I want to remove the words whose count is less than 2, i.e. for each date, the words with a count of 1 should be removed. This is the output I want to obtain, and it must be computed separately for each day.

Can someone help me do this?

Thanks

1 answer:

Answer 0 (score: 0)

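-- For each row, strip every word that occurs only once on that row's date.
-- The subquery builds, per date, a '|'-separated list of those single-occurrence words.
select      t.date_part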
           ,regexp_replace(t.string_word,concat('\\s?\\b(',e.words,')\\b'),'')    as string_word
           ,t.id

from                    table_name  as t

            join       (select      date_part
                                   ,concat_ws('|',collect_list (col)) as words

                        from       (select      date_part
                                               ,e.col

                                    from        table_name t
                                                lateral view explode(split(t.string_word,'\\s+')) e

                                    group by    date_part
                                               ,e.col

                                    having      count(*) = 1
                                    ) e

                        group by    date_part
                        ) e

            on          e.date_part =
                        t.date_part
;
+-------------+---------------------------+-----+
|  date_part  |        string_word        | id  |
+-------------+---------------------------+-----+
| 2017-08-07  | India America Apartments  | 2   |
| 2017-08-07  | Apartments Headlines      | 3   |
| 2017-08-07  | India America Headlines   | 3   |
| 2017-08-07  | Images                    | 3   |
| 2017-08-07  | Images                    | 4   |
| 2017-08-08  | India America Apartments  | 1   |
| 2017-08-08  | Apartments Headlines      | 1   |
| 2017-08-08  | India America Headlines   | 1   |
| 2017-08-08  | Images                    | 2   |
| 2017-08-08  | Images                    | 2   |
+-------------+---------------------------+-----+
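
To check which words survive on each day, one possible sketch is to take the counting query from the question and add a HAVING clause. The table name `table_name`, the alias `lTable`, the column names and the date filter are all taken from the question; the threshold of 2 comes from the stated requirement to drop words with a count below 2. This is not part of the answer above, only a way to inspect the per-day counts.

```sql
-- Per-day word counts, keeping only words that occur at least twice on that day.
-- Assumes words are separated by single spaces, as in the question's query.
SELECT date_part, word, COUNT(*) AS total_word_count
FROM table_name
LATERAL VIEW explode(split(string_word, ' ')) lTable AS word
WHERE date_part > '2017-08-05'
GROUP BY date_part, word
HAVING COUNT(*) >= 2;
```

On the sample data this should leave India, America, Apartments, Headlines and Images for each of the two dates, each with a count of 2, which matches the words remaining in the result above. Row order is not guaranteed without an ORDER BY.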