How to find records in a very large Hive table where header__timestamp and header__change_seq are the latest values and id is unique

Time: 2018-02-10 02:39:33

Tags: hadoop hive hql

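I have to find records in a Hive table where the combination (id, header__timestamp, header__change_seq) should be unique. However, the table can contain duplicates of (id, header__timestamp, header__change_seq), so when a record is duplicated I must fetch only one copy of it.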

private HashMap<Integer,UserInformation> userInformationHashMap;


/**
 * Default json constructor
 * @return new user object
 */
@GetMapping(path = "/defaultUserInformation")
public UserInformation test()
{
    return new UserInformation("fname", "lname", "email", "pass");
}
/**
 * Gets the users information
 * @return users information
 */
@GetMapping (path = "/userInfo")
public UserInformation getUserInfo(@RequestParam ("id") int id){
    return userInformationHashMap.get(id);
}

/**
 * Sets the users information
 * @param userInformation userInformation model
 * @return users key
 */
@PostMapping (path = "/createUser")
public int createUser(@RequestBody UserInformation userInformation){

    if(this.userInformationHashMap == null){
        this.userInformationHashMap = new HashMap<>();
    }

    int maxKey = 0;

    if(this.userInformationHashMap.size() != 0){
        maxKey = Collections.max(this.userInformationHashMap.keySet()) + 1;
    }

    this.userInformationHashMap.put(maxKey,userInformation);

    return maxKey;
}

@PutMapping (path = "/updateUserInfo")
public void updateUserInfo(@RequestParam ("id") int id, @RequestBody UserInformation userInformation) {
    if (this.userInformationHashMap.containsKey(id)) {
        this.userInformationHashMap.put(id, userInformation);
    }
}

@DeleteMapping (path = "/deleteUser")
public void deleteUser(@RequestParam ("id") int id){
    this.userInformationHashMap.remove(id);
}

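So the total number of distinct ids is count -> 244441250, but with the above query I get count -> 244442548 because of some duplicate records. I need exactly one row per id: the one where header__timestamp and header__change_seq are the maximum.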

1 Answer:

Answer 0: (score: 0)

@Rahul; please try this. It uses row_number(), so when id, header__timestamp and header__change_seq are duplicated it will still pick only one record. Hope this helps.

select *
from (
    select *,
           row_number() over (
               partition by id
               order by header__timestamp desc, header__change_seq desc
           ) as rnk
    from table_name
) t
where t.rnk = 1;
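
To see why row_number() matters here, the same query shape can be run on a toy table. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Hive (assuming the bundled SQLite is 3.25 or newer, which is when window functions were added), and the sample rows and the payload column are made up for illustration. It compares row_number() with rank(), which gives tied rows the same number and so lets exact duplicates through.

```python
import sqlite3

# In-memory stand-in for the Hive table; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_name ("
    "id INTEGER, header__timestamp TEXT, header__change_seq INTEGER, payload TEXT)"
)
rows = [
    (1, "2018-02-01 10:00:00", 100, "old"),
    (1, "2018-02-09 08:30:00", 205, "latest"),
    (1, "2018-02-09 08:30:00", 205, "latest"),  # exact duplicate of the latest row
    (2, "2018-02-05 12:00:00", 150, "only"),
    (3, "2018-02-07 09:00:00", 180, "older seq"),
    (3, "2018-02-07 09:00:00", 181, "newer seq"),
]
conn.executemany("INSERT INTO table_name VALUES (?, ?, ?, ?)", rows)


def latest_per_id(window_fn):
    """Keep the rows numbered 1 per id, using the given window function."""
    query = f"""
        SELECT id, header__timestamp, header__change_seq, payload
        FROM (
            SELECT *,
                   {window_fn}() OVER (
                       PARTITION BY id
                       ORDER BY header__timestamp DESC, header__change_seq DESC
                   ) AS rnk
            FROM table_name
        ) t
        WHERE t.rnk = 1
        ORDER BY id
    """
    return conn.execute(query).fetchall()


deduped = latest_per_id("row_number")  # one row per id, even with exact ties
with_ties = latest_per_id("rank")      # tied rows all get rank 1, so duplicates remain
distinct_ids = conn.execute(
    "SELECT COUNT(DISTINCT id) FROM table_name"
).fetchone()[0]

print(len(deduped), len(with_ties), distinct_ids)
```

With row_number() the number of rows returned should match count(distinct id), which is the check described in the question. When two rows tie on both ordering columns, which of them gets number 1 is arbitrary, but since they are duplicates on those columns that usually does not matter.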