Question

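I have a Hive table that holds data on customer calls. For simplicity, assume it has 2 columns: the first holds the customer ID and the second holds the timestamp of the call (a unix timestamp).

I can query this table to find all the calls for each customer: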

SELECT * FROM mytable SORT BY customer_id, call_time;

The result is:

Customer1    timestamp11
Customer1    timestamp12
Customer1    timestamp13
Customer2    timestamp21
Customer3    timestamp31
Customer3    timestamp32
...

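Is it possible to create a Hive query that returns, for each customer, starting from the second call, the time interval between two successive calls? For the example above, the query should return: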

Customer1    timestamp12-timestamp11
Customer1    timestamp13-timestamp12
Customer3    timestamp32-timestamp31
...

I tried to adapt the solutions from the sql solution, but I'm stuck on the Hive limitations: it accepts subqueries only in FROM and joins must contain only equalities.

Thank you.

EDIT1:

I tried using a Hive UDF function:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.LongWritable;

public class DeltaComputerUDF extends UDF {
    // State carried over from the previously processed row
    private String previousCustomerId;
    private long previousCallTime;

    public String evaluate(String customerId, LongWritable callTime) {
        long callTimeValue = callTime.get();
        String timeDifference = null;

        // Only a row that follows a row of the same customer yields a difference
        if (customerId.equals(previousCustomerId)) {
            timeDifference = Long.toString(callTimeValue - previousCallTime);
        }

        previousCustomerId = customerId;
        previousCallTime = callTimeValue;

        return timeDifference;
    }
}

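and registered it under the name "delta".

But it seems (from the logs and the result) that it is being applied at MAP time. Two problems arise from this:

First: The table data must be sorted by customer ID and timestamp BEFORE this function is applied. The query: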

 SELECT customer_id, call_time, delta(customer_id, call_time) FROM mytable DISTRIBUTE BY customer_id SORT BY customer_id, call_time;

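does not work, because the sorting part is performed at REDUCE time, long after my function has been applied.

I could sort the table data before using the function, but I'm not happy with that because it is an overhead I was hoping to avoid.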
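To make the ordering problem concrete, here is a minimal plain-Java sketch. It does not use any Hive classes; the class name UnsortedDeltaDemo and the sample rows are invented for illustration. It feeds the same rows through the UDF's logic once in sorted order and once in the order a mapper might see them.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the stateful UDF; no Hive classes are involved.
class UnsortedDeltaDemo {
    private String previousCustomerId;
    private long previousCallTime;

    // Mirrors evaluate(): null unless the previous row had the same customer.
    String evaluate(String customerId, long callTime) {
        String diff = customerId.equals(previousCustomerId)
                ? Long.toString(callTime - previousCallTime) : null;
        previousCustomerId = customerId;
        previousCallTime = callTime;
        return diff;
    }

    // Feeds {customerId, callTime} rows through a fresh instance, in the given order.
    static List<String> run(String[][] rows) {
        UnsortedDeltaDemo udf = new UnsortedDeltaDemo();
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            out.add(udf.evaluate(row[0], Long.parseLong(row[1])));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] sorted = {{"c1", "100"}, {"c1", "130"}, {"c2", "200"}, {"c2", "260"}};
        String[][] unsorted = {{"c1", "100"}, {"c2", "200"}, {"c1", "130"}, {"c2", "260"}};
        System.out.println(run(sorted));   // [null, 30, null, 60]
        System.out.println(run(unsorted)); // [null, null, null, null]
    }
}
```

With interleaved customers the function never sees two adjacent rows of the same customer, so every difference is lost.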

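Second: In a distributed Hadoop configuration, the data is split among the available job trackers. So I believe there will be multiple instances of this function, one per mapper, which means the same customer's data can end up split between 2 mappers. In that case I would lose customer calls, which is not acceptable.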
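The effect of such a split can be sketched in plain Java, again without Hive. The names SplitDeltaDemo and Delta and the sample data are assumptions made for illustration. Each "mapper" gets its own instance with its own previous-row state, so a pair of calls that straddles a split boundary produces no difference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SplitDeltaDemo {
    // One instance per mapper: each keeps its own "previous row" state.
    static class Delta {
        private String previousCustomerId;
        private long previousCallTime;

        Long evaluate(String customerId, long callTime) {
            Long diff = customerId.equals(previousCustomerId)
                    ? Long.valueOf(callTime - previousCallTime) : null;
            previousCustomerId = customerId;
            previousCallTime = callTime;
            return diff;
        }
    }

    // Processes every split with a fresh Delta and collects the non-null differences.
    static List<Long> diffs(List<String[][]> splits) {
        List<Long> out = new ArrayList<>();
        for (String[][] split : splits) {
            Delta delta = new Delta();
            for (String[] row : split) {
                Long diff = delta.evaluate(row[0], Long.parseLong(row[1]));
                if (diff != null) {
                    out.add(diff);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] all = {{"c1", "100"}, {"c1", "130"}, {"c1", "190"}};
        String[][] first = {{"c1", "100"}, {"c1", "130"}};
        String[][] second = {{"c1", "190"}};
        System.out.println(diffs(Collections.singletonList(all))); // [30, 60]
        System.out.println(diffs(Arrays.asList(first, second)));   // [30]
    }
}
```

The difference between the calls at 130 and 190 disappears once they land in different splits, even though each split is sorted.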

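I don't know how to solve this issue. I know that DISTRIBUTE BY ensures that all data with a specific value is sent to the same reducer (thus ensuring that SORT works as expected). Does anybody know whether there is something similar for the mapper?

Next I plan to follow libjack's suggestion and use a reduce script. This "computation" is needed between some other hive queries, so I want to try everything Hive offers before moving to another tool, as suggested by Balaswamy vaddeman.

EDIT2:

I started to investigate the custom script solution. However, on the first page of chapter 14 of the Programming Hive book (the chapter that presents custom scripts), I found the following paragraph: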

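Streaming is usually less efficient than coding the comparable UDFs or InputFormat objects. Serializing and deserializing data to pass it in and out of the pipe is relatively inefficient. It is also harder to debug the whole program in a unified manner. However, it is useful for fast prototyping and for leveraging existing code that is not written in Java. For Hive users who don't want to write Java code, it can be a very effective approach.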

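So it is clear that custom scripts are not the best solution in terms of efficiency.

But how can I keep my UDF function and still make sure it works as expected in a distributed Hadoop configuration? I found the answer in the UDF Internals section of the Language Manual UDF wiki page. If I write my query: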

 SELECT customer_id, call_time, delta(customer_id, call_time) FROM (SELECT customer_id, call_time FROM mytable DISTRIBUTE BY customer_id SORT BY customer_id, call_time) t;

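the function is executed at REDUCE time, and the DISTRIBUTE BY and SORT BY constructs guarantee that all records from the same customer are processed by the same reducer, in the order of the calls.

So the above UDF and this query construct solve my problem.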
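Why this combination works can be sketched in plain Java. This is only an emulation under stated assumptions, not Hive's actual implementation: the class name DistributeSortDemo, the hash-modulo bucketing and the sample rows are invented for illustration. Rows are first routed to a "reducer" by customer ID, then sorted inside each reducer, and only then passed through the stateful delta logic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DistributeSortDemo {
    static Map<String, List<Long>> deltas(String[][] rows, int reducers) {
        // DISTRIBUTE BY customer_id: the same customer always lands in the same bucket.
        List<List<String[]>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String[] row : rows) {
            buckets.get(Math.floorMod(row[0].hashCode(), reducers)).add(row);
        }

        Map<String, List<Long>> result = new TreeMap<>();
        for (List<String[]> bucket : buckets) {
            // SORT BY customer_id, call_time: ordering inside one reducer only.
            bucket.sort(Comparator.comparing((String[] r) -> r[0])
                    .thenComparingLong(r -> Long.parseLong(r[1])));

            // The stateful delta logic now sees each customer's calls back to back.
            String prevCustomer = null;
            long prevTime = 0;
            for (String[] row : bucket) {
                long time = Long.parseLong(row[1]);
                if (row[0].equals(prevCustomer)) {
                    result.computeIfAbsent(row[0], k -> new ArrayList<>()).add(time - prevTime);
                }
                prevCustomer = row[0];
                prevTime = time;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"c1", "130"}, {"c3", "320"}, {"c1", "100"},
            {"c2", "200"}, {"c3", "310"}, {"c1", "190"}
        };
        System.out.println(deltas(rows, 2)); // {c1=[30, 60], c3=[10]}
    }
}
```

The result is the same for any number of reducers, because no customer is ever spread over two buckets.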

(Sorry for not adding the links, but I'm not allowed to because I don't have enough reputation points)

Answer 1

It's an old question, but for future reference I'm writing another proposal here:

Hive Windowing functions allow using the previous / next values in a query.

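A query along these lines could be (the first call of each customer has no previous row, so its difference is NULL):

SELECT customer_id, call_time - LAG(call_time, 1) OVER (PARTITION BY customer_id ORDER BY call_time) AS time_difference FROM mytable;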
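For readers unfamiliar with LAG, the following plain-Java sketch emulates what it returns for one partition (one customer's calls, already ordered by call time). The class and method names are hypothetical and the data is invented; this only illustrates the semantics, not how Hive evaluates window functions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LagDemo {
    // Emulates LAG(value, offset, defaultValue) over one ordered partition:
    // each row gets the value from `offset` rows earlier, or the default if none exists.
    static List<Long> lag(List<Long> partition, int offset, Long defaultValue) {
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < partition.size(); i++) {
            out.add(i >= offset ? partition.get(i - offset) : defaultValue);
        }
        return out;
    }

    // Current value minus the lagged value, skipping rows without a previous row.
    static List<Long> differences(List<Long> partition) {
        List<Long> previous = lag(partition, 1, null);
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < partition.size(); i++) {
            if (previous.get(i) != null) {
                out.add(partition.get(i) - previous.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> callTimes = Arrays.asList(100L, 130L, 190L);
        System.out.println(lag(callTimes, 1, null)); // [null, 100, 130]
        System.out.println(lag(callTimes, 1, 0L));   // [0, 100, 130]
        System.out.println(differences(callTimes));  // [30, 60]
    }
}
```

Note that a default of 0 would turn the first call's "difference" into a value derived from the raw timestamp, which is why rows without a previous call are better left as NULL and filtered out in an outer query.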

Answer 2

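You can use explicit MAP-REDUCE with another programming language such as Java or Python. From the map, emit {customer_id, call_time}; in the reducer you will then receive {customer_id, list{time_stamp}}, where you can sort these timestamps and process the data.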
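A rough plain-Java sketch of that flow is shown below. It deliberately avoids the Hadoop API, so the class MapReduceDeltaDemo, its methods and the sample rows are illustrative assumptions rather than a real job: the map and shuffle steps are emulated by grouping rows by customer ID, and the reduce step sorts one customer's timestamps and emits the consecutive differences.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceDeltaDemo {
    // Reduce step: receives one customer's timestamps in arbitrary order.
    static List<Long> reduce(List<Long> callTimes) {
        List<Long> sorted = new ArrayList<>(callTimes);
        Collections.sort(sorted);
        List<Long> diffs = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            diffs.add(sorted.get(i) - sorted.get(i - 1));
        }
        return diffs;
    }

    static Map<String, List<Long>> run(String[][] rows) {
        // Map + shuffle: emit (customer_id, call_time) and group the values by key.
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(Long.parseLong(row[1]));
        }
        // Reduce: one call per key; a customer with a single call yields an empty list.
        Map<String, List<Long>> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"c1", "130"}, {"c3", "320"}, {"c1", "100"},
            {"c2", "200"}, {"c3", "310"}, {"c1", "190"}
        };
        System.out.println(run(rows)); // {c1=[30, 60], c2=[], c3=[10]}
    }
}
```

Sorting inside the reducer keeps the sketch simple, but it requires holding all timestamps of one customer in memory at once.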

Answer 3

In case someone runs into a similar requirement, the solution I found is the following:

1) Create the custom function:

package com.example;
// imports (they depend on the hive version)
@Description(name = "delta", value = "_FUNC_(customer id column, call time column) "
    + "- computes the time passed between two successive records from the same customer. "
    + "It generates 3 columns: first contains the customer id, second contains call time "
    + "and third contains the time passed from the previous call. This function returns only "
    + "the records that have a previous call from the same customer (requirements are not applicable "
    + "to the first call)", extended = "Example:\n> SELECT _FUNC_(customer_id, call_time) AS "
    + "(customer_id, call_time, time_passed) FROM (SELECT customer_id, call_time FROM mytable "
    + "DISTRIBUTE BY customer_id SORT BY customer_id, call_time) t;")
public class DeltaComputerUDTF extends GenericUDTF {
    private static final int NUM_COLS = 3;

    private Text[] retCols; // array of returned column values
    private ObjectInspector[] inputOIs; // input ObjectInspectors
    private String prevCustomerId;
    private Long prevCallTime;

    @Override
    public StructObjectInspector initialize(ObjectInspector[] ois) throws UDFArgumentException {
        if (ois.length != 2) {
            throw new UDFArgumentException(
                    "There must be 2 arguments: customer Id column name and call time column name");
        }

        inputOIs = ois;

        // construct the output column data holders
        retCols = new Text[NUM_COLS];
        for (int i = 0; i < NUM_COLS; ++i) {
            retCols[i] = new Text();
        }

        // construct output object inspector
        List<String> fieldNames = new ArrayList<String>(NUM_COLS);
        List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(NUM_COLS);
        for (int i = 0; i < NUM_COLS; ++i) {
            // column name can be anything since it will be named by UDTF as clause
            fieldNames.add("c" + i);
            // all returned type will be Text
            fieldOIs.add(PrimitiveObjectInspectorFactory.writableStringObjectInspector);
        }

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        String customerId = ((StringObjectInspector) inputOIs[0]).getPrimitiveJavaObject(args[0]);
        Long callTime = ((LongObjectInspector) inputOIs[1]).get(args[1]);

        // Emit a row only when the previous row belongs to the same customer
        if (customerId.equals(prevCustomerId)) {
            retCols[0].set(customerId);
            retCols[1].set(callTime.toString());
            retCols[2].set(Long.toString(callTime - prevCallTime));
            forward(retCols);
        }

        // Store the current customer data, for the next line
        prevCustomerId = customerId;
        prevCallTime = callTime;
    }

    @Override
    public void close() throws HiveException {
        // Nothing to clean up
    }
}

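2) Create a jar containing this function. Suppose the jar name is myjar.jar.

3) Copy the jar to the machine that runs Hive. Suppose it is placed in /tmp

4) Define the custom function inside Hive: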

ADD JAR /tmp/myjar.jar;
CREATE TEMPORARY FUNCTION delta AS 'com.example.DeltaComputerUDTF';

5) Execute the query:

SELECT delta(customer_id, call_time) AS (customer_id, call_time, time_difference) FROM 
  (SELECT customer_id, call_time FROM mytable DISTRIBUTE BY customer_id SORT BY customer_id, call_time) t;

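Remarks:

a. I assumed that the call_time column stores its data as bigint. If it is a string, then in the process function we retrieve it as a string (as we do with customerId) and parse it to Long

b. I decided to use a UDTF instead of a UDF because this way it generates exactly the data that is needed. Otherwise (with a UDF) the generated data has to be filtered to skip the NULL values. So, with the UDF function (DeltaComputerUDF) described in the first edit of the original post, the query would be: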

SELECT customer_id, call_time, time_difference 
FROM 
  (
    SELECT customer_id, call_time, delta(customer_id, call_time) AS time_difference 
    FROM 
      (
         SELECT customer_id, call_time FROM mytable
         DISTRIBUTE BY customer_id
         SORT BY customer_id, call_time
       ) t
   ) u 
WHERE time_difference IS NOT NULL;

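c. Both functions (UDF and UDTF) work as expected regardless of the order of the rows in the table, because the inner subquery's DISTRIBUTE BY and SORT BY put each customer's calls in order on a single reducer before the function runs (so there is no need to sort the table data by customer ID and call time before using the delta functions)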

Computing the difference between consecutive records in Hadoop with Hive queries

3 answers: