HiveQL & XPath - How to extract values and replace some characters

Asked: 2014-02-28 01:08:50

Tags: xml xpath hadoop hive hiveql

I have an XML blob (shown below) stored in a Hive log table.

<user>
    <uid>1424324325</uid>
    <attribs>
        <field>
        ...
        </field>
        <field>
            <name>first</name>
            <value>Joh,n</value>
        </field>
        <field>
        ...
        </field>
        <field>
            <name>last</name>
            <value>D,oe</value>
        </field>
        <field>
        ...
        </field>
    </attribs>
</user>

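Each row of the Hive table holds information about a different user, and I want to extract the values of uid, first name, and last name (removing any commas from the names).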

1424324325  John    Doe
1424435463  Jane    Smith

I am able to extract the values from the XML.

SELECT uid, fn, ln
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln;

However, I am trying to remove the unwanted commas (if present) from the first and last names.

When I try to extract the first name using either of the approaches shown below, the result is empty.

LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/replace(text(),",","")')) fns as fn

LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/translate(text(),",","")')) fns as fn

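When I try it as shown below, replace fails with a complaint about an invalid function, while translate extracts the data without removing the extra commas.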

LATERAL VIEW explode(xpath(logs['users_updates'], replace('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn

LATERAL VIEW explode(xpath(logs['users_updates'], translate('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn

How can I extract the information without the commas in the name values?

1424324325  John    Doe
1424435463  Jane    Smith

Final solution: this is the final working query, following Jens's suggestion

SELECT uid, regexp_replace(fn,","," ") as fname, regexp_replace(ln,","," ") as lname
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln;

1 Answer:

Answer 0: (score: 1)

XPath 2.0 is not supported in Hive. This affects your problem in two ways:

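  • Function calls are not allowed in axis steps. While //value/translate(text(), ',', '') (which calls translate once for each <value/> element) is valid XPath 2.0, you cannot do this in XPath 1.0. On the other hand, translate(//value, ',', '') is valid XPath 1.0 but returns only a single string: XPath 1.0 converts the node-set to the string value of its first node, so you do not get one result per <value/> element.
  • There is no replace function in XPath 1.0.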
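
To illustrate the first point, here is a minimal sketch, assuming Hive's built-in xpath_string UDF and a Hive version that accepts SELECT without a FROM clause. The inline XML literal is made up for this example. The translate call wraps the whole path instead of appearing as a path step, which XPath 1.0 accepts, and it operates on the string value of the single selected node.

```sql
-- Hypothetical inline document; a real query would pass logs['users_updates'] instead.
-- translate() wraps the entire path (XPath 1.0 style) rather than being a path step.
SELECT xpath_string(
         '<user><attribs><field><name>first</name><value>Joh,n</value></field></attribs></user>',
         'translate(/user/attribs/field[name="first"]/value, ",", "")'
       ) AS fname;
```

Because each call yields one string, this only fits cases where the path selects a single node per row.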

It is probably easier to pass the values through with their commas and do the string manipulation in Hive.
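
A sketch of that approach, assuming the table, column, and field names used in the question, and assuming each row holds exactly one <user> element so that xpath_string (which returns the text of the first matching node) is sufficient:

```sql
-- Extract the raw values with XPath 1.0, then strip commas with Hive's regexp_replace.
-- Assumes one <user> per row; log_table and logs['users_updates'] come from the question.
SELECT
  xpath_string(logs['users_updates'], '/user/uid') AS uid,
  regexp_replace(
    xpath_string(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value'),
    ',', '') AS fname,
  regexp_replace(
    xpath_string(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value'),
    ',', '') AS lname
FROM log_table;
```

Using '' as the replacement removes the comma outright, whereas a replacement of ' ' would leave a space in its place. If a row could contain several users, the LATERAL VIEW explode(xpath(...)) form from the question would still be needed, with regexp_replace applied to the exploded columns.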

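Also be aware, since you do not have XPath 2.0: translate takes only a single string as its first argument. To process several values at once you would need string-join first, which is itself an XPath 2.0 function.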