I have an XML blob stored in a Hive log table, as shown below.
<user>
<uid>1424324325</uid>
<attribs>
<field>
...
</field>
<field>
<name>first</name>
<value>Joh,n</value>
</field>
<field>
...
</field>
<field>
<name>last</name>
<value>D,oe</value>
</field>
<field>
...
</field>
</attribs>
</user>
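Each row in the Hive table holds information about a different user, and I want to extract the uid, first name, and last name values (removing any commas from the names).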
1424324325 John Doe
1424435463 Jane Smith
I am able to extract the values from the XML:
SELECT uid, fn, ln
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln;
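However, I am also trying to remove the unwanted commas, if any are present, from the first and last names.
When I try to extract the first name using either of the approaches shown below, the result is empty.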
LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/replace(text(),",","")')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field[name = "first_name"]/value/translate(text(),",","")')) fns as fn
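When I apply the functions to the path string instead, as shown below, replace fails with an invalid function error, while translate extracts the data without removing the extra commas.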
LATERAL VIEW explode(xpath(logs['users_updates'], replace('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], translate('/subscriberUpdates/updates/field[name = "first_name"]/value/text()',",",""))) fns as fn
How can I extract the information without the commas in the name values?
1424324325 John Doe
1424435463 Jane Smith
Final solution: this is the final working query, following Jens's suggestion.
SELECT uid, regexp_replace(fn,",","") as fname, regexp_replace(ln,",","") as lname
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "first_name"]/value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[name = "last_name"]/value/text()')) lns as ln;
Answer 0 (score: 1)
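XPath 2.0 is not supported in Hive. This affects your problem in two ways:
First, //value/translate(text(), ',', '') (calling translate once for each <value/> element) is valid XPath 2.0, but you cannot do this in XPath 1.0. On the other hand, translate(//value, ',', '') is valid XPath 1.0 but returns one single string rather than one result per <value/> element (XPath 1.0 converts a node-set argument to the string-value of its first node).
Second, XPath 1.0 does not have a replace function.
It is probably easier to pass the values through with their commas and do the string manipulation in Hive.
Also note, for whenever XPath 2.0 does become available to you: translate accepts only a single string as its first argument, so you would need string-join first.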
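The division of labour suggested here (use the path expression only to select nodes, then strip commas with ordinary string functions) can be tried outside Hive. The following is a minimal sketch in Python, not Hive code: the names SAMPLE, field_value and extract_user are hypothetical, and the record is modelled on the sample XML in the question, using its field names "first" and "last".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record modelled on the XML blob in the question.
SAMPLE = """
<user>
  <uid>1424324325</uid>
  <attribs>
    <field><name>first</name><value>Joh,n</value></field>
    <field><name>last</name><value>D,oe</value></field>
  </attribs>
</user>
"""


def field_value(root, name):
    """Return the raw <value> text of the <field> whose <name> equals `name`."""
    node = root.find(f"./attribs/field[name='{name}']/value")
    return node.text if node is not None else None


def extract_user(xml_text):
    """Select nodes with path expressions only, then strip commas in host code."""
    root = ET.fromstring(xml_text)
    uid = root.findtext("uid")
    first = field_value(root, "first") or ""
    last = field_value(root, "last") or ""
    # The string cleanup happens after selection, not inside the path expression.
    return uid, first.replace(",", ""), last.replace(",", "")


print(extract_user(SAMPLE))
```

In Hive the same split would mean keeping the xpath(...) calls as plain selections and applying a string function such as regexp_replace to the exploded columns.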