Question

I started working with Hive to do some data preparation and ran into a peculiar problem when using the regexp_extract UDF. I am working with XML structures and am trying to extract some elements from an XML string. Here is an example. The string I am operating on is:

<b>ajsdnf</b>
<a>asdhf</a>
<a>alfnv</a>
<b>ajsdnf</b>
<a>test</a>

The regular expression (<a>.*?<\/a>) should extract every substring that consists of an a element. When I check my logic on regex101, it finds all the right groups.

However, when I run it against Hive like this

select regexp_extract('<b>ajsdnf</b><a>asdhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>','(<a>.*?<\/a>)',0) from some_table limit 1;

it only returns the first match, <a>asdhf</a>. According to the documentation of regexp_extract, it should return all occurrences if I pass the integer 0 as the third parameter. Is there any way I can achieve the following result?

<a>asdhf</a>
<a>alfnv</a>
<a>test</a>

And if you are wondering why I am not using XPath to deal with this XML problem: I have a much more complex structure and want to extract certain parts of the XML tree including all their child nodes. That is something the xpath UDFs of Hive cannot handle at the moment.

Answer 1

select regexp_replace('<b>ajsdnf</b><a>a<b>aksdhf</b>dhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>','(.*?)(<a>.*?<\/a>)(.*?)','$2') from some_table limit 1;

That does the trick. Thanks to nhahtdh for the suggestion.
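
To see why the replacement keeps only the a elements, here is a minimal Python sketch of the same pattern. It is an approximation of the Hive call, not the Hive call itself: Python's re module stands in for the Java regex engine behind Hive's regexp functions, the back-reference is written \2 instead of $2, and the variable names are purely illustrative.

```python
import re

# Same input string as in the query above.
xml = "<b>ajsdnf</b><a>a<b>aksdhf</b>dhf</a><a>alfnv</a><b>ajsdnf</b><a>test</a>"

# Group 1 lazily swallows everything before the next <a>...</a>,
# group 2 captures the element itself, and the lazy group 3 matches nothing.
pattern = re.compile(r"(.*?)(<a>.*?</a>)(.*?)")

# Replacing each match with group 2 drops the text in front of every element.
kept = pattern.sub(r"\2", xml)

# For comparison: findall returns every non-overlapping match as a list,
# which is the "all occurrences" behaviour the question was hoping for.
elements = re.findall(r"<a>.*?</a>", xml)

# Text after the last </a> is never matched, so it survives the replacement.
with_tail = pattern.sub(r"\2", xml + "<b>tail</b>")

print(kept)
print("\n".join(elements))
```

Two caveats with this approach: anything that follows the last </a> is left in the output (the example string happens to end with an a element), and . does not match line breaks by default, so an element that spans several lines would not be captured without the DOTALL flag.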
