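I am trying to extract text from a string, given a specific start pattern and end pattern.
I really don't know where to begin. I have looked around and tried to make sense of the regex functions, but I find them intimidating.
Table: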
+----+------------------------------------+
| id | sentence                           |
+----+------------------------------------+
| 1  | Hello, I am a bird.                |
| 2  | Hello, I am a cat. I like catfood. |
| 3  | Hello, I am a dog. I like bones.   |
+----+------------------------------------+
I am trying to extract the text between "Hello," and the first "." (period).
Output:
+-------------+
| sentence    |
+-------------+
| I am a bird |
| I am a cat  |
| I am a dog  |
+-------------+
Answer 0: (score: 2)
In Hive, try the regexp_extract(col, regexp, capture_group) function:
Hello,    // match the literal "Hello,"
([^.]*)   // then capture everything up to the first . (period) as the first group
Example:
hive> select regexp_extract(sentence, "Hello,([^.]*)", 1) sentence from (
      -- preparing sample data
      select stack(3, 'Hello, I am a bird.', 'Hello, I am a cat. I like catfood.', 'Hello, I am a dog. I like bones.')
      as (sentence)) t;
Result:
sentence
I am a bird
I am a cat
I am a dog
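
If you want to experiment with the pattern outside Hive, here is a minimal Python sketch that applies the same regular expression to the three sample sentences. It is only an illustration: the extract helper and the PATTERN_TRIMMED variant are my own hypothetical names, not part of Hive, and returning an empty string on no match is just a choice made for this sketch. One thing it makes visible is that the group in "Hello,([^.]*)" starts right after the comma, so the captured text begins with a space; the variant with \s* skips that whitespace.

```python
import re

# Same pattern as in the answer: the literal "Hello,", then capture
# everything up to (but not including) the first period.
PATTERN = re.compile(r"Hello,([^.]*)")

# Hypothetical variant (an assumption, not from the answer): \s* consumes
# the whitespace after the comma so the capture starts at the first word.
PATTERN_TRIMMED = re.compile(r"Hello,\s*([^.]*)")


def extract(pattern, sentence):
    """Return capture group 1, or an empty string when nothing matches."""
    match = pattern.search(sentence)
    return match.group(1) if match else ""


sentences = [
    "Hello, I am a bird.",
    "Hello, I am a cat. I like catfood.",
    "Hello, I am a dog. I like bones.",
]

for s in sentences:
    # repr() makes the leading space in the first result visible.
    print(repr(extract(PATTERN, s)), repr(extract(PATTERN_TRIMMED, s)))
```

The same \s* idea should carry over to the Hive query, though inside a Hive string literal the backslash would probably need to be doubled ('Hello,\\s*([^.]*)'); I have not run that form here, so treat it as an assumption. Wrapping the original call in trim() is another option.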