我希望能够从配置单元表中的域名中删除后缀。 因为有各种类型的后缀(.co.uk,.com,.ac.uk,...),我想知道最好的方法是什么?
答案 0 :(得分:1)
听起来您想要隔离域名的最后两个部分,可以使用regexp_extract
来完成。使用您发布的数据部分作为具体示例:
hive> DESCRIBE dns_data;
OK
fullname string
Time taken: 0.871 seconds, Fetched: 1 row(s)
hive> SELECT * FROM dns_data;
OK
a632.phobos.apple.com
a1554.phobos.apple.com
a1207.phobos.apple.com
a1877.phobos.apple.com
lnll.googlevideo.com
a8-19.clients.cdn13.com
r6---sn-aigllnzk.googlevideo.com
r1---sn-cg07lues.googlevideo.com
r4---sn-aiglln6k.googlevideo.com
r18---sn-aigllnel.googlevideo.com
r11---sn-aigllnly.googlevideo.com
r20---sn-aigllnsk.googlevideo.com
r2---sn-aigllnee.googlevideo.com
r14---sn-aigllnld.googlevideo.com
r19---sn-aigllnll.googlevideo.com
r14---sn-aigllnsd.googlevideo.com
Time taken: 0.589 seconds, Fetched: 16 row(s)
可以按以下方式使用 regexp_extract
来隔离最后两个字符串:
hive> SELECT
> stripped,
> suffix,
> split(suffix, "\\.") AS suffix_split
> FROM (SELECT
> regexp_extract(fullname, "^(.*?)\\.((?:[^.]+\\.?){2})$", 1) AS stripped,
> regexp_extract(fullname, "^(.*?)\\.((?:[^.]+\\.?){2})$", 2) AS suffix
> FROM dns_data
> ) t;
...
... Lots of MapReduce spam
...
OK
a632.phobos apple.com ["apple","com"]
a1554.phobos apple.com ["apple","com"]
a1207.phobos apple.com ["apple","com"]
a1877.phobos apple.com ["apple","com"]
lnll googlevideo.com ["googlevideo","com"]
a8-19.clients cdn13.com ["cdn13","com"]
r6---sn-aigllnzk googlevideo.com ["googlevideo","com"]
r1---sn-cg07lues googlevideo.com ["googlevideo","com"]
r4---sn-aiglln6k googlevideo.com ["googlevideo","com"]
r18---sn-aigllnel googlevideo.com ["googlevideo","com"]
r11---sn-aigllnly googlevideo.com ["googlevideo","com"]
r20---sn-aigllnsk googlevideo.com ["googlevideo","com"]
r2---sn-aigllnee googlevideo.com ["googlevideo","com"]
r14---sn-aigllnld googlevideo.com ["googlevideo","com"]
r19---sn-aigllnll googlevideo.com ["googlevideo","com"]
r14---sn-aigllnsd googlevideo.com ["googlevideo","com"]
Time taken: 22.073 seconds, Fetched: 16 row(s)