Getting a word count after splitting a URL into an array of words

Asked: 2019-06-14 12:22:15

Tags: sql arrays hive hiveql regexp-replace

I have a table with a list of URLs:

url

http://03cubsml.baseball.cbssports.com/stats/stats-main?selectedplayer=2122997
http://08flb.baseball.cbssports.com/scoring/standard
http://100-poems.com/poems/life/index2.htm
http://10000lakesrbl.baseball.cbssports.com/stats/stats-main
http://1000pictures.com/view.htm?cscenic/sunset+fnoy-2011-07-21-211010+a1112212325323435434553545885949hh9
http://05command.wikidot.com/tech-hub-tag-list
http://10000lakesrbl.baseball.cbssports.com/players/playerpage/2504134
http://1001goroskop.ru/gadanie/?kniga-sudeb
http://04spfbl.baseball.cbssports.com/standings/overall
http://05command.wikidot.com
http://05command.wikidot.com/tech-hub-tag
http://05fbl.baseball.cbssports.com/stats/stats-main
http://100-poems.com/poems/life/0464004.htm
http://10000islands.proboards.com/board/129/tito-headquarters
http://10000islands.proboards.com/thread/11959/tip-islands-party?page=477
http://10000islands.proboards.com/thread/14172/illustrious-house-improving-wordiness?page=82
http://1000pictures.com/view.htm?cscenic/sunset+feilat05-040+a1112212325323435434553545885949hh9
http://1001-rimes.com/listeperson.php?letter=%E9&start=30
http://1001-rimes.com/listeperson.php?letter=ques&start=30
http://1001goroskop.ru/?god

I currently use the following code to split each URL into a list of the words it contains:

CREATE TABLE url_keyword
(url STRING,
keywords ARRAY<STRING>);

-- INSERT ... SELECT takes no AS keyword before the SELECT
INSERT OVERWRITE TABLE url_keyword
SELECT url, split(lcase(parse_url(url,'PATH')),"[=/_%:|^$#@!&,?*_~+.`<>(){}' \-\;\" \\ \\[\\]{[0 -9]+ }]") AS keywords FROM url_table;
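To see what the two functions produce before the full delimiter class is applied, it can help to run them on a single literal. This is only a sketch: it assumes a Hive version that allows SELECT without a FROM clause, and it splits on '/' alone rather than on the pattern used above.

```sql
-- Sample URL taken from the list above; no table access is needed.
-- parse_url(..., 'PATH') extracts only the path part of the URL,
-- and split() turns that string into an array<string>.
SELECT
  parse_url('http://08flb.baseball.cbssports.com/scoring/standard', 'PATH') AS path,
  split(lcase(parse_url('http://08flb.baseball.cbssports.com/scoring/standard', 'PATH')), '/') AS tokens;
```

Because the path starts with a delimiter, the array is likely to begin with an empty string, which would explain the leading blank in each row of the keywords output.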

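The output I get contains the url and the keywords produced by the split (an array displayed with space-separated elements). Now I want to get the number of words generated for each URL, but whenever I try to run

regexp_replace(keywords,' ',',') 

to convert it into a comma-separated array, so that I can use the length function to get the word count, I get this error: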

Wrong arguments '','': No matching method for class org.apache.hadoop.hive.ql.udf.UDFRegExpReplace with (array, string, string). Possible choices: _FUNC_(string, string, string)

How can I get the word count in this case?

My keywords output looks like this:

 stats stats main
 scoring standard
 poems life index  htm
 stats stats main
 view htm
 tech hub tag list
 players playerpage        
 gadanie 
 standings overall

 tech hub tag
 stats stats main
 poems life         htm
 board     tito headquarters
 thread       tip islands party
 thread       illustrious house improving wordiness
 view htm
 listeperson php
 listeperson php
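The error message says that regexp_replace only accepts strings, while keywords is an array<string>. One possible approach, sketched here rather than confirmed against this data, is to skip the string conversion and count the array elements with Hive's built-in size() function. Since the output above contains empty elements (digits and punctuation are used as delimiters), the second query counts only non-empty words; it assumes each url appears once in url_keyword, and the aliases token_count, word_count, k and word are made up for illustration.

```sql
-- Number of array elements per URL, empty strings included.
SELECT url, size(keywords) AS token_count
FROM url_keyword;

-- Number of non-empty words per URL.
-- LATERAL VIEW OUTER keeps URLs whose array is empty or NULL,
-- so they show up with a count of 0 instead of disappearing.
SELECT url,
       sum(CASE WHEN word <> '' THEN 1 ELSE 0 END) AS word_count
FROM url_keyword
LATERAL VIEW OUTER explode(keywords) k AS word
GROUP BY url;
```

If url is not unique in the table, the GROUP BY merges the duplicate rows, so an extra key would be needed to keep them apart.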

0 answers:

No answers yet