How does the Hive sentences() function break up each sentence?

Time: 2017-01-04 15:19:31

Tags: hive bigdata

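Before posting, I tried the Hive sentences() function and did some searching, but could not get a clear understanding. My question: which delimiter does sentences() use to break the text into sentences? What does the Hive manual mean by "the appropriate sentence boundary"? Below is an example I tried, placing a period (.) and an exclamation mark (!) at the same point in the text. I got different output for each. Can someone explain?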

With a period (.)

select sentences('Tokenizes a string of natural language text into words and sentences. where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable

Output: 1 array

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences","where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]

With '!'

select sentences('Tokenizes a string of natural language text into words and sentences! where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable

Output: 2 arrays

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]

2 answers:

Answer 0 (score: 1)

Understanding what sentences() actually does should clear up your doubt.

Definition of sentences(str):

Splits str into an array of sentences, where each sentence is an array of words.

Example:

SELECT sentences('Hello there! I am a UDF.') FROM src LIMIT 1;

[ ["Hello", "there"], ["I", "am", "a", "UDF"] ]



SELECT sentences('review . language') FROM movies;

[["review","language"]]

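The exclamation mark is a punctuation mark that appears at the end of a sentence. Other examples of such punctuation are the period and the question mark, which also end sentences. However, according to the definition of sentences(), unnecessary punctuation, such as periods and commas in English, is stripped automatically. That is why the '!' version yields two arrays of words. The behavior is tied to the locale handling in java.util.Locale.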
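To make the two steps described above concrete (split into sentences, then keep only the words), here is a minimal sketch using the JDK's java.text.BreakIterator. It assumes that sentences() behaves like BreakIterator's sentence and word iterators. The class name SentencesSketch, the fixed English locale, and the choice to skip empty sentences are assumptions made for illustration, not Hive's actual source.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, NOT Hive's implementation: approximates sentences(str)
// with the JDK's BreakIterator under an assumed English locale.
final class SentencesSketch {

    // Splits text into sentences, each returned as a list of words.
    static List<List<String>> split(String text) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentenceIt.setText(text);
        int start = sentenceIt.first();
        for (int end = sentenceIt.next(); end != BreakIterator.DONE;
                start = end, end = sentenceIt.next()) {
            List<String> words = wordsOf(text.substring(start, end));
            // Assumption: a sentence with no words is skipped.
            if (!words.isEmpty()) {
                result.add(words);
            }
        }
        return result;
    }

    // Tokenizes one sentence and drops punctuation and whitespace tokens.
    static List<String> wordsOf(String sentence) {
        List<String> words = new ArrayList<>();
        BreakIterator wordIt = BreakIterator.getWordInstance(Locale.ENGLISH);
        wordIt.setText(sentence);
        int start = wordIt.first();
        for (int end = wordIt.next(); end != BreakIterator.DONE;
                start = end, end = wordIt.next()) {
            String token = sentence.substring(start, end);
            // Keep only tokens that begin with a letter or a digit.
            if (Character.isLetterOrDigit(token.charAt(0))) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("Hello there! I am a UDF."));
        System.out.println(split("review . language"));
    }
}
```

Under these assumptions, the two inputs should give the same grouping as the Hive examples above: two word arrays for the first string and a single one for the second, with the punctuation tokens dropped.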

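Answer 1 (score: 0)

I don't know the actual reason, but I observed that the split works if the period (.) is followed by a space and the next word starts with a capital letter. Here I changed "where" to "Where". With '!', however, this is not required.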

Tokenizes a string of natural language text into words and sentences. Where each sentence is broken at the appropriate sentence boundary and returned as an array of words.

The output is below:

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["Where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
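The observation about capitalization can be checked outside Hive with a small sketch. It assumes that the sentence boundaries come from the JDK's java.text.BreakIterator, which treats a period as ambiguous (it only ends a sentence when the next word does not start with a lowercase letter) but treats '!' as an unconditional terminator. The class name PeriodBoundarySketch and the shortened test strings are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical demo, not part of Hive: counts the sentences that the JDK's
// sentence BreakIterator finds in a string (English locale assumed).
final class PeriodBoundarySketch {

    static int countSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int count = 0;
        it.first();
        // Each boundary after the first one closes one sentence.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Period followed by a lowercase word.
        System.out.println(countSentences("into words and sentences. where each sentence is broken."));
        // Period followed by a capitalized word.
        System.out.println(countSentences("into words and sentences. Where each sentence is broken."));
        // Exclamation mark followed by a lowercase word.
        System.out.println(countSentences("into words and sentences! where each sentence is broken."));
    }
}
```

If the assumption holds, the first call should report one sentence and the other two should report two, which matches the outputs seen in the question and in this answer.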