Question

Is it possible to identify each word in a Postgres field containing text strings, and count how often each one occurs?

Answer 1

Something like this?

SELECT some_pk, 
       regexp_split_to_table(some_column, '\s') as word
FROM some_table

Getting the distinct words is then easy:

SELECT DISTINCT word
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t

Or getting the count for each word:

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, '\s') as word
  FROM some_table
) t
GROUP BY word
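
Splitting on '\s' alone keeps punctuation attached to words, treats "Hello" and "hello" as different words, and can produce empty strings. A minimal sketch of one way to tidy that up, assuming that lower-casing and splitting on runs of non-alphanumeric characters is acceptable for your data. The inline sample rows stand in for some_table and are purely illustrative:

```sql
-- Hypothetical sample rows standing in for some_table.
WITH some_table(some_pk, some_column) AS (
  VALUES (1, 'Hello world, hello Postgres'),
         (2, 'hello again')
)
SELECT word, count(*) AS occurrences
FROM (
  -- Lower-case the text, then split on runs of non-alphanumeric characters.
  SELECT regexp_split_to_table(lower(some_column), '[^[:alnum:]]+') AS word
  FROM some_table
) t
WHERE word <> ''   -- drop empty strings left by leading or trailing separators
GROUP BY word
ORDER BY occurrences DESC, word;
```

For the sample rows this should report hello with a count of 3 and again, postgres and world with a count of 1 each.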

Answer 2

You can also use the PostgreSQL text search functionality for this, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''hello dere hello hello ridiculous'')');

will yield:

  word   | ndoc | nentry 
---------+------+--------
 ridicul |    1 |      1
 hello   |    1 |      3
 dere    |    1 |      1
(3 rows)

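(PostgreSQL applies language-dependent stemming and stop-word removal, which may or may not be what you want. Stemming and stop-word removal can be disabled by using simple instead of the english dictionary; see below.)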
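
As a rough sketch of the simple variant, the same sample text can be run through to_tsvector with 'simple' passed as its first argument. The explicit column list and the ORDER BY are additions for readability, not part of the original answer:

```sql
-- Same sample text as above, but using the 'simple' configuration,
-- which lower-cases words without stemming them or removing stop words.
SELECT word, ndoc, nentry
FROM ts_stat('SELECT to_tsvector(''simple'', ''hello dere hello hello ridiculous'')')
ORDER BY nentry DESC, word;
```

With simple, the word ridiculous should come back unstemmed rather than as ridicul, and any stop words in the text would be counted as well.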

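The nested SELECT statement can be any select statement that yields a tsvector column, so you could substitute one that applies the to_tsvector function to any number of text fields and concatenates them into a single tsvector, over any subset of your documents, for example:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'',title) || to_tsvector(''english'',body) FROM my_documents WHERE id < 500') ORDER BY nentry DESC;

This would yield a matrix of total word counts taken from the title and body fields of the documents with id below 500, sorted by descending number of occurrences. For each word, you also get the number of documents it occurs in (the ndoc column).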

For more information, see the documentation: http://www.postgresql.org/docs/current/static/textsearch.html

Answer 3

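The text should be split on a space (' ') or another delimiter between words, not on 's', unless that is what you intend, e.g. treating 'myWordshere' as 'myWord' and 'here'.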

SELECT word, count(*)
FROM ( 
  SELECT regexp_split_to_table(some_column, ' ') as word
  FROM some_table
) t
GROUP BY word
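
To make the difference between these delimiters concrete, here is a small sketch using literal strings instead of a table. It assumes the default standard_conforming_strings = on, so that '\s+' reaches the regex engine with its backslash intact. The sample strings are made up for illustration:

```sql
-- Splitting on the letter 's' cuts words apart: yields 'myWord' and 'here'.
SELECT regexp_split_to_table('myWordshere', 's') AS word;

-- Splitting on a single space yields an empty string between the two
-- consecutive spaces: 'two', '', 'words'.
SELECT regexp_split_to_table('two  words', ' ') AS word;

-- Splitting on runs of whitespace handles repeated spaces and tabs:
-- 'two', 'words', 'more'.
SELECT regexp_split_to_table(E'two  words\tmore', '\s+') AS word;
```

If the column can contain tabs, newlines or repeated spaces, the whitespace pattern is probably the safer choice of the three.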
