Question

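I've found all sorts of n-gram implementations in Python, Perl, and so on, but I'd really like something in a bash script. I came across the "Missing textutils" version, but it only lists the n-grams; it doesn't count them by frequency, which is pretty vital to using n-grams, or at least to my use of them. I just want a basic list of results with their frequencies, like this: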

17 blue car
14 red car
5  and the
2  brown monkey
1  orange car

Does anyone have something like this that they could post? Thanks!

Answer 1

Here's a pure bash implementation. You'll need bash >= 4.2 with associative array support.

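#!/usr/bin/env bash
# Usage: ngram N < FILE
# Counts the N-grams found on each input line (N-grams do not span lines).

# N must be a positive integer; otherwise exit with status 1.
(( (n=${1:-0}) > 0 )) || exit 1

declare -A ngrams       # n-gram text -> number of occurrences

# read -a splits each line into words; the || test keeps a final line
# that has no trailing newline.
while read -ra line || ((${#line[@]})); do
        # Stop once fewer than n words remain, so no partial n-grams are counted.
        for ((i = 0; i + n <= ${#line[@]}; i++)); do
                key="${line[*]:i:n}"
                # Plain assignment keeps the input text out of arithmetic evaluation.
                ngrams[$key]=$(( ${ngrams[$key]:-0} + 1 ))
        done
done

for i in "${!ngrams[@]}"; do
        printf '%d\t%s\n' "${ngrams[$i]}" "$i"
done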

Save it as ngram and run it as ngram 2 < file.
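Bash does not iterate over an associative array in any guaranteed order, so the script's output comes out unsorted, while the question asked for the most frequent n-grams first. A minimal sketch of one way to get that ordering, assuming the "count, whitespace, n-gram" line format the script prints, is to pipe the result through sort. The name by_frequency is a hypothetical helper, not part of the answer.

```shell
#!/usr/bin/env bash
# by_frequency: hypothetical helper name, not part of the answer's script.
# Expects lines that start with a count, followed by whitespace and the n-gram.
by_frequency() {
  # -k1,1nr sorts on the first field numerically, highest count first.
  # Ties fall back to sort's ordinary whole-line comparison.
  LC_ALL=C sort -k1,1nr
}

# Stand-in for the script's output, so the sketch runs on its own.
printf '%s\t%s\n' 5 'and the' 17 'blue car' 14 'red car' | by_frequency
```

With the script saved as ngram, this would be used as ngram 2 < file | by_frequency, or simply ngram 2 < file | sort -k1,1nr.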

Answer 2

Yes, n-grams can be implemented in bash.

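# Usage: ngrams N < FILE
ngrams () {
  local N=$1
  local line
  set --                          # the positional parameters hold the sliding word window
  while read -r line; do
    set -f                        # disable globbing so a word like * stays literal
    set -- $* $line               # deliberately unquoted: append this line's words
    while [[ -n ${*:$N} ]]; do    # true while at least N words are buffered
      printf '%s\n' "${*:1:$N}"   # emit the first N words as one n-gram
      shift
    done
  done |
  sort | uniq -c
}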

$ ngrams 2
Here is some text, and here is
some more text, and here is yet
some more text
  1 Here is
  2 and here
  2 here is
  2 is some
  1 is yet
  1 more text
  1 more text,
  2 some more
  1 some text,
  2 text, and
  1 yet some

Note: the above is a function, not a script (perhaps this question helps, or there may be another, better one). Here is the script version:

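#!/bin/bash
# Usage: ngrams N < FILE
N=$1
set --                          # the positional parameters hold the sliding word window
while read -r line; do
  set -f                        # disable globbing so a word like * stays literal
  set -- $* $line               # deliberately unquoted: append this line's words
  while [[ -n ${*:$N} ]]; do    # true while at least N words are buffered
    printf '%s\n' "${*:1:$N}"   # emit the first N words as one n-gram
    shift
  done
done |
sort | uniq -c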
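The sample run in this answer counts "more text" and "more text," separately, and likewise "Here is" and "here is". If that is not the behavior you want, one option is to normalize the text before it reaches the n-gram counter. The following is a minimal sketch rather than part of the original answer: normalize is a hypothetical helper name, and the sketch assumes ASCII input, since tr implementations differ in how they treat multibyte characters.

```shell
#!/usr/bin/env bash
# normalize: hypothetical pre-processing filter, not from the original answer.
# Lowercases stdin and deletes punctuation so "Text," and "text" count as the same word.
normalize() {
  LC_ALL=C tr '[:upper:]' '[:lower:]' | LC_ALL=C tr -d '[:punct:]'
}

# Runs on its own and prints: here is some text and here is
printf '%s\n' 'Here is some text, and here is' | normalize
```

It would sit in front of either version, for example normalize < file | ngrams 2. Note that deleting every punctuation character also removes apostrophes and hyphens, which may or may not suit your text.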

Can ngrams be generated in bash?
