Question

I'm taking a string, tokenizing it, and want to look at the most common bigrams. Here is what I have:

#include <cstdint>

// ***Warning: note the unusual order of (exponent, base) for the parameters
// *** due to the default value for the base
template <unsigned long exponent, std::uintmax_t base=10>
struct pow_struct
{
private:
  // exponentiation by squaring: compute base^(exponent/2) once, then square it
  static constexpr std::uintmax_t at_half_pow=pow_struct<exponent / 2, base>::value;
public:
  static constexpr std::uintmax_t value=
      at_half_pow*at_half_pow*(exponent % 2 ? base : 1)
  ;
};

// not necessary, but will cut the recursion one step
template <std::uintmax_t base>
struct pow_struct<1, base>
{
  static constexpr std::uintmax_t value=base;
};

// base case: anything to the power 0 is 1
template <std::uintmax_t base>
struct pow_struct<0,base>
{
  static constexpr std::uintmax_t value=1;
};

If I do:

template <unsigned vmajor, unsigned vminor, unsigned build>
struct build_token {
  // member constants must be static; the last term uses the `build` parameter
  static constexpr std::uintmax_t value=
       vmajor*pow_struct<9>::value
     + vminor*pow_struct<6>::value
     + build
  ;
};

then it outputs the bigrams in sorted order.

import nltk
import collections
from nltk import ngrams

someString="this is some text. this is some more test. this is even more text."
tokens=nltk.word_tokenize(someString)
# lowercase and drop one-character tokens (e.g. punctuation)
tokens=[token.lower() for token in tokens if len(token)>1]

# count every pair of adjacent tokens
bigram=ngrams(tokens,2)
aCounter=collections.Counter(bigram)

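Iterating over the counter prints the elements, but not their counts, and not in order of count. I want to write a for loop that prints the X most common bigrams in the text.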

I'm basically trying to learn Python and nltk at the same time, which is why I'm struggling here (with what I assume is a trivial thing).

Answer 1

You are probably looking for something that already exists, namely the most_common method on counters. From the documentation:

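Return a list of the n most common elements and their counts, from the most common to the least. If n is omitted or None, most_common() returns all elements in the counter. Elements with equal counts are ordered arbitrarily.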
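As a quick illustration of the omitted-n case, here is a minimal sketch. The word list is made up for the example and has no tied counts, so the order is unambiguous.

```python
from collections import Counter

# Made-up sample data for illustration only.
words = ["b", "a", "b", "c", "a", "b"]
counts = Counter(words)

# With no argument (or None), every element is returned, most common first.
all_pairs = counts.most_common()
print(all_pairs)  # [('b', 3), ('a', 2), ('c', 1)]

# Passing None explicitly behaves the same way.
same_pairs = counts.most_common(None)
```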

You can call it with a value n to get the n most common value-count pairs. For example:

from collections import Counter

# initialize with silly value.
c = Counter('aabbbccccdddeeeeefffffffghhhhiiiiiii')

# Print 4 most common values and their respective count.
for val, count in c.most_common(4):
    print("Value {0} -> Count {1}".format(val, count))

This prints (c and h are tied at a count of 4, so which of them appears last may vary):

Value f -> Count 7
Value i -> Count 7
Value e -> Count 5
Value h -> Count 4
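Applied to the bigrams from the question, the same call might look like the sketch below. To keep it runnable without nltk, it assumes a plain whitespace split in place of nltk.word_tokenize and zip in place of ngrams; the names text, bigram_counts and top_n are illustrative, not from the original post.

```python
from collections import Counter

text = "this is some text. this is some more test. this is even more text."

# Assumption: a simple stand-in for nltk.word_tokenize. Split on whitespace,
# strip periods, lowercase, and keep only tokens longer than one character.
tokens = [word.strip(".").lower() for word in text.split()]
tokens = [token for token in tokens if len(token) > 1]

# Pair each token with its successor to form bigrams as tuples.
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Print the top_n most common bigrams together with their counts.
top_n = 2
for bigram, count in bigram_counts.most_common(top_n):
    print("{0} -> {1}".format(" ".join(bigram), count))
```

With the nltk version from the question, the loop would be the same: iterate over aCounter.most_common(X) and unpack each (bigram, count) pair.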

Accessing the elements of a Counter containing ngrams
