Question

I want unordered bigrams, e.g. for "The cat sat on the mat":

[("cat","the"),("cat","sat"),("on","sat"),("on","the"),("mat","the")]

Each bigram is sorted alphabetically. This means that a phrase containing the same two words in either order gives, for example, [("house", "to"),("house","to")], which gives these bigrams a higher frequency while minimising the search space.

I can get the above using:
unordered_bigrams = [tuple(sorted(n)) for n in list(nltk.bigrams(words))]
but I would now like a "bag of words" type vector of these.

I already have ordered bigram feature vectors, using:
o_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

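So I would like the same for my unordered bigrams... I am struggling to find an option in CountVectorizer that gives me this kind of processing (I have looked at vocabulary and preprocessor without much luck).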

Answer 1

If all you need is every possible pair of words from an unordered collection of words, you don't really need a bigram generator:

>>> from itertools import permutations
>>> words = set("the cat sat on the mat".split())
>>> list(permutations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'on'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'on'), ('sat', 'the'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'on'), ('mat', 'the'), ('mat', 'sat'), ('mat', 'cat'), ('cat', 'on'), ('cat', 'the'), ('cat', 'sat'), ('cat', 'mat')]

Or, if you don't want duplicate tuples that contain the same words in a different order:

>>> from itertools import combinations
>>> list(combinations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'cat')]
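
Iterating over a set has no guaranteed order, so the tuples above may come out differently on another run. A minimal sketch, assuming you want each pair in alphabetical order as in the question, is to sort the words before combining them:

```python
from itertools import combinations

words = set("the cat sat on the mat".split())

# Sorting first makes the output deterministic. Because combinations()
# preserves the input order, each pair is itself alphabetically ordered.
pairs = list(combinations(sorted(words), 2))
print(pairs)
```

Note that this still produces every pair of distinct words, not only the adjacent pairs that nltk.bigrams would give.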

There is a good answer on product, combination and permutation at https://stackoverflow.com/a/942551/610569
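
For the vectorizer part of the question, one possible approach relies on the fact that CountVectorizer's analyzer parameter accepts a callable. The sketch below is an illustration, not a tested recipe: the function name unordered_bigram_analyzer and the whitespace tokenization are assumptions, and a Counter stands in for the bag-of-words count so the block runs without scikit-learn.

```python
from collections import Counter


def unordered_bigram_analyzer(text):
    """Hypothetical helper: adjacent word pairs, each sorted alphabetically."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenization
    # Join each sorted pair into one string so it can serve as a feature name.
    return [" ".join(sorted(pair)) for pair in zip(tokens, tokens[1:])]


features = unordered_bigram_analyzer("The cat sat on the mat")
counts = Counter(features)  # stand-in for the bag-of-words counts
print(counts)
```

Passing the function as CountVectorizer(analyzer=unordered_bigram_analyzer) should then count these features. A callable analyzer replaces the built-in preprocessing and tokenization, which is why the lowercasing happens inside the function.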

Is it possible to use unordered bigrams in CountVectorizer?
