Is it possible to use unordered bigrams in CountVectorizer?

Asked: 2017-03-06 15:56:17

Tags: python python-3.x scikit-learn nltk

I want unordered bigrams, e.g. for "The cat sat on the mat":

[("cat","the"),("cat","sat"),("on","sat"),("on","the"),("mat","the")]

Each bigram is sorted alphabetically. This means that, for example, "to house to" will give [("house", "to"),("house","to")], which gives these bigrams a higher frequency while minimising the search space.

I can get the above using:
unordered_bigrams = [tuple(sorted(n)) for n in list(nltk.bigrams(words))]
But I would now like a "bag-of-words" type vector of these.

I already have ordered bigram feature vectors, using:
o_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

I would like the same for my unordered bigrams... I'm struggling to find an option in CountVectorizer that gives me this kind of processing (I have looked at vocabulary and preprocessor without much luck).

1 Answer:

Answer 0: (score: 1)

If what you need is every possible pair of words from an unordered collection of words, you don't really need a bigram generator:

>>> from itertools import permutations
>>> words = set("the cat sat on the mat".split())
>>> list(permutations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'on'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'on'), ('sat', 'the'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'on'), ('mat', 'the'), ('mat', 'sat'), ('mat', 'cat'), ('cat', 'on'), ('cat', 'the'), ('cat', 'sat'), ('cat', 'mat')]

Or, if you don't want duplicate tuples that contain the same words in a different order:

>>> from itertools import combinations
>>> list(combinations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'cat')]
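
Note that both calls above pair every word with every other word, not just neighbouring words. If only adjacent pairs are wanted, as in the question, a minimal sketch using just the standard library could look like this. The helper name `unordered_bigram_counts` is made up for illustration, and the lowercase whitespace tokenization is an assumption:

```python
from collections import Counter


def unordered_bigram_counts(text):
    """Count adjacent word pairs, ignoring the order inside each pair."""
    # Assumption: lowercase + whitespace split is good enough as a tokenizer.
    tokens = text.lower().split()
    # zip(tokens, tokens[1:]) yields adjacent pairs; sorting each pair
    # makes ("the", "cat") and ("cat", "the") count as the same bigram.
    return Counter(tuple(sorted(pair)) for pair in zip(tokens, tokens[1:]))


counts = unordered_bigram_counts("The cat sat on the mat")
repeated = unordered_bigram_counts("to house to")
```

Here `counts` holds the five sorted pairs listed in the question, and `repeated` counts ("house", "to") twice.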

There is a nice answer on product, combinations and permutations at https://stackoverflow.com/a/942551/610569
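
As for getting a bag-of-words style matrix out of CountVectorizer itself: one option worth trying is its `analyzer` parameter, which accepts a callable that turns a raw document into a list of features. A rough sketch, assuming simple lowercase whitespace tokenization (the helper `unordered_bigram_analyzer` is a made-up name, not part of scikit-learn):

```python
from sklearn.feature_extraction.text import CountVectorizer


def unordered_bigram_analyzer(doc):
    """Hypothetical analyzer: adjacent word pairs, each sorted alphabetically."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokens = doc.lower().split()
    # Join each sorted pair into one string so it can act as a feature name.
    return [" ".join(sorted(pair)) for pair in zip(tokens, tokens[1:])]


docs = ["The cat sat on the mat", "to house to"]
u_bigram_vectorizer = CountVectorizer(analyzer=unordered_bigram_analyzer)
X = u_bigram_vectorizer.fit_transform(docs)  # sparse document-term matrix
vocab = u_bigram_vectorizer.vocabulary_      # feature string -> column index
```

With a callable analyzer, the built-in preprocessor, tokenizer and `ngram_range` are not applied, so the function has to do that work itself. In this sketch the column for "house to" should hold a count of 2 for the second document.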