I want unordered bigrams, e.g. for "the cat sat on the mat":
[("cat","the"),("cat","sat"),("on","sat"),("on","the"),("mat","the")]
Each bigram is sorted alphabetically. This means that, for example, "to house to" will give [("house", "to"), ("house", "to")], which gives these bigrams a higher frequency while minimising the search space.
I can get the above using:
unordered_bigrams = [tuple(sorted(n)) for n in list(nltk.bigrams(words))]
But I would now like a "bag-of-words" type vector of these.
I already have ordered bigram feature vectors, using:
o_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
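So I would like the same for my unordered bigrams... I am struggling to find an option in CountVectorizer that gives me this kind of processing (I have looked at vocabulary and preprocessor without much luck).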
Answer 0 (score: 1)
If all you need is the possible pairs of words given an unordered list of words, you don't really need a bigram generator:
>>> from itertools import permutations
>>> words = set("the cat sat on the mat".split())
>>> list(permutations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'on'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'on'), ('sat', 'the'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'on'), ('mat', 'the'), ('mat', 'sat'), ('mat', 'cat'), ('cat', 'on'), ('cat', 'the'), ('cat', 'sat'), ('cat', 'mat')]
Or, if you don't want duplicate tuples that contain the same words in a different order:
>>> from itertools import combinations
>>> list(combinations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'cat')]
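Note that iterating over a set gives no guaranteed order, so the output above can differ between runs. A minimal sketch, assuming you also want each pair in alphabetical order as in the question, is to sort the words before combining them:

```python
from itertools import combinations

words = set("the cat sat on the mat".split())

# Sorting the vocabulary first makes the result reproducible, and
# combinations() then emits every pair with its words in alphabetical order.
pairs = list(combinations(sorted(words), 2))
print(pairs)
```

This still pairs every word with every other word, not just neighbouring words, so it is a superset of what a bigram generator would return.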
There is a good answer on product, combination and permutation at https://stackoverflow.com/a/942551/610569
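If what you are after is a count of adjacent word pairs regardless of order, as described in the question, the following is a rough sketch using only the standard library. The function name `unordered_bigrams` and the whitespace tokenisation are assumptions made for illustration, not part of NLTK or scikit-learn:

```python
from collections import Counter

def unordered_bigrams(text):
    """Return adjacent word pairs, each pair sorted alphabetically."""
    # Assumption: simple lowercase + whitespace tokenisation.
    words = text.lower().split()
    # zip(words, words[1:]) yields each word with its right-hand neighbour.
    return [tuple(sorted(pair)) for pair in zip(words, words[1:])]

# Both ("to", "house") and ("house", "to") collapse to ("house", "to").
counts = Counter(unordered_bigrams("to house to"))
print(counts)
```

CountVectorizer's `analyzer` parameter accepts a callable, so a function along these lines could probably be passed as `CountVectorizer(analyzer=unordered_bigrams)`. I have not tested that here, and returning strings such as `" ".join(sorted(pair))` instead of tuples may be the safer choice for feature names.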