I am trying to plot the position and frequency of n-gram repetitions in a text. The idea is to identify the points in a text at which an author begins to reuse terms. Some kinds of uniqueness should run shorter than others.
Words 1 ... n are placed on the X axis. Whenever the recurrence frequency of an n-gram becomes > 1, a point appears in the plot, where X is its position, Y is its frequency, and the colour identifies the unique n-gram. From the code below, the 2-gram "good sport" would be plotted as (7, 2, RED).
Question: how do I create np.arrays of 1. the unique n-grams, 2. their frequencies, and 3. their positions in the text?
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import re
import numpy as np
import matplotlib.pyplot as plt

words = "good day, good night, good sport, good sport charlie"
clean = re.sub(r"[^\w\d'\s]+", '', words)  # strip punctuation
vectorizer2 = CountVectorizer(ngram_range=(2, 2), tokenizer=word_tokenize, stop_words='english')
analyzer = vectorizer2.build_analyzer()
two_grams = analyzer(clean)
# Get the set of unique 2-grams.
uniques = []
for word in two_grams:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0                 # Initialize the count to zero.
    for word in two_grams:    # Iterate over the 2-grams.
        if word == unique:    # Is this one equal to the current unique?
            count += 1        # If so, increment the count.
    counts.append((count, unique))

counts.sort()     # Sorting the list puts the lowest counts first.
counts.reverse()  # Reverse it, putting the highest counts first.

# Print the ten 2-grams with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))

# Scatterplot
# plt.scatter(count, count, s=area, c=colors, alpha=0.5)
# plt.show()
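
Roughly what I am after, as a sketch (treating the word index at which a 2-gram starts as its "position"):

# Word-index position(s) of every occurrence of each unique 2-gram.
positions = {g: [i for i, t in enumerate(two_grams) if t == g] for g in uniques}

unique_arr = np.array(uniques)                              # 1. unique 2-grams
freq_arr = np.array([len(positions[g]) for g in uniques])   # 2. frequency of each
first_pos = np.array([positions[g][0] for g in uniques])    # 3. first position of each

positions stays a dict because each 2-gram can occur a different number of times.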
Answer 0 (score: 0)
I think it is better to start with a simple, well-thought-out algorithm before you try scikit-learn.
For your n-gram, you can iterate over the characters of the string until the first character matches, and then check whether the whole n-gram starts there:
ngram = "good sport"
words = "good day, good night, good sport, good sport charlie"
loc = []
k = 0
while k < len(words):
    if words[k] != ngram[0]:
        k += 1
    elif words[k: k + len(ngram)] == ngram:
        loc += [k]
        k += 1
    else:
        k += 1
This returns loc = [22, 34]. Any list can be turned into an array, e.g. L = np.array(loc).
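
For example, continuing from the loop above:

import numpy as np

L = np.array(loc)    # array([22, 34])
frequency = L.size   # "good sport" occurs twice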
No claim that this is efficient or failure-proof, but it is better to understand what you are encoding before reaching for scikit-learn.
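
Not something the answer claims, but a rough sketch of how the same idea could feed the plot described in the question. Word positions are 1-based here so that the repeat of "good sport" lands at (7, 2), matching the question's example; the variable names and the choice of word-level positions are my own assumptions.

import re
import numpy as np
import matplotlib.pyplot as plt

words = "good day, good night, good sport, good sport charlie"
tokens = re.sub(r"[^\w\d'\s]+", '', words).split()

# Starting word position (1-based) of every occurrence of each 2-gram.
grams = {}
for i in range(len(tokens) - 1):
    grams.setdefault(' '.join(tokens[i:i + 2]), []).append(i + 1)

# Plot only the repeats: X = word position where the n-gram recurs,
# Y = how many times it has been seen so far, one colour per n-gram.
for ngram, positions in grams.items():
    if len(positions) > 1:
        xs = np.array(positions[1:])
        ys = np.arange(2, len(positions) + 1)
        plt.scatter(xs, ys, label=ngram, alpha=0.7)

plt.xlabel("word position of repeat")
plt.ylabel("cumulative frequency")
plt.legend()
plt.show()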