I am trying to plot the position and frequency of n-gram repetitions in a text. The idea is to identify the points in a text at which an author begins to reuse terms. Some kinds of uniqueness should run shorter than others.
Words 1 ... n are placed on the X axis. Whenever the recurrence frequency of an n-gram becomes > 1, a point appears in the plot, where X is its position, Y is its frequency, and the colour identifies the unique n-gram. From the code below, the 2-gram "good sport" would be plotted as (7, 2, RED).
Question: how do I create np.arrays of 1. the unique n-grams, 2. their frequencies, and 3. their positions in the text?
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import re
import numpy as np
import matplotlib.pyplot as plt

words = "good day, good night, good sport, good sport charlie"
clean = re.sub(r"[^\w\d'\s]+", '', words)  # strip punctuation
vectorizer2 = CountVectorizer(ngram_range=(2, 2), tokenizer=word_tokenize, stop_words='english')
analyzer = vectorizer2.build_analyzer()
two_grams = analyzer(clean)
# Get the set of unique 2-grams.
uniques = []
for word in two_grams:
    if word not in uniques:
        uniques.append(word)

# Make a list of (count, unique) tuples.
counts = []
for unique in uniques:
    count = 0                 # Initialize the count to zero.
    for word in two_grams:    # Iterate over the 2-grams.
        if word == unique:    # Is this one equal to the current unique?
            count += 1        # If so, increment the count.
    counts.append((count, unique))

counts.sort()     # Sorting the list puts the lowest counts first.
counts.reverse()  # Reverse it, putting the highest counts first.

# Print the ten 2-grams with the highest counts.
for i in range(min(10, len(counts))):
    count, word = counts[i]
    print('%s %d' % (word, count))

# Scatterplot
# plt.scatter(count, count, s=area, c=colors, alpha=0.5)
# plt.show()
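
Roughly what I am after, as a sketch (treating the word index at which a 2-gram starts as its "position"):

# Word-index position(s) of every occurrence of each unique 2-gram.
positions = {g: [i for i, t in enumerate(two_grams) if t == g] for g in uniques}

unique_arr = np.array(uniques)                              # 1. unique 2-grams
freq_arr = np.array([len(positions[g]) for g in uniques])   # 2. frequency of each
first_pos = np.array([positions[g][0] for g in uniques])    # 3. first position of each

positions stays a dict because each 2-gram can occur a different number of times.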
Answer 0 (score: 0)
I think it is better to start with a simple, well-thought-out algorithm before you try scikit-learn.
For your n-gram, you can iterate over the characters of the string until the first character matches, and then check whether the whole n-gram starts there:
ngram = "good sport"
words = "good day, good night, good sport, good sport charlie"
loc = []
k = 0
while k < len(words):
    if words[k] != ngram[0]:
        k += 1
    elif words[k: k + len(ngram)] == ngram:
        loc += [k]
        k += 1
    else:
        k += 1
This returns loc = [22, 34]. Any list can be turned into an array, e.g. L = np.array(loc).
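
For example, continuing from the loop above:

import numpy as np

L = np.array(loc)    # array([22, 34])
frequency = L.size   # "good sport" occurs twice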
No claim that this is efficient or failure-proof, but it is better to understand what you are encoding before reaching for scikit-learn.
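
Not something the answer claims, but a rough sketch of how the same idea could feed the plot described in the question. Word positions are 1-based here so that the repeat of "good sport" lands at (7, 2), matching the question's example; the variable names and the choice of word-level positions are my own assumptions.

import re
import numpy as np
import matplotlib.pyplot as plt

words = "good day, good night, good sport, good sport charlie"
tokens = re.sub(r"[^\w\d'\s]+", '', words).split()

# Starting word position (1-based) of every occurrence of each 2-gram.
grams = {}
for i in range(len(tokens) - 1):
    grams.setdefault(' '.join(tokens[i:i + 2]), []).append(i + 1)

# Plot only the repeats: X = word position where the n-gram recurs,
# Y = how many times it has been seen so far, one colour per n-gram.
for ngram, positions in grams.items():
    if len(positions) > 1:
        xs = np.array(positions[1:])
        ys = np.arange(2, len(positions) + 1)
        plt.scatter(xs, ys, label=ngram, alpha=0.7)

plt.xlabel("word position of repeat")
plt.ylabel("cumulative frequency")
plt.legend()
plt.show()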