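To analyze a text, we convert it into a list of words P1. We then apply a bigram method and obtain a list X of word pairs (ai, bi) such that ai and bi occur one after another in P1 many times. How can I, in Python 3, build a list P2 from P1 such that whenever two items ai and bi appear one after another in P1 and (ai, bi) is in X, they are replaced by a single element ai_bi? My ultimate goal is to prepare the text as a list of words for analysis with Word2Vec. I have my own code and it works, but I think it will be slow on large texts.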
import nltk
from nltk.collocations import *
import re
import gensim
bigram_measures = nltk.collocations.BigramAssocMeasures()
sentences=["Total internal reflection ! is the;phenomenon",
"Abrasive flow machining :is an ? ( interior surface finishing process)",
"Technical Data[of Electrical Discharge wire cutting and] Cutting Machine",
"The greenhouse effect. is the process by which, radiation from a {planet atmosphere warms }the planet surface",
"Absolute zero!is the lowest limit ;of the thermodynamic temperature scale:",
"The term greenhouse effect ?is mentioned (a lot)",
"[An interesting] effect known as total internal reflection.",
"effect on impact energies ,Electrical discharge wire cutting of ADI",
"{Absolute zero represents} the coldest possible temperature",
"total internal reflection at an air water interface",
"What is Electrical Discharge wire cutting Machining and how does it work",
"Colder than Absolute Zero",
"A Mathematical Model for Electrical Discharge Wire Cutting Machine Parameters"]
P1=[]
for f in sentences:
    # tokenize the sentence into lowercase words
    f1=gensim.utils.simple_preprocess(f.lower())
    P1.extend(f1)
print("First 100 items from P1")
print(P1[:100])
# bigram
finder = BigramCollocationFinder.from_words(P1)
# filter only bigrams that appear 2+ times
finder.apply_freq_filter(2)
# return up to 10000 bigrams, ranked by PMI (highest first)
X=finder.nbest(bigram_measures.pmi, 10000)
print()
print("Number of bigrams= ",len(X))
print("10 first bigrams with the highest PMI")
print(X[:10])
# replace each adjacent pair ai, bi in P1 with "ai_bi" when (ai, bi) is in X
P2=[]
n=len(P1)
i=0
while i<n:
    P2.append(P1[i])
    if i<n-1: # a following item exists (n-1, so the last pair is checked too)
        for c in X:
            if c[0]==P1[i] and c[1]==P1[i+1]:
                P2[len(P2)-1]=c[0]+"_"+c[1]
                i+=1 # skip second item of couple from X
                break
    i+=1
print()
print( "first 50 items from P2 - results")
print(P2[:50])
Answer 0 (score: 2)
I think you are looking for something like this.
P2 = []
prev = P1[0]
for this in P1[1:]:
    P2.append(prev + "_" + this)
    prev = this
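This implements a simple sliding window in which the previous token is joined to the current token. Note that it joins every adjacent pair in P1, whether or not the pair appears in X.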
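If only the pairs listed in X should be merged, as the question describes, one possible approach is a single pass over the tokens with a set lookup, which avoids scanning the whole list X for every token. The sketch below is only an illustration under that assumption; the function name `merge_bigrams` and the `sep` parameter are hypothetical and do not come from either post.

```python
def merge_bigrams(tokens, pairs, sep="_"):
    """Join adjacent tokens whose (first, second) pair is in `pairs`.

    Hypothetical helper: `tokens` plays the role of P1 and `pairs`
    the role of X from the question.
    """
    pair_set = set(pairs)  # constant-time membership test on average
    merged = []
    i = 0
    n = len(tokens)
    while i < n:
        if i + 1 < n and (tokens[i], tokens[i + 1]) in pair_set:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # skip the second token of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Small made-up example, not the data from the question
tokens = ["the", "greenhouse", "effect", "is", "absolute", "zero"]
pairs = [("greenhouse", "effect"), ("absolute", "zero")]
print(merge_bigrams(tokens, pairs))
# ['the', 'greenhouse_effect', 'is', 'absolute_zero']
```

With the names from the question this would be called as `P2 = merge_bigrams(P1, X)`. Like the loop in the question, it merges greedily from left to right, so a token that was already merged into one pair is not reused for the next one.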