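I have a list of sentences:
text = ['cant railway station','citadel hotel',' police stn']
I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get pairs of sentences instead of pairs of words. Here is what I did: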
text2 = [[word for word in line.split()] for line in text]
bigrams = nltk.bigrams(text2)
print(bigrams)
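This produces:
[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]
'cant railway station' and 'citadel hotel' should not form a bigram with each other. What I want is:
[([cant],[railway]),([railway],[station]),([citadel],[hotel]), and so on...
The last word of the first sentence should not be paired with the first word of the second sentence. What should I do to make this work?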
Answer 0 (score: 33)
>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
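One caveat when applying this to the list from the question: split(" ") keeps empty strings when a sentence has leading or repeated spaces, so ' police stn' would produce a bigram that starts with ''. Below is a minimal sketch of the same zip idea using split() with no argument. The helper name sentence_bigrams is made up for illustration and is not part of any library.

```python
def sentence_bigrams(sentences):
    """Return word bigrams for each sentence, never crossing sentence boundaries."""
    pairs = []
    for sentence in sentences:
        # split() with no argument drops leading/trailing/repeated whitespace
        words = sentence.split()
        # zip stops at the shorter input, so words[:-1] is not required
        pairs.extend(zip(words, words[1:]))
    return pairs


text = ['cant railway station', 'citadel hotel', ' police stn']
bigrams = sentence_bigrams(text)
print(bigrams)
```

A sentence with fewer than two words simply contributes no bigrams.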
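Answer 1 (score: 6)
Rather than turning your text into lists of strings, start with each sentence separately as a string. I've also removed punctuation and stopwords; just remove those parts if they're irrelevant to you: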
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result
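To use it, do the following:
for line in sentence:
    features = get_bigrams(line)
    # train set here
Note that this goes a step further and actually scores the bigrams statistically (which comes in handy when training a model).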
Answer 2 (score: 4)
Without nltk:
ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr)-1):
        ans.append([[arr[i]], [arr[i+1]]])
print(ans) #prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]
Answer 3 (score: 3)
from nltk import word_tokenize
from nltk.util import ngrams
text = ['cant railway station', 'citadel hotel', 'police stn']
for line in text:
    token = word_tokenize(line)
    # the '2' means bigrams; change it to get n-grams of a different size
    bigram = list(ngrams(token, 2))
    print(bigram)
Answer 4 (score: 1)
>>> text = ['cant railway station','citadel hotel',' police stn']
>>> bigrams = [(ele, tex.split()[i+1]) for tex in text for i,ele in enumerate(tex.split()) if i < len(tex.split())-1]
>>> bigrams
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
This uses enumerate and split.
Answer 5 (score: 1)
Just fixing Dan's code:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)
    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)
    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result
Answer 6 (score: 1)
The best way is to use the zip function to generate n-grams, where the 2 passed to range is the number of grams:
test = [1,2,3,4,5,6,7,8,9]
print(test[0:])
print(test[1:])
print(list(zip(test[0:],test[1:])))
%timeit list(zip(*[test[i:] for i in range(2)]))
Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9]
[2, 3, 4, 5, 6, 7, 8, 9]
[(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)]
1000000 loops, best of 3: 1.34 µs per loop
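The timed expression generalizes directly to any n and can be applied per sentence so that n-grams never span two sentences. A minimal sketch follows; the function name zip_ngrams is hypothetical and chosen only for this example.

```python
def zip_ngrams(tokens, n=2):
    """Return all n-grams of `tokens` as a list of tuples."""
    # Build n views of the sequence, each shifted by one more position;
    # zip stops at the shortest view, which yields exactly the n-grams.
    return list(zip(*(tokens[i:] for i in range(n))))


test = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(zip_ngrams(test, 2)[:3])  # first three bigrams
print(zip_ngrams(test, 3)[:2])  # first two trigrams

# Applied to the question's data, one sentence at a time
text = ['cant railway station', 'citadel hotel', ' police stn']
per_sentence = [gram for line in text for gram in zip_ngrams(line.split(), 2)]
print(per_sentence)
```

If a sequence is shorter than n, the result is an empty list.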
Answer 7 (score: 0)
import pandas as pd
import nltk as nk  # assumed alias: the snippet uses `nk` without showing its import
# Load the data and derive the month from the date column
df = pd.read_csv('dataset.csv', skiprows = 6, index_col = "No")
df["Month"] = df["Date(ET)"].apply(lambda x : x.split('/')[0])
# Concatenate the text per month, split it into words, then build and count bigrams
tokens = df.groupby("Month")["Contents"].sum().apply(lambda x : x.split(' '))
bigrams = tokens.apply(lambda x : list(nk.ngrams(x, 2)))
count_bigrams = bigrams.apply(lambda x : list(x.count(item) for item in x))
month1 = pd.DataFrame(data = count_bigrams[0], index= bigrams[0], columns= ["Count"])
month2 = pd.DataFrame(data = count_bigrams[1], index= bigrams[1], columns= ["Count"])
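The counting step above calls x.count(item) once per bigram, which rescans the whole list each time and produces one row per occurrence rather than one per distinct bigram. A rough sketch of the same counting idea with collections.Counter from the standard library is shown below. The dictionary contents_by_month is invented stand-in data, not the contents of dataset.csv, and count_bigrams is a hypothetical helper name.

```python
from collections import Counter


def count_bigrams(words):
    """Count how often each adjacent word pair occurs in a list of words."""
    return Counter(zip(words, words[1:]))


# Invented stand-in for the concatenated text of each month
contents_by_month = {
    "01": "the cat sat on the cat mat",
    "02": "the dog sat",
}

counts = {
    month: count_bigrams(body.split())
    for month, body in contents_by_month.items()
}
print(counts["01"].most_common(1))
```

Each Counter maps a distinct bigram to its frequency, so it can be turned into a table with one row per bigram if needed.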
Answer 8 (score: 0)
There are a number of ways to solve this, but I solved it this way:
>>> from nltk import bigrams
>>> text = ['cant railway station','citadel hotel',' police stn']
>>> text2 = [[word for word in line.split()] for line in text]
>>> text2
[['cant', 'railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]
>>> output = []
>>> for i in range(len(text2)):
...     output = output + list(bigrams(text2[i]))
...
>>> # a list comprehension would also work here
>>> output
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
Answer 9 (score: 0)
I think the best and most general way to do it is the following:
n = 2
ngrams = []
for l in L:
    for i in range(n,len(l)+1):
        ngrams.append(l[i-n:i])
Or in other words:
ngrams = [ l[i-n:i] for l in L for i in range(n,len(l)+1) ]
This should work for any n and any sequence l. If there are no n-grams of length n, it returns an empty list.
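The snippet above leaves L undefined. Here is a small runnable sketch that assumes L is a list of already-tokenized sentences; the wrapper name slice_ngrams is hypothetical. Note that slicing a list gives each n-gram as a list, not a tuple.

```python
def slice_ngrams(sequences, n=2):
    """Collect n-grams from each sequence separately using slicing."""
    return [seq[i - n:i] for seq in sequences for i in range(n, len(seq) + 1)]


# Assumption: L holds one token list per sentence from the question
L = [line.split() for line in ['cant railway station', 'citadel hotel', ' police stn']]

print(slice_ngrams(L, 2))
print(slice_ngrams(L, 3))  # only the first sentence has three words
```

Because slicing works on any sequence, passing strings instead of token lists gives character n-grams.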