Suppose I have the following text:
'he is hdajs asdas da he is not asd as da s i am a da daas you am a'
I have already created all the bigrams from this text:
>>> bigrams_
[('he', 'is'), ('is', 'hdajs'), ('hdajs', 'asdas'), ('asdas', 'da'), ('da', 'he'), ('he', 'is'), ('is', 'not'), ('not', 'asd'), ('asd', 'as'), ('as', 'da'), ('da', 's'), ('s', 'i'), ('i', 'am'), ('am', 'a'), ('a', 'da'), ('da', 'daas'), ('daas', 'you'), ('you', 'am'), ('am', 'a')]
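Now I want to create a new list of tuples in which the first element is a running count, showing how many times that bigram has been observed in the text up to that position, and the second element is the bigram from the initial list. For example, in the list above, the last element ('am', 'a') has by then been seen 2 times, so in the new list it corresponds to the tuple (2, ('am', 'a')).
What is a concise, Pythonic way to do this?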
Answer 0 (score: 3)
You can use a defaultdict whose default value is a count object, and take the next value of that key's counter as you go, for example:
from collections import defaultdict
from itertools import count
dd = defaultdict(lambda: count(1))
bigrams = [('he', 'is'), ('is', 'hdajs'), ('hdajs', 'asdas'), ('asdas', 'da'), ('da', 'he'), ('he', 'is'), ('is', 'not'), ('not', 'asd'), ('asd', 'as'), ('as', 'da'), ('da', 's'), ('s', 'i'), ('i', 'am'), ('am', 'a'), ('a', 'da'), ('da', 'daas'), ('daas', 'you'), ('you', 'am'), ('am', 'a')]
with_count = [(next(dd[bigram]), bigram) for bigram in bigrams]
This gives you:
[(1, ('he', 'is')),
(1, ('is', 'hdajs')),
(1, ('hdajs', 'asdas')),
(1, ('asdas', 'da')),
(1, ('da', 'he')),
(2, ('he', 'is')),
(1, ('is', 'not')),
(1, ('not', 'asd')),
(1, ('asd', 'as')),
(1, ('as', 'da')),
(1, ('da', 's')),
(1, ('s', 'i')),
(1, ('i', 'am')),
(1, ('am', 'a')),
(1, ('a', 'da')),
(1, ('da', 'daas')),
(1, ('daas', 'you')),
(1, ('you', 'am')),
(2, ('am', 'a'))]
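To make the mechanism explicit, here is a minimal sketch of the same running count written as a plain loop over an integer dictionary instead of itertools.count objects. The function name running_counts is illustrative and not part of the answer; the bigrams are rebuilt from the question's text with zip, assuming whitespace tokenization.

```python
from collections import defaultdict

def running_counts(bigrams):
    """Pair each bigram with the number of times it has been seen so far."""
    seen = defaultdict(int)  # bigram -> occurrences up to the current position
    result = []
    for bigram in bigrams:
        seen[bigram] += 1
        result.append((seen[bigram], bigram))
    return result

text = 'he is hdajs asdas da he is not asd as da s i am a da daas you am a'
words = text.split()
bigrams = list(zip(words, words[1:]))  # adjacent word pairs

with_count = running_counts(bigrams)
print(with_count[5])   # (2, ('he', 'is'))
print(with_count[-1])  # (2, ('am', 'a'))
```

Each defaultdict entry in the answer above plays the same role as the integer here: next() on the per-key counter returns 1, 2, 3, ... on successive sightings of that bigram.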
Answer 1 (score: 0)
You can try this:
s = 'he is hdajs asdas da he is not asd as da s i am a da daas you am a'
s1 = s.split()
new_data = list(set([(s.count(' '.join(b)), b) for b in [(s1[i], s1[i+1]) for i in range(len(s1)-1)]]))
Output:
[(2, ('am', 'a')), (1, ('da', 'daas')), (1, ('not', 'asd')), (1, ('s', 'i')), (1, ('da', 'he')), (1, ('you', 'am')), (2, ('he', 'is')), (1, ('is', 'not')), (1, ('asdas', 'da')), (1, ('asd', 'as')), (1, ('hdajs', 'asdas')), (1, ('a', 'da')), (1, ('daas', 'you')), (2, ('as', 'da')), (1, ('da', 's')), (1, ('is', 'hdajs')), (1, ('i', 'am'))]
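One caveat with this approach: s.count counts substring matches, which is why ('as', 'da') is reported as 2 in the output above ('as da' also matches inside 'asdas da'). The set also discards the original order and the repeated entries, so the result is a total per distinct bigram rather than a running count. If totals are what is wanted, a minimal sketch that counts the tuples themselves with collections.Counter avoids the substring problem; the name totals is illustrative.

```python
from collections import Counter

s = 'he is hdajs asdas da he is not asd as da s i am a da daas you am a'
words = s.split()
bigrams = list(zip(words, words[1:]))

# Count whole bigram tuples instead of searching for substrings in the text.
totals = Counter(bigrams)
new_data = [(n, b) for b, n in totals.items()]

print(s.count('as da'))       # 2 (substring matches, including inside 'asdas da')
print(totals[('as', 'da')])   # 1
print(totals[('am', 'a')])    # 2
```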
Answer 2 (score: 0)
I like @JonClements' itertools count-based solution (+1), but I don't think the defaultdict is necessary:
from itertools import count
text = 'he is hdajs asdas da he is not asd as da s i am a da daas you am a'
words = text.split()
bigrams = zip(words, words[1:])
seen = dict()
result = [(next(seen.setdefault(bigram, count(1))), bigram) for bigram in bigrams]
print(*result, sep='\n')
Output:
(1, ('he', 'is'))
(1, ('is', 'hdajs'))
(1, ('hdajs', 'asdas'))
(1, ('asdas', 'da'))
(1, ('da', 'he'))
(2, ('he', 'is'))
(1, ('is', 'not'))
(1, ('not', 'asd'))
(1, ('asd', 'as'))
(1, ('as', 'da'))
(1, ('da', 's'))
(1, ('s', 'i'))
(1, ('i', 'am'))
(1, ('am', 'a'))
(1, ('a', 'da'))
(1, ('da', 'daas'))
(1, ('daas', 'you'))
(1, ('you', 'am'))
(2, ('am', 'a'))