我在使用NLTK 3.2的Python 3.6中工作。
我正在尝试编写一个程序,该程序将原始文本作为输入,并输出以相同字母开头的任何(最大)连续单词系列(即,指示序列)。
在搜索序列时,我想忽略某些单词和标点符号(例如,'it','that','into','s',','和'。'),但要包括他们在输出中。
例如,输入
"The door was ajar. So it seems that Sam snuck into Sally's subaru."
应该产生
["so", "it", "seems", "that", "sam", "snuck", "into", "sally's", "subaru"]
我是编程新手,我能想到的最好的是:
import nltk
from nltk import word_tokenize
raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru."
tokened_text = word_tokenize(raw) #word tokenize the raw text with NLTK's word_tokenize() function
tokened_text = [w.lower() for w in tokened_text] #make it lowercase
for w in tokened_text: #for each word of the text
letter = w[0] #consider its first letter
allit_str = []
allit_str.append(w) #add that word to a list
pos = tokened_text.index(w) #let "pos" be the position of the word being considered
for i in range(1,len(tokened_text)-pos): #consider the next word
if tokened_text[pos+i] in {"the","a","an","that","in","on","into","it",".",",","'s"}: #if it's one of these
allit_str.append(tokened_text[pos+i]) #add it to the list
i=+1 #and move on to the next word
elif tokened_text[pos+i][0] == letter: #or else, if the first letter is the same
allit_str.append(tokened_text[pos+i]) #add the word to the list
i=+1 #and move on to the next word
else: #or else, if the letter is different
break #break the for loop
if len(allit_str)>=2: #if the list has two or more members
print(allit_str) #print it
输出
['ajar', '.']
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
['snuck', 'into', 'sally', "'s", 'subaru', '.']
['sally', "'s", 'subaru', '.']
['subaru', '.']
这接近我想要的,除了我不知道如何限制程序只打印最大序列。
所以我的问题是:
['so', 'it', 'seems', 'that', 'sam', 'snuck', 'into', 'sally', "'s", 'subaru', '.']
?以下是其他地方提出的类似问题,但这些问题没有帮助我修改我的代码:
(我也认为在这个网站上回答这个问题会很好。)
答案 0 :(得分:2)
有趣的任务。就个人而言,我会在不使用索引的情况下循环,跟踪前一个单词,将其与当前单词进行比较。
此外,仅仅比较字母是不够的;你必须考虑到's'和'sh'等不要抄袭。这是我的尝试:
import nltk
from nltk import word_tokenize
from nltk import sent_tokenize
from nltk.corpus import stopwords
import string
from collections import defaultdict, OrderedDict
import operator
raw = "The door was ajar. So it seems that Sam snuck into Sally's subaru. She seems shy sometimes. Someone save Simon."
# Get the English alphabet as a list of letters
letters = [letter for letter in string.ascii_lowercase]
# Here we add some extra phonemes that are distinguishable in text.
# ('sailboat' and 'shark' don't alliterate, for instance)
# Digraphs go first as we need to try matching these before the individual letters,
# and break out if found.
sounds = ["ch", "ph", "sh", "th"] + letters
# Use NLTK's built in stopwords and add "'s" to them
stopwords = stopwords.words('english') + ["'s"] # add extra stopwords here
stopwords = set(stopwords) # sets are MUCH faster to process
sents = sent_tokenize(raw)
alliterating_sents = defaultdict(list)
for sent in sents:
tokenized_sent = word_tokenize(sent)
# Create list of alliterating word sequences
alliterating_words = []
previous_initial_sound = ""
for word in tokenized_sent:
for sound in sounds:
if word.lower().startswith(sound): # only lowercasing when comparing retains original case
initial_sound = sound
if initial_sound == previous_initial_sound:
if len(alliterating_words) > 0:
if previous_word == alliterating_words[-1]: # prevents duplication in chains of more than 2 alliterations, but assumes repetition is not alliteration)
alliterating_words.append(word)
else:
alliterating_words.append(previous_word)
alliterating_words.append(word)
else:
alliterating_words.append(previous_word)
alliterating_words.append(word)
break # Allows us to treat sh/s distinctly
# This needs to be at the end of the loop
# It sets us up for the next iteration
if word not in stopwords: # ignores stopwords for the purpose of determining alliteration
previous_initial_sound = initial_sound
previous_word = word
alliterating_sents[len(alliterating_words)].append(sent)
sorted_alliterating_sents = OrderedDict(sorted(alliterating_sents.items(), key=operator.itemgetter(0), reverse=True))
# OUTPUT
print ("A sorted ordered dict of sentences by number of alliterations:")
print (sorted_alliterating_sents)
print ("-" * 15)
max_key = max([k for k in sorted_alliterating_sents]) # to get sent with max alliteration
print ("Sentence(s) with most alliteration:", sorted_alliterating_sents[max_key])
这会产生一个排序有序的句子字典,其头韵计数作为其键。 max_key
变量包含最高的一个或多个句子的计数,并且可用于访问句子本身。
答案 1 :(得分:0)
可接受的答案非常全面,但是我建议使用卡内基梅隆大学的发音词典。这部分是因为它使生活更轻松,另一部分是因为不一定是字母到字母的相同发音音节也被认为是重复词。我在网上(https://examples.yourdictionary.com/alliteration-examples.html)发现的一个例子是“ 芬恩因菲比而倒”。
# nltk.download('cmudict') ## download CMUdict for phoneme set
# The phoneme dictionary consists of ARPABET which encode
# vowels, consonants, and a representitive stress-level (wiki/ARPABET)
phoneme_dictionary = nltk.corpus.cmudict.dict()
stress_symbols = ['0', '1', '2', '3...', '-', '!', '+', '/',
'#', ':', ':1', '.', ':2', '?', ':3']
# nltk.download('stopwords') ## download stopwords (the, a, of, ...)
# Get stopwords that will be discarded in comparison
stopwords = nltk.corpus.stopwords.words("english")
# Function for removing all punctuation marks (. , ! * etc.)
no_punct = lambda x: re.sub(r'[^\w\s]', '', x)
def get_phonemes(word):
if word in phoneme_dictionary:
return phoneme_dictionary[word][0] # return first entry by convention
else:
return ["NONE"] # no entries found for input word
def get_alliteration_level(text): # alliteration based on sound, not only letter!
count, total_words = 0, 0
proximity = 2 # max phonemes to compare to for consideration of alliteration
i = 0 # index for placing phonemes into current_phonemes
lines = text.split(sep="\n")
for line in lines:
current_phonemes = [None] * proximity
for word in line.split(sep=" "):
word = no_punct(word) # remove punctuation marks for correct identification
total_words += 1
if word not in stopwords:
if (get_phonemes(word)[0] in current_phonemes): # alliteration occurred
count += 1
current_phonemes[i] = get_phonemes(word)[0] # update new comparison phoneme
i = 0 if i == 1 else 1 # update storage index
alliteration_score = count / total_words
return alliteration_score
上面是建议的脚本。引入变量proximity
是为了使我们以重复的方式考虑音节,否则这些音节由多个单词分隔。 stress_symbols
变量反映了CMU词典上指示的压力水平,可以很容易地将其合并到函数中。