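OK, so I'm doing an assignment for a course in my linguistics BA, where we use Python to process text. This is what I need to do:
Create a script that counts the frequencies of trigrams.
- Do not add dummy tokens.
- Lowercase each token and join the tokens of a trigram with an underscore.
- What is the missing value in the output box?
- Bonus: try to do this by storing the trigrams in a dictionary.
This is as far as I have gotten: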
lyrics = "Do you remember 21st night of September ? Love was changing the mind of pretenders While chasing the clouds away Our hearts were ringing In the key that our souls were singing As we danced in the night Remember how the stars stole the night away yeah yeah yeah Hey hey hey Ba de ya say do you remember ? Ba de ya dancing in September Ba de ya never was a cloudy day Ba duda ba duda ba duda badu Ba duda badu ba duda badu Ba duda badu ba duda yeah My thoughts are with you Holding hands with your heart to see you Only blue talk and love Remember how we knew love was here to stay Now December Found the love we shared in September Only blue talk and love Remember the true love we share today Hey hey hey Ba de ya say do you remember ? Ba de ya dancing in September Ba de ya never was a cloudy day There was a Ba de ya say do you remember ? Ba de ya dancing in September Ba de ya golden dreams were shiny days Now our bell was ringing aha Our souls was singing Do you remember every cloudy day yau There was a Ba de ya say do you remember ? Ba de ya dancing in September Ba de ya never was a cloudy day There was a Ba de ya say do you remember ? Ba de ya dancing in September Ba de ya golden dreams were shiny days Ba de ya de ya de ya Ba de ya de ya de ya Ba de ya de ya de ya de ya Ba de ya de ya de ya Ba de ya de ya de ya Ba de ya de ya de ya de ya"
lyric = lyrics.lower()
listText = lyric.split(" ")
freq = {}
while len(listText) > 2:
    trigram = (listText[0], listText[1], listText[2])
    if trigram in freq.keys():
        freq[trigram] += 1
    else:
        freq[trigram] = 1
    listText.pop(0)
sorted_data = sorted(freq.items(), key=lambda x: x[1], reverse=True)
for entry in sorted_data:
    print(str(entry[0])+"\t"+str(entry[1]))
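The only part I'm missing is joining the trigrams with underscores. It should be simple, but I can't for the life of me figure out how to do it. The expected output is each trigram joined into a single string, followed by the frequency of that trigram. The teacher said it can be solved quite easily, but I don't see how. It's funny, because everything else I did here was (relatively speaking) quick and easy.
I've tried a lot of things, but for some reason I can't get it to work.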
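Answer 0 (score: 2)
You can use the string join method. All you have to do is call '_'.join on the trigram tuple when you print it. The outer str() is not needed, because join already returns a string:
print('_'.join(entry[0]) + "\t" + str(entry[1]))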
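As a small self-contained illustration of the join call above, here is a minimal sketch. The sample freq dictionary and the helper name format_entry are made up for this example and do not come from the question:

```python
# Hypothetical sample data: trigram tuples mapped to their counts.
freq = {
    ("ba", "de", "ya"): 3,
    ("do", "you", "remember"): 2,
}


def format_entry(trigram, count):
    """Join the tokens of a trigram with underscores, then append the count."""
    return "_".join(trigram) + "\t" + str(count)


# Most frequent trigrams first, as in the question's script.
for trigram, count in sorted(freq.items(), key=lambda x: x[1], reverse=True):
    print(format_entry(trigram, count))
```

join works here because every element of the tuple is already a string; only the integer count needs str().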
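Other notes:
(1) You can be more Pythonic and generate listText with a list comprehension, like so:
listText = [word.lower() for word in lyrics.split()]
(2) You can use the dictionary's setdefault method instead of the if/else to initialize and increment the trigram counts, like so:
freq.setdefault(trigram, 0)
and then increment with
freq[trigram] += 1
without any if/else block. Right now the if statement looks trigram up in freq.keys(). In Python 3 that lookup takes constant time on average (it is equivalent to writing trigram in freq), whereas in Python 2 keys() builds a list, so the lookup takes linear time. Writing trigram in freq is simpler and fast in both versions.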
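Putting the two notes together, one possible sketch of the counting step looks like this. The function name count_trigrams and the short sample string are assumptions made for illustration, and zip over staggered slices is swapped in here in place of the question's pop(0) loop:

```python
def count_trigrams(text):
    """Count lowercase word trigrams in text, without dummy padding tokens."""
    tokens = [word.lower() for word in text.split()]
    freq = {}
    # zip over three staggered views of the token list yields each trigram once.
    for trigram in zip(tokens, tokens[1:], tokens[2:]):
        freq.setdefault(trigram, 0)  # set to 0 only if the key is missing
        freq[trigram] += 1
    return freq


# Short made-up sample text instead of the full lyrics.
sample = "Ba de ya de ya de ya"
counts = count_trigrams(sample)
for trigram, count in sorted(counts.items(), key=lambda x: x[1], reverse=True):
    print("_".join(trigram) + "\t" + str(count))
```

Calling split() with no argument also discards repeated whitespace, so no empty tokens end up inside a trigram.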
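Answer 1 (score: 0)
If it is only about joining them, you can use str.join. Pass the tuple itself, not its unpacked elements:
trigram = (listText[0], listText[1], listText[2])
c_trigram = '_'.join(trigram)
You can see an example (a shameless self-plug) here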
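One detail worth checking when joining a tuple: str.join takes exactly one iterable argument, so the tuple is passed whole and is not unpacked with *. A minimal sketch with a made-up trigram:

```python
trigram = ("do", "you", "remember")

# Passing the tuple itself works: join iterates over its three strings.
joined = "_".join(trigram)
print(joined)

# Unpacking with * passes three separate arguments, which join rejects.
try:
    "_".join(*trigram)
    unpack_error = None
except TypeError as exc:
    unpack_error = exc
print(type(unpack_error).__name__)
```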