Question

I have some tuples of unigrams, bigrams, and trigrams, like this:

('be',)
('true',)
('But',)
('I',)
('And', 'but')
('but', 'my')
('my', 'Noble')
('For', 'thy', 'escape')
('thy', 'escape', 'would')
('escape', 'would', 'teach')
('would', 'teach', 'me')

I need to find all the duplicates, remove every copy except one, and format the result like this:

I 2
am 2
STOP 2
I am 2
am Sam 1
Sam I 1
Sam STOP 1
* Sam 1
* I 1
am STOP 1
* * I 1
* I am 1
I am Sam 1
am Sam STOP 1

The number at the end is how many times that n-gram occurs, and an asterisk means it was substituted in after a period.

My code so far:

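# Read the whole file and normalize punctuation; each '.' becomes a <STOP> token
with open(file, "r") as filestring:
    data = filestring.read().replace('\n', '').replace(',', ' ').replace('.', '    <STOP>').replace("'", '').replace(':', ' ')
# Split on whitespace into a flat list of tokens
txtlist = data.split()
# Build n-grams by zipping the token list against shifted copies of itself
uni = zip(*[txtlist[i:] for i in range(1)])
bi = zip(*[txtlist[i:] for i in range(2)])
tri = zip(*[txtlist[i:] for i in range(3)])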
with open("output.txt", "w") as myfile:
    for item in uni:
        myfile.write(str(item)+"\n")
    for item in bi:
        myfile.write(str(item)+"\n")
    for item in tri:
        myfile.write(str(item)+"\n")
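
One possible way to get from these tuples to the "words count" lines is to count them with `collections.Counter` and join each tuple with spaces. This is only a minimal sketch: the helper names (`ngrams`, `count_ngrams`, `format_counts`) and the sample token list are made up for illustration, and it does not attempt the asterisk substitution after a period.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-gram tuples over tokens, using the same zip trick as above."""
    return list(zip(*[tokens[i:] for i in range(n)]))


def count_ngrams(tokens, max_n=3):
    """Count every unigram, bigram, and trigram; duplicates collapse into one key."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def format_counts(counts):
    """Turn each (n-gram, count) pair into a line such as 'I am 2'."""
    return [" ".join(gram) + " " + str(count) for gram, count in counts.items()]


# Hypothetical sample input, not taken from the question's file
tokens = "I am Sam STOP Sam I am STOP".split()
counts = count_ngrams(tokens)
lines = format_counts(counts)

for line in lines:
    print(line)
```

Each distinct tuple appears once as a key of the `Counter`, so writing `lines` to `output.txt` instead of `str(item)` would give one line per n-gram with its count at the end.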

Python: formatting and counting tuples

0 answers: