Natural language corpus: converting strings to int

Time: 2018-11-22 17:50:21

Tags: python iteration corpus

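Draw a sample of sentences from each of the corpora corpus1, corpus2 and corpus3, and display the average length (measured as the number of characters in a sentence).

So I have 3 corpora, and sample_raw_sents is a function already defined on them that returns random sentences: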

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()  
sample_size=50
for sentence in tcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    print(len(sentence))  

This code prints every length, but how do I sum() those lengths?

3 Answers:

Answer 0: (score: 1)

Use zip, which lets you pull one sentence from each corpus at a time.

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()  
sample_size=50

zipped = zip(tcr.sample_raw_sents(sample_size),
             rcr.sample_raw_sents(sample_size),
             mcr.sample_raw_sents(sample_size))

for s1, s2, s3 in zipped:
    summed = len(s1) + len(s2) + len(s3)
    average = summed/3
    print(summed, average)
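
To try the zip approach without the original corpora, here is a minimal sketch. FakeCorpus and its sentences are hypothetical stand-ins for corpus1, corpus2 and corpus3, and it assumes sample_raw_sents(n) returns n sentences as plain strings. One thing to keep in mind: zip stops at the shortest input, so if one corpus returns fewer sentences than the others, the surplus sentences are silently dropped.

```python
import random


class FakeCorpus:
    """Hypothetical stand-in for corpus1 / corpus2 / corpus3."""

    def __init__(self, sentences):
        self.sentences = list(sentences)

    def sample_raw_sents(self, n):
        # Assumption: the real method returns n random sentences as strings.
        return random.sample(self.sentences, n)


tcr = FakeCorpus(["a b", "hello there", "ok"])
rcr = FakeCorpus(["one two three", "x", "yes no"])
mcr = FakeCorpus(["short", "a longer sentence", "mid size"])
sample_size = 3

# One average per triple of sentences (one sentence from each corpus).
row_averages = []
for s1, s2, s3 in zip(tcr.sample_raw_sents(sample_size),
                      rcr.sample_raw_sents(sample_size),
                      mcr.sample_raw_sents(sample_size)):
    row_averages.append((len(s1) + len(s2) + len(s3)) / 3)

# Every triple has the same size, so the mean of the row averages
# equals the mean over all sampled sentences.
overall = sum(row_averages) / len(row_averages)
print(overall)
```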

Answer 1: (score: 0)

You can store the lengths of all the sentences in a list and then sum them.

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()  
sample_size=50

lengths = []
for sentence in tcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))

print(sum(lengths) / len(lengths))
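
The same idea can be written without the intermediate list. This is only a sketch: average_sentence_length is a hypothetical helper name, and the three small lists stand in for the results of the sample_raw_sents(...) calls, assumed to be iterables of strings. itertools.chain walks the samples one after another, and statistics.mean raises StatisticsError if there are no sentences at all.

```python
from itertools import chain
from statistics import mean


def average_sentence_length(*samples):
    """Mean character count over every sentence in the given samples."""
    return mean(len(sentence) for sentence in chain(*samples))


# Hypothetical data standing in for tcr/rcr/mcr.sample_raw_sents(sample_size).
tcr_sample = ["ab", "abcd"]
rcr_sample = ["abcdef"]
mcr_sample = ["a", "ab"]

result = average_sentence_length(tcr_sample, rcr_sample, mcr_sample)
print(result)
```

Because the divisor is the number of sentences actually seen, this still gives the right answer when a corpus returns fewer sentences than requested.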

Answer 2: (score: -1)

tcr = corpus1()
rcr = corpus2()
mcr = corpus3()  
sample_size=50
s = 0
for sentence in tcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in rcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in mcr.sample_raw_sents(sample_size):
    s = s + len(sentence)

average = s / (3 * sample_size)  # three corpora, sample_size sentences each
print('average: {}'.format(average))
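
The question could also be read as asking for one average per corpus rather than a single pooled number. A minimal sketch of that reading follows; per_corpus_averages is a hypothetical helper, and the dictionary values stand in for the sample_raw_sents(...) results, assumed to be iterables of strings.

```python
def per_corpus_averages(samples_by_name):
    """Map each corpus name to the mean sentence length of its sample."""
    averages = {}
    for name, sentences in samples_by_name.items():
        sentences = list(sentences)
        if sentences:  # skip empty samples to avoid dividing by zero
            total = sum(len(sentence) for sentence in sentences)
            averages[name] = total / len(sentences)
    return averages


# Hypothetical data standing in for the three sampled corpora.
samples = {
    "corpus1": ["ab", "abcd"],
    "corpus2": ["abcdef"],
    "corpus3": [],
}

averages = per_corpus_averages(samples)
for name, value in averages.items():
    print('{}: {}'.format(name, value))
```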