Question

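I am trying to build a dependency parser from a corpus. The corpus is in CoNLL format, so I have a function that reads the files and returns a list of lists, where each inner list is a parsed sentence (the corpus I am using is already parsed; my job is to find an alternative way to parse it). My professor asked me to randomly select 5% of the sentences in the corpus, because it is too large.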

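I tried creating an empty list and using the append function, but I don't know how to specify, by index, that I want 5 out of every 100 sentences in the corpus.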

The function I use to convert the CoNLL files is the following:

import os, nltk, glob
def read_files(path):
    """
    Function to load Ancora Dependency corpora (GLICOM style)
    path = full path to the files
    returns the corpus in sentences
        each sentence is a list of tuples
            each tuple is a token with the following info:
                index of the token in the sentence
                token
                lemma
                POS (can be removed)
                POS
                FEAT (can be removed)
                head
                DepRelation
    """
    corpus = []
    for f in glob.glob(path):
        sents1 = open(f).read()[185:-2].split('\n\n')
        sents2 = []
        for n in range(len(sents1)):
            sents2.append(sents1[n].split('\n'))
        sents3 = []
        for s in sents2:
            sent = []
            for t in s:
                sent.append(tuple(t.split('\t')))
            sents3.append(sent)
        corpus.extend(sents3)
    return corpus

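I would like a way to select 5 sentences out of every 100 in the corpus, so that I end up with a list of lists containing only those sentences. Thanks in advance!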

Answer 1

Just use random.sample:

import random

# define path here
corpus = read_files(path)

# keep 5% of the sentences (one in twenty), chosen at random
sample = random.sample(corpus, len(corpus) // 20)
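
As a minimal sketch of how this could be wrapped up for reuse: the helper name `sample_fraction`, its `fraction` and `seed` parameters, and the toy corpus below are assumptions made for illustration, not part of the original answer. Using a dedicated, seeded `random.Random` instance should make the 5% subset reproducible between runs, which may help if parser results need to be compared later.

```python
import random


def sample_fraction(items, fraction=0.05, seed=None):
    """Return a random subset holding roughly `fraction` of `items`.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    rng = random.Random(seed)        # private generator, global state untouched
    k = int(len(items) * fraction)   # number of items to keep, rounded down
    return rng.sample(items, k)      # sampling without replacement


# Toy stand-in for the list of sentences returned by read_files()
toy_corpus = [[("1", "word%d" % i)] for i in range(200)]
subset = sample_fraction(toy_corpus, fraction=0.05, seed=0)
print(len(subset))
```

With the real data, the call would presumably look like `sample_fraction(read_files(path), 0.05)`.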

Answer 2

Could you add a loop that builds the list? Something like this, using the modulo operator "%", picks only 5 out of every 100 sentences:

counter = 0
new_list = []
for i in my_list:
    counter = counter + 1
    # keep every 20th element, i.e. 5 out of every 100
    if counter % 20 == 0:
        new_list.append(i)
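
Note that the loop above always keeps the 20th, 40th, 60th, ... element, so the selection is evenly spaced rather than random. If the goal is literally 5 random sentences from each block of 100, one possible sketch is to walk over the list in chunks and sample inside each chunk. The function name `sample_per_block` and its parameters are assumptions made for this illustration.

```python
import random


def sample_per_block(items, block_size=100, per_block=5, seed=None):
    """Pick `per_block` random items from each consecutive block of `block_size`.

    Hypothetical helper for illustration; a short final block contributes
    at most as many items as it contains.
    """
    rng = random.Random(seed)
    selected = []
    for start in range(0, len(items), block_size):
        block = items[start:start + block_size]
        k = min(per_block, len(block))   # guard against a short last block
        selected.extend(rng.sample(block, k))
    return selected


my_list = list(range(250))               # toy data standing in for sentences
new_list = sample_per_block(my_list, seed=1)
print(len(new_list))
```

One caveat of this design: a final block shorter than `block_size` still yields up to `per_block` items, so it is slightly over-represented compared with the full blocks.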
