Question

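OK, let me simplify my question:

I have a list (of documents), each of which is a list of sentences stored as str. For example: a = [['Sent1 from first doc!','Sent2 from first doc.'],['Sent1 from 2nd doc.','Sent2 from 2nd doc.']]

Now I am trying to split each sentence into a list of words. The result should be an outer list (the documents) containing one list per document (its sentences), where each sentence is in turn a list of the words in that sentence (as str).

Unfortunately, my code produces a single list of sentences, each split into words. As a result, I can no longer tell which document each sentence came from.

My code is as follows: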

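import re

sentcs = []
for i in range(len(a)):
    for p in range(len(a[i])):
        # Split the sentence into words (on whitespace and capital letters)
        spr = re.findall(r'[A-Z]?[^A-Z\s]+|[A-Z]+', a[i][p])
        # Appending here puts every sentence into one list, so the document grouping is lost
        sentcs.append(spr)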

But that is not what I want. I want a list of lists of lists. Or is it bad programming practice to build something like that?

Answer 1

    li = ['Help! Be nice.', 'Thx. Help appreciated.']

    for el in li:
        # Split once at the first space: the first word, then the rest
        parts = el.split(' ', 1)
        print((parts[0], parts[1:]))

    ('Help!', ['Be nice.'])
    ('Thx.', ['Help appreciated.'])


    from nltk.tokenize import sent_tokenize

    st = ['Help! Be nice.','Thx. Help appreciated.']

    for el in st:
        t = sent_tokenize(el)
        print(tuple((t[0], t[1:])))

    ('Help!', ['Be nice.'])
    ('Thx.', ['Help appreciated.'])
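For the nested structure the question asks about (documents, then sentences, then words), one possible sketch is a nested list comprehension that keeps the outer grouping intact. It reuses the regular expression and the sample list `a` from the question; the names `WORD_RE` and `docs` are made up here for illustration.

```python
import re

# Pattern taken from the question: splits on whitespace and on capital letters.
WORD_RE = re.compile(r'[A-Z]?[^A-Z\s]+|[A-Z]+')

a = [['Sent1 from first doc!', 'Sent2 from first doc.'],
     ['Sent1 from 2nd doc.', 'Sent2 from 2nd doc.']]

# Outer list: documents; middle list: sentences; inner list: words.
docs = [[WORD_RE.findall(sentence) for sentence in doc] for doc in a]

print(docs[0][0])  # words of the first sentence of the first document
print(docs[1][1])  # words of the second sentence of the second document
```

With this layout, `docs[i][j]` is the word list for sentence `j` of document `i`, so the document each sentence came from is never lost. A list of lists of lists is a reasonable choice when the data really has three levels.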

List of lists of lists? Applying regex and nltk
