Question

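I have a piece of text that I have already split into sentences (the list `sentences`). First, I find the indexes of all the sentences that contain numbers: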

import re
from itertools import groupby
from operator import itemgetter

# Indexes of the sentences that contain at least one digit
indexes = []
for index, sentence in enumerate(sentences):
    if re.search(r'\d+', sentence):
        indexes.append(index)

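I want to extract every sequence of sentences that contain numbers, together with the sentence before and the sentence after each sequence. So the output should be the following strings: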

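"The number of adults who read at least one novel within the past 12 months fell to 47%. Fiction reading rose from 2002 to 2008. The drop in fiction reading last year occurred mostly among men."

"Women read more fiction. 50% more. Though it decreased by 10% over the last decade. Men are more likely to read nonfiction."

"Young adults are more likely to read fiction. Just 54% of Americans cracked open a book of any kind last year. But novels have suffered more than nonfiction."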

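With the indexes of the number-containing sentences in hand, I break them up depending on whether they form a run of consecutive indexes or a single index: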

index_groupings = []
# Consecutive indexes share the same (position - value) difference
for k, g in groupby(enumerate(indexes), lambda pair: pair[0] - pair[1]):
    index_groupings.append(list(map(itemgetter(1), g)))

multiple_sents = [] #store sentence sequences
single_sent = [] #store single sentences
multiple_indexes = [] 
single_index = []
for grouping in index_groupings:
    if len(grouping) > 1:
        multiple_indexes.append(grouping)       
    else:
        single_index.append(grouping)
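
For illustration, here is a minimal, self-contained sketch of the same `groupby` idiom run on made-up data. The helper name `consecutive_runs` and the list `example_indexes` are hypothetical and exist only for this demo; the sketch assumes the indexes are sorted in ascending order.

```python
from itertools import groupby

def consecutive_runs(indexes):
    """Split a sorted list of ints into runs of consecutive values."""
    runs = []
    # Within a run, (position - value) is constant, so it works as the group key.
    for _, group in groupby(enumerate(indexes), key=lambda pair: pair[0] - pair[1]):
        runs.append([value for _, value in group])
    return runs

example_indexes = [1, 2, 5, 6, 9]  # hypothetical indexes, chosen only for the demo
runs = consecutive_runs(example_indexes)
print(runs)  # [[1, 2], [5, 6], [9]]
```

Runs of length one, such as `[9]` here, correspond to the single-index case.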

I then use those groupings to separate out the multi-sentence sequences and the single sentences:

# Looping over an empty list does nothing, so no emptiness checks are needed
for grouping in multiple_indexes:
    for index in grouping:
        multiple_sents.append(sentences[index])
for grouping in single_index:
    for index in grouping:
        single_sent.append(sentences[index])

print(multiple_sents)
print(single_sent)

When I print the resulting multi-sentence and single-sentence lists, I get:

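['The number of adults who read at least one novel within the past 12 months fell to 47%.', 'Fiction reading rose from 2002 to 2008.', '50% more.', 'Though it decreased by 10% over the last decade.']

['Just 54% of Americans cracked open a book of any kind last year.']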

What is the best way to join the sequences that belong together so that I get the desired output above? Is there a cleaner way?
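
One possible direction, offered as a minimal sketch and not a definitive answer: group the matching indexes into consecutive runs, widen each run by one sentence on either side, and join the slice. The function name `number_sequences`, its `context` parameter, and the `demo` list are hypothetical and used only for illustration. The sketch also assumes that if two runs are separated by a single sentence, it is acceptable for that sentence to appear in both passages.

```python
import re
from itertools import groupby

def number_sequences(sentences, context=1):
    """Join each run of digit-containing sentences with `context` neighbours per side."""
    # Indexes of the sentences that contain at least one digit
    hits = [i for i, s in enumerate(sentences) if re.search(r"\d", s)]
    results = []
    # Consecutive indexes share the same (position - value) difference
    for _, group in groupby(enumerate(hits), key=lambda pair: pair[0] - pair[1]):
        run = [i for _, i in group]
        start = max(run[0] - context, 0)                    # clamp at the first sentence
        stop = min(run[-1] + context, len(sentences) - 1)   # clamp at the last sentence
        results.append(" ".join(sentences[start:stop + 1]))
    return results

# Made-up stand-in for the real `sentences` list
demo = ["A 1.", "B 2.", "C.", "D.", "E 3.", "F."]
passages = number_sequences(demo)
print(passages)  # ['A 1. B 2. C.', 'D. E 3. F.']
```

This handles runs and single sentences in the same loop, so the separate `multiple_indexes` and `single_index` lists are not needed.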

How do I extract only the sentence sequences that contain numbers?

0 answers: