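I have a nested list of strings: a corpus made up of lists of different lengths. I want to keep only the strings whose length is greater than 2.
Following the similar question "how to remove an element from a nested list?", I tried all of the answers that let me specify the condition length > 2.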
corpus = list(r_corpus('teeny.txt'))
print('initial corpus here ',corpus)
#Current attempt
[[ subelt for subelt in elt if len(subelt) >2 ] for elt in corpus]
#previous attempt 1
##for thing in corpus:
## [y for y in thing if len(y)>2]
#previous attempt 2
##for sentence in corpus:
## sentence = [x for x in sentence if len(x) > 2 ]
print('\n\n corpus here without any string of length 2 or smaller',corpus)
This is the output of the current attempt; it is the same for the two previous attempts.
initial corpus here
[['extracting', 'opinions'],
['soo', 'min', 'kim', 'and'],
['abstract'],
['this', 'paper', 'presents', 'method', 'for', 'identifying', 'an'],
['this', 'section', 'reviews', 'previous', 'works', 'in'],
['subjectivity', 'detection', 'is'],
['work', 'is', 'similar', 'to', 'ours', 'but', 'different']]
corpus here without any string of length 2 or smaller
[['extracting', 'opinions'],
['soo', 'min', 'kim', 'and'],
['abstract'],
['this', 'paper', 'presents', 'method', 'for', 'identifying', 'an'],
['this', 'section', 'reviews', 'previous', 'works', 'in'],
['subjectivity', 'detection', 'is'],
['work', 'is', 'similar', 'to', 'ours', 'but', 'different']]
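What is the fastest way to get the second version of the corpus, without any string of length 2 or smaller? The expected output is:
corpus here without any string of length 2 or smaller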
[['extracting', 'opinions'],
['soo', 'min', 'kim', 'and'],
['abstract'],
['this', 'paper', 'presents', 'method', 'for', 'identifying'],
['this', 'section', 'reviews', 'previous', 'works'],
['subjectivity', 'detection'],
['work','similar','ours', 'but', 'different']]
Thanks.
Answer 0 (score: 0)
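@Vera, you can try the code below. It uses lambda functions together with map() and filter().
List comprehensions, lambda functions, map(), filter(), reduce() and similar tools are a Pythonic way to solve this kind of problem simply, efficiently and concisely.
You can look up list comprehensions and map(), filter(), reduce() and lambda functions to find examples and explanations of these concepts.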
import json
corpus = [['extracting', 'opinions'],
['soo', 'min', 'kim', 'and'],
['abstract'],
['this', 'paper', 'presents', 'method', 'for', 'identifying', 'an'],
['this', 'section', 'reviews', 'previous', 'works', 'in'],
['subjectivity', 'detection', 'is'],
['work', 'is', 'similar', 'to', 'ours', 'but', 'different']]
# For each inner list, keep only the words longer than 2 characters
new_corpus = list(map(lambda words: list(filter(lambda word: len(word) > 2, words)), corpus))
# Pretty printing list of lists of words of length > 2
print(json.dumps(new_corpus, indent=2))
"""
[
  [
    "extracting",
    "opinions"
  ],
  [
    "soo",
    "min",
    "kim",
    "and"
  ],
  [
    "abstract"
  ],
  [
    "this",
    "paper",
    "presents",
    "method",
    "for",
    "identifying"
  ],
  [
    "this",
    "section",
    "reviews",
    "previous",
    "works"
  ],
  [
    "subjectivity",
    "detection"
  ],
  [
    "work",
    "similar",
    "ours",
    "but",
    "different"
  ]
]
"""
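As a side note on why the attempts in the question leave corpus unchanged: a list comprehension builds a new list instead of modifying the one it iterates over, so its result has to be assigned to a name. Likewise, assigning to the loop variable (sentence = ...) only rebinds that variable and does not touch the inner list. Below is a minimal sketch using a shortened copy of the corpus; the names filtered and drop_short_words_in_place are illustrative and do not come from the original post.

```python
corpus = [
    ['extracting', 'opinions'],
    ['this', 'paper', 'presents', 'method', 'for', 'identifying', 'an'],
    ['work', 'is', 'similar', 'to', 'ours', 'but', 'different'],
]

# Option 1: build a new nested list and bind it to a name.
# The original corpus is left untouched.
filtered = [[word for word in sentence if len(word) > 2] for sentence in corpus]


def drop_short_words_in_place(sentences, min_len=3):
    """Remove words shorter than min_len from every inner list, in place."""
    for sentence in sentences:
        # Slice assignment replaces the contents of the existing list object,
        # whereas `sentence = [...]` would only rebind the loop variable.
        sentence[:] = [word for word in sentence if len(word) >= min_len]


# Option 2: mutate the existing inner lists.
drop_short_words_in_place(corpus)

print(filtered)
print(corpus)
```

Option 1 is usually the safer choice when other code still needs the unfiltered corpus; Option 2 avoids keeping two copies around if it does not.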