I have a text dataset from which I extract every "sentence" that contains the pattern r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'.
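I now want to shorten all long "sentences" (> 200 words) to something more readable, for example by keeping only 30 words before and after the pattern and replacing the trimmed parts with " ...".
Is there a clean way to do this?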
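Edit: the search is run on preprocessed text (lowercased, with stop words, punctuation, and some other hand-picked words removed), and the matching sentences are then stored in their original form. I want to trim the original sentences (with punctuation and stop words).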
Example:
import re

list_of_sentences = []
t1 = "This is a complete sentence, containing colors and other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black, sofa, brown. It will be preprocessed"
# preprocess() is my own function (not shown here)
t2 = preprocess(t1) # ---> "complete sentence containing colors words pink blue yellow tree chair orange green hello world black sofa brown preprocessed"
my_words_markers = "yellow orange".split()
pattern = r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'
match = re.search(pattern, t2, re.I)
if match:
    list_of_sentences.append(t1)
From this list_of_sentences, I want to trim the longest ones:
# what I want is a trimmed version of t1, with, e.g., 4 words before and after pattern:
"... other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black ..."
Answer 0 (score: 1)
You can extend the regular expression so that it also matches up to 30 words before and after the pattern:
pattern = r'(?:\w+\W+){,30}\b' + \
r' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + \
r'\b(?:\W+\w+){,30}'
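To see what the two added quantifiers do, here is a small sketch with the window shrunk from 30 to 2 words so the effect is visible on a short string. The sample text and the name demo_match are made up for illustration. Note that the middle part of the pattern still expects the words between the markers to be separated by single spaces, as in the preprocessed text.

```python
import re

my_words_markers = "yellow orange".split()

# Same construction as above, but with a window of 2 words instead of 30.
# In Python's re module, {,2} means "between 0 and 2 repetitions".
pattern = (r'(?:\w+\W+){,2}\b'
           + r' (?:\w+ )?(?:\w+ )?'.join(my_words_markers)
           + r'\b(?:\W+\w+){,2}')

# Hypothetical sample text, already free of punctuation.
text = "pink blue red yellow tree orange green hello world"

demo_match = re.search(pattern, text)
print(demo_match.group())  # blue red yellow tree orange green hello
```

"pink" on the left and "world" on the right fall outside the two-word window, so they are not part of the match.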
Then loop over all sentences and, if the regex matches, use match.start() and match.end() to check whether an ellipsis (...) has to be inserted:
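for sentence in sentences:
    # re.I as in the question, since the original sentences are not lowercased
    match = re.search(pattern, sentence, re.I)
    if match:
        # add an ellipsis only on the side where text was actually cut off
        text = '{}{}{}'.format('...' if match.start() > 0 else '',
                               match.group(),
                               '...' if match.end() < len(sentence) else '')
        print(text)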
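One caveat: the original sentences still contain punctuation and stop words, so the ' (?:\w+ )?(?:\w+ )?' part between the markers may fail to match them (in the question's example, "yellow, tree and chair, orange" has commas and three words in between). Below is a minimal sketch of a more tolerant variant, not a definitive solution. The helper names build_pattern and trim_around and the parameters context and max_gap are my own assumptions; max_gap has to be chosen large enough to cover the stop words that preprocessing removed.

```python
import re

def build_pattern(markers, context=30, max_gap=2):
    """Compile a regex for the markers plus up to `context` words on each side.

    Hypothetical helper: between two markers it allows up to `max_gap` words,
    separated by any run of non-word characters (spaces, commas, ...).
    """
    gap = r'\W+(?:\w+\W+){0,%d}' % max_gap
    core = gap.join(re.escape(m) for m in markers)
    before = r'(?:\w+\W+){0,%d}' % context
    after = r'(?:\W+\w+){0,%d}' % context
    return re.compile(before + r'\b' + core + r'\b' + after, re.I)

def trim_around(sentence, pattern):
    """Return the matched window, with ' ...' markers where text was cut."""
    match = pattern.search(sentence)
    if match is None:
        return sentence  # no marker found: leave the sentence untouched
    prefix = '... ' if match.start() > 0 else ''
    suffix = ' ...' if match.end() < len(sentence) else ''
    return prefix + match.group() + suffix

example = ("This is a complete sentence, containing colors and other words: "
           "pink, blue, yellow, tree and chair, orange, green, hello, world, "
           "black, sofa, brown. It will be preprocessed")

# 4 words of context, and up to 3 words ("tree and chair") between the markers.
trimmed = trim_around(example, build_pattern(["yellow", "orange"],
                                             context=4, max_gap=3))
print(trimmed)
# ... other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black ...
```

With context=30 this gives the 30-word window asked for. On very long sentences the nested quantifiers can backtrack a lot, so it may be worth timing this on real data.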