Trimming sentences around a pattern using regex in Python

Time: 2018-02-01 11:32:58

Tags: python regex substring

I have a text dataset from which I extract every "sentence" that contains the pattern r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'

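I now want to shorten all the long "sentences" (> 200 words) to something more readable, e.g. keeping only the 30 words before and after the pattern and replacing the trimmed parts with " ...".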

Is there a clean way to do this?

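EDIT: The search is done on the preprocessed text (lowercased, with stop words, punctuation, and some other hand-picked words removed), and the matching sentences are then stored in their original form. I want to do the trimming on the original sentence (with its punctuation and stop words).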

Example

import re

list_of_sentences = []
t1 = "This is a complete sentence, containing colors and other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black, sofa, brown. It will be preprocessed"
# preprocess() is my own preprocessing step (not shown here)
t2 = preprocess(t1)  # ---> "complete sentence containing colors words pink blue yellow tree chair orange green hello world black sofa brown preprocessed"
my_words_markers = "yellow orange".split()
pattern = r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'
match = re.search(pattern, t2, re.I)
if match:
    list_of_sentences.append(t1)  # store the original, unprocessed sentence
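
The preprocess function is not shown in the question. To make the example reproducible, here is a minimal stand-in, assuming a tiny hand-picked stop word list (STOPWORDS is hypothetical and chosen only so that the output matches the comment above; the real preprocessing may differ):

```python
import re

# Hypothetical stop word list, just large enough for the example sentence
STOPWORDS = {"this", "is", "a", "and", "other", "it", "will", "be"}

def preprocess(text):
    """Lowercase, drop punctuation, and remove stop words (illustrative only)."""
    words = re.findall(r'\w+', text.lower())
    return ' '.join(w for w in words if w not in STOPWORDS)
```

With this stand-in, the marker pattern finds "yellow tree chair orange" in the preprocessed text, so t1 is stored.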

From this list_of_sentences, I want to trim the longest ones:

# what I want is a trimmed version of t1, with, e.g., 4 words before and after pattern: 
"... other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black ..."

1 Answer:

Answer 0: (score: 1)

You can extend the regular expression so that it also matches up to 30 words before and after the pattern:

import re

# Up to 30 words of context on each side of the marker pattern
pattern = r'(?:\w+\W+){,30}\b' + \
          r' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + \
          r'\b(?:\W+\w+){,30}'

Then loop over all the sentences and, if the regex matches, use match.start() and match.end() to check whether an ellipsis has to be inserted:

for sentence in sentences:
    match = re.search(pattern, sentence)
    if match:
        # Add '...' only on the sides where text was actually cut off
        text = '{}{}{}'.format('...' if match.start() > 0 else '',
                               match.group(),
                               '...' if match.end() < len(sentence) else '')
        print(text)
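
One caveat: the literal spaces between the markers only match words separated by single spaces, so on an original sentence such as "yellow, tree and chair, orange" (punctuation plus a stop word in between) the pattern above would probably not match. Below is a hedged variant of the same idea that separates words with \W+ instead. The names build_pattern and trim and the parameters context, max_gap, and max_words are made up for this sketch; max_gap has to be chosen large enough to cover the stop words that preprocessing removed.

```python
import re

def build_pattern(markers, context=30, max_gap=2):
    """Markers with up to max_gap words between them, plus context words on each side."""
    # Lazy gap: as few in-between words as possible, any non-word separator
    gap = r'(?:\W+\w+){0,%d}?\W+' % max_gap
    core = gap.join(re.escape(m) for m in markers)
    return re.compile(
        r'(?:\w+\W+){0,%d}\b%s\b(?:\W+\w+){0,%d}' % (context, core, context),
        re.IGNORECASE,
    )

def trim(sentence, pattern, max_words=200):
    """Return the sentence cut down to the match and its context, with ' ...' markers."""
    if len(sentence.split()) <= max_words:
        return sentence  # short enough, leave untouched
    match = pattern.search(sentence)
    if not match:
        return sentence  # nothing to anchor the trimming on
    prefix = '... ' if match.start() > 0 else ''
    suffix = ' ...' if match.end() < len(sentence) else ''
    return prefix + match.group() + suffix
```

For the sentence t1 from the question, trim(t1, build_pattern(["yellow", "orange"], context=4, max_gap=3), max_words=10) should give the trimmed string the question asks for; with the default max_words=200 the sentence is returned unchanged because it is short.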