Question

I have a text dataset from which I extract all "sentences" that contain the pattern r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'.

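I now want to shorten all long "sentences" (> 200 words) into something more readable, for example by keeping only 30 words before and after the pattern and replacing the trimmed parts with "...".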

Is there a clean way to do this?

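Edit: the search is run on the preprocessed text (lowercased, with stopwords, punctuation, and other hand-picked words removed), and the matching sentences are then stored in their original form. I want to trim the original sentence (with its punctuation and stopwords).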

Example

import re

t1 = "This is a complete sentence, containing colors and other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black, sofa, brown. It will be preprocessed"
# preprocess() is my own helper: lowercases, removes stopwords and punctuation
t2 = preprocess(t1)  # ---> "complete sentence containing colors words pink blue yellow tree chair orange green hello world black sofa brown preprocessed"
my_words_markers = "yellow orange".split()
pattern = r'\b' + ' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + r'\b'
list_of_sentences = []
match = re.search(pattern, t2, re.I)
if match:
    list_of_sentences.append(t1)  # store the original, unprocessed sentence

In this list_of_sentences, I want to trim the longest ones:

# what I want is a trimmed version of t1, with, e.g., 4 words before and after pattern: 
"... other words: pink, blue, yellow, tree and chair, orange, green, hello, world, black ..."

Answer 1

You can extend the regular expression so that it also matches up to 30 words before and after the pattern:

pattern = r'(?:\w+\W+){,30}\b' + \
          r' (?:\w+ )?(?:\w+ )?'.join(my_words_markers) + \
          r'\b(?:\W+\w+){,30}'
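To see what the extended pattern captures, here is a small sketch that runs it on the preprocessed example from the question. It assumes a window of 4 words instead of 30 so the effect is visible on a short sentence; the names `window` and `core` are only illustrative.

```python
import re

# Preprocessed example sentence from the question (t2).
t2 = ("complete sentence containing colors words pink blue yellow tree "
      "chair orange green hello world black sofa brown preprocessed")
my_words_markers = "yellow orange".split()

window = 4  # illustrative; the pattern above uses 30

# Core pattern from the question: markers separated by at most two words.
core = r' (?:\w+ )?(?:\w+ )?'.join(my_words_markers)

# Up to `window` words of context on each side of the core pattern.
pattern = (r'(?:\w+\W+){,%d}\b' % window) + core + (r'\b(?:\W+\w+){,%d}' % window)

match = re.search(pattern, t2)
print(match.group())
```

Because the context groups are optional, a marker that sits closer than `window` words to either end of the sentence still matches, just with less context on that side.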

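Then iterate over all the sentences and, where the regular expression matches, use match.start() and match.end() to check whether an ellipsis (...) has to be inserted: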

for sentence in sentences:
    match = re.search(pattern, sentence)
    if match:
        text = '{}{}{}'.format('...' if match.start() > 0 else '',
                               match.group(),
                               '...' if match.end() < len(sentence) else '')
        print(text)
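One caveat: the pattern above joins the markers with literal spaces and allows at most two words between them, so it may fail on the original sentences, where punctuation and stopwords are still present (for example "yellow, tree and chair, orange"). The following is a minimal sketch of one way around that, not a definitive implementation. The function name `trim_around_markers` and the parameters `window`, `max_gap`, and `min_words` are hypothetical, and the value of `max_gap` is a guess that would need tuning to how many stopwords the preprocessing typically removes.

```python
import re

def trim_around_markers(sentence, markers, window=30, max_gap=4, min_words=200):
    """Trim `sentence` to `window` words around the markers.

    Sentences with at most `min_words` words, or without a match, are
    returned unchanged. All names and defaults here are illustrative.
    """
    if len(sentence.split()) <= min_words:
        return sentence

    # Between two markers, allow punctuation plus up to `max_gap` words,
    # because the original text still contains stopwords and punctuation.
    gap = r'\W+(?:\w+\W+){0,%d}' % max_gap
    core = gap.join(re.escape(m) for m in markers)

    # Up to `window` words of context on each side of the markers.
    pattern = (r'(?:\w+\W+){0,%d}\b' % window) + core + (r'\b(?:\W+\w+){0,%d}' % window)

    match = re.search(pattern, sentence, re.I)
    if not match:
        return sentence

    # Add an ellipsis only on the sides where text was actually cut.
    prefix = '... ' if match.start() > 0 else ''
    suffix = ' ...' if match.end() < len(sentence) else ''
    return prefix + match.group() + suffix


t1 = ("This is a complete sentence, containing colors and other words: pink, "
      "blue, yellow, tree and chair, orange, green, hello, world, black, sofa, "
      "brown. It will be preprocessed")

# min_words=0 forces trimming so the short example sentence shows the effect.
print(trim_around_markers(t1, ["yellow", "orange"], window=4, min_words=0))
```

With the defaults, only sentences longer than 200 words are trimmed, which is the threshold mentioned in the question.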

Tags: regex, python
