Question

This code:

import re

title = 'Decreased glucose-6-phosphate dehydrogenase activity along with oxidative stress affects visual contrast sensitivity in alcoholics.'

words = list(filter(None, re.split(r'\W+', title)))
for word in words:
    print(word)

produces:

Decreased
glucose
6
phosphate
dehydrogenase
activity
along
with
oxidative
stress
affects
visual
contrast
sensitivity
in
alcoholics

Ideally, I would like to prevent this word from being split:

glucose-6-phosphate

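Is there a better way to get the individual words of a sentence in Python? Should I use a regular expression or something out of the box (OOTB)? Thanks.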

Answer 1

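\w denotes the word characters (letters, digits, and the underscore). Since - is not one of those characters, \W+ matches it and the sentence is split there. Since you seem to want to split only at whitespace, you don't need a regular expression at all; a plain title.split() will do (note that this leaves the trailing period attached to the last word).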
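
A minimal sketch of the whitespace-only approach, assuming that punctuation at the edges of a word (such as the final period) should be dropped. The stripping step with string.punctuation is an addition for illustration and is not part of the original answer:

```python
import string

title = 'Decreased glucose-6-phosphate dehydrogenase activity along with oxidative stress affects visual contrast sensitivity in alcoholics.'

# Split on runs of whitespace only, so hyphenated terms stay intact.
# Assumption: punctuation at the edges of a token (e.g. the final period) is unwanted.
words = [w.strip(string.punctuation) for w in title.split()]

# Drop tokens that consisted of punctuation only and are now empty.
words = [w for w in words if w]

for word in words:
    print(word)
```

Because strip() only touches the two ends of each token, the hyphens inside glucose-6-phosphate are left alone.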

Answer 2

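For ASCII text, the pattern \W splits on this character class: [^a-zA-Z0-9_]. So, to stop it from splitting on hyphens, just add a hyphen to that class and use the result in your regular expression.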

words = list(filter(None, re.split('[^a-zA-Z0-9_-]+', title)))
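
A possible variation, sketched here as an assumption rather than as part of the answer: instead of splitting on separators, match the words directly with re.findall. The pattern below keeps a hyphen only when it sits between word characters, so a free-standing hyphen is not returned as a word:

```python
import re

title = 'Decreased glucose-6-phosphate dehydrogenase activity along with oxidative stress affects visual contrast sensitivity in alcoholics.'

# Match a run of word characters, optionally followed by more runs
# that are each joined to the previous one by a single hyphen.
# The trailing period is not a word character, so it is never matched.
words = re.findall(r'\w+(?:-\w+)*', title)

for word in words:
    print(word)
```

With this pattern there is no need for filter(None, ...), since findall never returns empty strings here.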

Prevent splitting words on "-" in a sentence
