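I want to use a regular expression (re.sub() or re.findall()) to split punctuation off from the words in a Python string by inserting spaces. So "I like dog, and I like cat." should become "I like dog , and I like cat . "
I want to split off the set of punctuation characters (Python's string.punctuation = "!"#$%&'()*+,-./:;<=>?@[\]^_{|}~"), but I also have a list of specific abbreviations that I don't want to split (say list1 = ["e.g." , "Miss."]). I also don't want to break up runs of multiple punctuation marks (any two or more punctuation marks in a row, such as ... or ,") or split at apostrophes, as in I'm, you're, he's, we're.
So say I have list1 = ["e.g." , "Miss."] and string.punctuation = "!"#$%&'()*+,-./:;<=>?@[\]^_{|}~". Given the string "I'm a cat, you're a dog, e.g. a cat... really?, non-dog!!", it should become "I'm a cat , you're a dog , e.g. a cat ... really ?, non-dog !! "
Is there a regular expression that splits punctuation from a string, except for my list of specific abbreviations, runs of multiple punctuation marks, and apostrophes?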
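Answer 0 (score: 1)
The general algorithm is to process the input string from start to end, checking at each position whether the next "word" is in the exception list (if so, copy it unchanged and skip past it) or is a punctuation character (if so, put spaces around the run of punctuation).
This leads to the following function: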
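def preprocess(string, punctuation, exceptions):
    """Put spaces around runs of punctuation, leaving exceptions untouched."""
    result = ''
    i = 0
    while i < len(string):
        found_exception = False
        # An exception may only start at a word boundary.
        if i == 0 or not string[i-1].isalpha():
            for e in exceptions:
                # Case-insensitive match that must also end at a word boundary.
                if string[i:].lower().startswith(e.lower()) and (i+len(e) == len(string) or not string[i+len(e)].isalpha()):
                    result += string[i:i+len(e)]
                    i += len(e)
                    found_exception = True
                    break
        if not found_exception:
            if string[i] in punctuation:
                # Copy the whole run of punctuation, padded with one space on each side.
                result += ' '
                while i < len(string) and string[i] in punctuation:
                    result += string[i]
                    i += 1
                result += ' '
            else:
                result += string[i]
                i += 1
    # Padding next to an existing space leaves a double space; collapse it.
    return result.replace('  ', ' ')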
When run in a test harness such as the following:
examples = """
I like dog, and I like cat.
I'm a cat, you're a dog, e.g. a cat... really?, non-dog!!
"""
for line in examples.split('\n'):
    # Contractions are passed as exceptions so their apostrophes are not split.
    result = preprocess(line, "!\"#$%&'()*+,\\-./:;<=>?@[\\]^_{|}~", ["I'm", "you're", "e.g.", "he's", "we're", "Miss."])
    print(result)
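The first sentence comes out as expected:
I like dog , and I like cat .
but the second sentence splits non-dog apart:
I'm a cat , you're a dog , e.g. a cat ... really ?, non - dog !!
which shows that your specification is imprecise (unless non-dog is in the exception list; then it behaves as expected).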
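Since the question asks for a regular expression, the same scan can probably be expressed as a single re.sub() call. The following is only a sketch under some assumptions: split_punctuation is a made-up name, word boundaries are approximated with ASCII letters only, and the exceptions (including the contractions) are listed explicitly, just as in the test harness. An alternation tries the exceptions first and otherwise matches a run of punctuation, which is then padded with spaces.

```python
import re

def split_punctuation(text, punctuation, exceptions):
    """Pad runs of punctuation with spaces, leaving listed exceptions intact."""
    alternatives = []
    if exceptions:
        # Try longer exceptions first so a longer entry is not shadowed by a shorter one.
        protected = "|".join(
            re.escape(e) for e in sorted(exceptions, key=len, reverse=True)
        )
        # An exception must not be glued to letters on either side.
        alternatives.append(r"(?<![A-Za-z])(?:" + protected + r")(?![A-Za-z])")
    # Otherwise match a whole run of punctuation characters.
    alternatives.append("(?P<punct>[" + re.escape(punctuation) + "]+)")
    pattern = re.compile("|".join(alternatives), re.IGNORECASE)

    def pad(match):
        run = match.group("punct")
        # Exceptions are returned unchanged; punctuation runs get a space on each side.
        return " " + run + " " if run else match.group(0)

    # Collapse the double spaces created next to existing spaces
    # (this also collapses any multiple spaces already in the input).
    return re.sub(r" {2,}", " ", pattern.sub(pad, text))

punctuation = "!\"#$%&'()*+,-./:;<=>?@[\\]^_{|}~"
exceptions = ["I'm", "you're", "e.g.", "he's", "we're", "Miss."]

print(split_punctuation("I like dog, and I like cat.", punctuation, exceptions))
print(split_punctuation(
    "I'm a cat, you're a dog, e.g. a cat... really?, non-dog!!",
    punctuation, exceptions + ["non-dog"]))
```

With non-dog added to the exception list, the second call keeps it in one piece; without it, the hyphen is split just as in the function above.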
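Answer 1 (score: 0)
I would use a regex pattern like
myre = re.compile(r"[\.\,\:\;\?\(\)]")
to find a list of matches for all the punctuation in the string. Then loop over each match, replacing it with the same character preceded by a space.
Example:
import re
data = "this is, the data."
myre = re.compile(r"[\.\,\:\;\?\(\)]")
matches = myre.findall(data)
# Replace each distinct mark once; str.replace already handles every occurrence.
for mark in set(matches):
    data = data.replace(mark, " " + mark)
print(data)  # this is , the data .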
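The find-then-replace loop can likely be folded into one re.sub() call with a capture group. This is a minimal sketch: space_before_punctuation is a hypothetical name, and the character class is the same set of marks as in the pattern above. Note that it does not handle the abbreviation or apostrophe exceptions from the question.

```python
import re

def space_before_punctuation(text):
    # \1 re-inserts the matched punctuation mark after the added space.
    return re.sub(r"([.,:;?()])", r" \1", text)

print(space_before_punctuation("this is, the data."))  # this is , the data .
```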