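This is the simplest way to explain the problem. Here is what I am using:
re.split(r'\W', 'foo/bar spam\neggs')
-> ['foo', 'bar', 'spam', 'eggs']
And this is what I want:
someMethod(r'\W', 'foo/bar spam\neggs')
-> ['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
The reason is that I want to split a string into tokens, manipulate them, and then put the string back together again.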
Answer 0 (score: 220)
>>> re.split(r'(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
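The capturing parentheses are what make this work: when the pattern given to re.split contains a capturing group, the text matched by that group is returned as part of the result list. A minimal sketch of the split, manipulate, and rejoin round trip described in the question follows; the names plain, tokens, and rebuilt are illustrative and do not come from the original posts.

```python
import re

text = 'foo/bar spam\neggs'

# Without a capturing group the separators are discarded.
plain = re.split(r'\W', text)

# With a capturing group each separator is kept as its own list element.
tokens = re.split(r'(\W)', text)

# Manipulate only the word tokens, then reassemble the string.
rebuilt = ''.join(t.upper() if t.isalnum() else t for t in tokens)

print(plain)
print(tokens)
print(repr(rebuilt))
```

Because every character of the input ends up in exactly one token, ''.join(tokens) reproduces the original string.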
Answer 1 (score: 18)
If you are splitting on newlines, use splitlines(True).
>>> 'line 1\nline 2\nline without newline'.splitlines(True)
['line 1\n', 'line 2\n', 'line without newline']
(Not a general solution, but added here in case someone arrives without realizing that this method exists.)
Answer 2 (score: 10)
Another non-regex solution that works well on Python 3:
# Split strings and keep separator
test_strings = ['<Hello>', 'Hi', '<Hi> <Planet>', '<', '']

def split_and_keep(s, sep):
    if not s:
        return ['']  # consistent with str.split()
    # Find a replacement character that is not used in the string,
    # i.e. just use the highest character present plus one
    # Note: This fails if ord(max(s)) == 0x10FFFF (ValueError)
    p = chr(ord(max(s)) + 1)
    return s.replace(sep, sep + p).split(p)

for s in test_strings:
    print(split_and_keep(s, '<'))

# If the Unicode limit is reached it will fail explicitly
unicode_max_char = chr(1114111)
ridiculous_string = '<Hello>' + unicode_max_char + '<World>'
print(split_and_keep(ridiculous_string, '<'))
Answer 3 (score: 7)
If you have only one separator, you can use a list comprehension:
text = 'foo,bar,baz,qux'
sep = ','
Appending or prepending the separator:
result = [x+sep for x in text.split(sep)]
#['foo,', 'bar,', 'baz,', 'qux,']
# to get rid of the trailing separator (slicing removes exactly one copy of sep)
result[-1] = result[-1][:-len(sep)]
#['foo,', 'bar,', 'baz,', 'qux']
result = [sep+x for x in text.split(sep)]
#[',foo', ',bar', ',baz', ',qux']
# to get rid of the leading separator
result[0] = result[0][len(sep):]
#['foo', ',bar', ',baz', ',qux']
The separator as its own element:
result = [u for x in text.split(sep) for u in (x, sep)]
#['foo', ',', 'bar', ',', 'baz', ',', 'qux', ',']
result = result[:-1] # to get rid of the trailing separator
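Answer 4 (score: 6)
Another example: split on non-alphanumeric characters and keep the separators.
import re
a = "foo,bar@candy*ice%cream"
re.split(r'([^a-zA-Z0-9])', a)
Output:
['foo', ',', 'bar', '@', 'candy', '*', 'ice', '%', 'cream']
Explanation
re.split(r'([^a-zA-Z0-9])', a)
() <- capturing group: keeps the separators in the result
[] <- character class: matches any single character from the set inside
^a-zA-Z0-9 <- negated set: anything except letters (upper or lower case) and digits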
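One practical difference between this explicit character class and the \W shorthand used in the question is the underscore: \w includes '_', so \W does not split on it, while [^a-zA-Z0-9] does. The sketch below illustrates this with a made-up input string; sample, explicit, and shorthand are illustrative names.

```python
import re

sample = "snake_case,and-more"  # hypothetical input for illustration

# [^a-zA-Z0-9] treats the underscore as a separator ...
explicit = re.split(r'([^a-zA-Z0-9])', sample)

# ... while \W does not, because \w includes the underscore.
shorthand = re.split(r'(\W)', sample)

print(explicit)
print(shorthand)
```

Which behavior is right depends on whether identifiers such as snake_case should stay in one piece.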
Answer 5 (score: 3)
You can also split a string using a list of separator strings instead of a regular expression, like this:
def tokenizeString(aString, separators):
    # separators is a list of strings that are used to split the string.
    # Sort by ascending length: the loop below keeps the last match,
    # so the longest matching separator wins. Note that this sorts the caller's list in place.
    separators.sort(key=len)
    listToReturn = []
    i = 0
    while i < len(aString):
        theSeparator = ""
        for current in separators:
            if current == aString[i:i+len(current)]:
                theSeparator = current
        if theSeparator != "":
            # A separator starts here: emit it as its own token
            listToReturn += [theSeparator]
            i = i + len(theSeparator)
        else:
            # Ordinary character: start a new token if needed, then extend it
            if listToReturn == []:
                listToReturn = [""]
            if listToReturn[-1] in separators:
                listToReturn += [""]
            listToReturn[-1] += aString[i]
            i += 1
    return listToReturn
print(tokenizeString(aString = "\"\"\"hi\"\"\" hello + world += (1*2+3/5) '''hi'''", separators = ["'''", '+=', '+', "/", "*", "\\'", '\\"', "-=", "-", " ", '"""', "(", ")"]))
Answer 6 (score: 3)
Here is a simple .split solution that does not use a regular expression.
It is an answer to "Python split() without removing the delimiter", so it is not exactly what the original post asks for, but that other question was closed as a duplicate of this one.
def splitkeep(s, delimiter):
    split = s.split(delimiter)
    return [substr + delimiter for substr in split[:-1]] + [split[-1]]
Random tests:
import random
CHARS = [".", "a", "b", "c"]
assert splitkeep("", "X") == [""]  # 0 length test
for delimiter in ('.', '..'):
    for _ in range(100000):
        length = random.randint(1, 50)
        s = "".join(random.choice(CHARS) for _ in range(length))
        assert "".join(splitkeep(s, delimiter)) == s
Answer 7 (score: 2)
# This keeps all separators in the result
##########################################################################
import re
st = "%%(c+dd+e+f-1523)%%7"
sh = re.compile(r'[+\-/*<>%()]')
def splitStringFull(sh, st):
    ls = sh.split(st)
    lo = []
    start = 0
    for l in ls:
        if not l:
            continue
        # Search from `start` so a repeated token is not found at an earlier position
        k = st.find(l, start)
        if k > start:
            # The text between the previous token and this one is a run of separators
            lo.append(st[start:k])
        lo.append(l)
        start = k + len(l)
    if start < len(st):
        # Keep any separators that follow the last token
        lo.append(st[start:])
    return lo
#############################
li = splitStringFull(sh, st)
# li == ['%%(', 'c', '+', 'dd', '+', 'e', '+', 'f', '-', '1523', ')%%', '7']
Answer 8 (score: 1)
If you want to split a string with a regular expression and keep the separators without using a capturing group:
import re
s = "f(x) + g(y)"  # example input, not part of the original answer
def finditer_with_separators(regex, s):
    matches = []
    prev_end = 0
    for match in regex.finditer(s):
        # Emit the non-empty text between the previous match and this one
        if match.start() > prev_end:
            matches.append(s[prev_end:match.start()])
        matches.append(match.group())
        prev_end = match.end()
    if prev_end < len(s):
        matches.append(s[prev_end:])
    return matches
regex = re.compile(r"[\(\)]")
matches = finditer_with_separators(regex, s)
If you can assume the regular expression is wrapped in a capturing group:
def split_with_separators(regex, s):
    matches = list(filter(None, regex.split(s)))
    return matches
regex = re.compile(r"([\(\)])")
matches = split_with_separators(regex, s)
Both approaches also drop the empty strings, which are useless and annoying in most cases.
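To make the point about empty strings concrete: re.split returns an empty string wherever two separators are adjacent or a separator sits at the very start or end of the input. The sketch below uses a made-up input to show the raw result and the effect of filter(None, ...); the variable names are illustrative.

```python
import re

regex = re.compile(r"([()])")
s = "f(x)(y)"  # hypothetical input with adjacent and trailing separators

# The raw split contains '' between ')' and '(' and after the final ')'.
raw = regex.split(s)

# filter(None, ...) drops every falsy element, i.e. the empty strings.
cleaned = list(filter(None, raw))

print(raw)
print(cleaned)
```

Joining either list gives back the original string, so dropping the empty strings loses no text.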
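Answer 9 (score: 1)
Replace every separator (\W) with separator + new separator (\W;), then split on the new separator (;). Note that this breaks if the input already contains ';'.
import re
def split_and_keep(seperator, s):
    return re.split(';', re.sub(seperator, lambda match: match.group() + ';', s))
print(split_and_keep(r'\W', 'foo/bar spam\neggs'))
# ['foo/', 'bar ', 'spam\n', 'eggs']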
Answer 10 (score: 1)
Mind if I just leave this here?
s = 'foo/bar spam\neggs'
print(s.replace('/', '+++/+++').replace(' ', '+++ +++').replace('\n', '+++\n+++').split('+++'))
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
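Answer 11 (score: 0)
Some of the previously posted answers duplicate the delimiter or ran into other bugs in my case. You can use this function instead:
def split_and_keep_delimiter(text, delimiter):
    # `text` was named `input` in the original, which shadows the built-in
    result = list()
    while delimiter in text:
        idx = text.index(delimiter)
        # Each piece ends with the delimiter that terminated it
        result.append(text[0:idx+len(delimiter)])
        text = text[idx+len(delimiter):]
    result.append(text)
    return result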
Answer 12 (score: 0)
Install wrs ("without removing splitor"):
pip install wrs
(developed by Rao Hamza)
import wrs
text = "Now inbox “how to make spam ad” Invest in hard email marketing."
splitor = 'email | spam | inbox'
list = wrs.wr_split(splitor, text)
print(list)
Result: ['Now', 'inbox “how to make', 'spam ad” Invest in hard', 'email marketing.']
Answer 13 (score: 0)
If you are using re.split, your regular expression is built from a variable, and you have multiple separators, you can do the following:
import re
# BashSpecialParamList holds the bash special parameters,
# which serve as the separators in this example
BashSpecialParamList = ["$*", "$@", "$#", "$?", "$-", "$$", "$!", "$0"]
# aStr is the string to be split
aStr = "$a Klkjfd$0 $? $#%$*Sdfdf"
reStr = "|".join([re.escape(sepStr) for sepStr in BashSpecialParamList])
re.split(f'({reStr})', aStr)
# This gives the result:
# ['$a Klkjfd', '$0', ' ', '$?', ' ', '$#', '%', '$*', 'Sdfdf']
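One thing to watch when joining separators with "|": a regex alternation tries its branches from left to right, so if one separator is a prefix of another (for example "+" and "+="), the shorter one can win. A common workaround, sketched below under that assumption, is to sort the separators longest first before escaping them. The helper name build_split_pattern is hypothetical and not from the original answer.

```python
import re

def build_split_pattern(separators):
    """Compile a capturing alternation that prefers the longest separator."""
    # Longest first, so '+=' is tried before '+'.
    ordered = sorted(separators, key=len, reverse=True)
    return re.compile('(' + '|'.join(re.escape(sep) for sep in ordered) + ')')

pattern = build_split_pattern(['+', '+=', '-'])
parts = pattern.split('a+=b+c-d')

# For comparison: with '+' listed before '+=', the '+=' token is torn apart.
naive_parts = re.split(r'(\+|\+=)', 'a+=b')

print(parts)
print(naive_parts)
```

The separators in the answer above do not overlap in this way, so the ordering only matters for other separator sets.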
Answer 14 (score: 0)
I find this generator-based approach more satisfying:
def split_keep(string, sep):
    """Usage:
    >>> list(split_keep("a.b.c.d", "."))
    ['a.', 'b.', 'c.', 'd']
    """
    start = 0
    while True:
        end = string.find(sep, start) + 1
        if end == 0:
            break
        yield string[start:end]
        start = end
    yield string[start:]
It avoids having to work out the right regular expression and should, in theory, be fairly cheap: it creates no intermediate list and no strings other than the slices it yields, and it delegates most of the iteration work to the efficient find method.
...and in Python 3.8 it can be very short:
def split_keep(string, sep):
    start = 0
    while (end := string.find(sep, start) + 1) > 0:
        yield string[start:end]
        start = end
    yield string[start:]
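Note that the "+ 1" in both versions above advances past exactly one character, so they only behave as intended for a single-character separator. A possible variant for separators of any length is sketched below; the name split_keep_any and the empty-separator check are additions for illustration, not part of the original answer.

```python
def split_keep_any(string, sep):
    """Yield pieces of `string`; every piece but the last ends with `sep`."""
    if not sep:
        raise ValueError("empty separator")
    start = 0
    while True:
        idx = string.find(sep, start)
        if idx == -1:
            break
        # Advance by the full separator length instead of a fixed 1.
        end = idx + len(sep)
        yield string[start:end]
        start = end
    yield string[start:]

pieces = list(split_keep_any("a--b--c", "--"))
print(pieces)
```

As with the original generator, joining the yielded pieces reproduces the input string.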
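Answer 15 (score: 0)
I ran into a similar problem when trying to split a file path and struggled to find a simple answer. This worked for me, and it does not require putting the separators back into the split text:
my_path = 'folder1/folder2/folder3/file1'
import re
re.findall('[^/]+/|[^/]+', my_path)
Returns: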
['folder1/', 'folder2/', 'folder3/', 'file1']
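An alternative that gives the same result for this path is to split on the empty position right after each '/' with a lookbehind. This is only a sketch and assumes Python 3.7 or later, where re.split can split on empty matches; the variable names are illustrative.

```python
import re

my_path = 'folder1/folder2/folder3/file1'

# Approach from the answer: match each "name/" chunk, or a final bare name.
with_findall = re.findall('[^/]+/|[^/]+', my_path)

# Lookbehind variant: split at every position that directly follows a '/'.
with_lookbehind = re.split(r'(?<=/)', my_path)

print(with_findall)
print(with_lookbehind)
```

The two differ on inputs with consecutive slashes: for 'a//b' the findall pattern skips the second '/', while the lookbehind split keeps it as its own element.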
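Answer 16 (score: 0)
A lazy and simple solution
Suppose your regex pattern is split_pattern = r'(!|\?)'
First, insert a marker string after each match to act as the new separator, for example '[cut]':
new_string = re.sub(split_pattern, '\\1[cut]', your_string)
Then split on the new separator: new_string.split('[cut]')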
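Put together, the two steps can be sketched as follows. The sample sentence and the marker variable are assumptions made for illustration, and the approach only works if the marker text never occurs in the input.

```python
import re

split_pattern = r'(!|\?)'
marker = '[cut]'                    # assumed not to occur in the input
your_string = 'Really? Yes! Fine'   # hypothetical sample input

# Step 1: re-insert each matched separator (\1) followed by the marker.
new_string = re.sub(split_pattern, r'\1' + marker, your_string)

# Step 2: split on the marker; each piece keeps its trailing separator.
pieces = new_string.split(marker)

print(new_string)
print(pieces)
```

If the input ends with a separator, the last element of the result is an empty string, which can be filtered out if unwanted.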