Question

Jamies_string = "Hello there {my name is jamie}".split()

print(Jamies_string)

Output:

['Hello', 'there', '{my', 'name', 'is', 'jamie}']

Desired output:

['Hello', 'there', '{', 'my', 'name', 'is', 'jamie', '}']

I'd really like to avoid any solution that uses the re library, thanks.

Answer 1

You can first add spaces around these symbols and then call split(), for example:

>>> s = "Hello there {my name is jamie}"
>>> s.replace("{", " { ").replace("}", " } ").split()
['Hello', 'there', '{', 'my', 'name', 'is', 'jamie', '}']
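
The same idea extends to more than two symbols. A minimal sketch, assuming a hypothetical helper named `split_symbols` and a caller-supplied string of symbols (neither appears in the original answer):

```python
def split_symbols(text, symbols="{}"):
    # Hypothetical helper: pad each listed symbol with spaces, then split
    # on whitespace so every symbol becomes its own list item.
    for sym in symbols:
        text = text.replace(sym, f" {sym} ")
    return text.split()


print(split_symbols("Hello there {my name is jamie}"))
# ['Hello', 'there', '{', 'my', 'name', 'is', 'jamie', '}']
print(split_symbols("f(x) = [1, 2]", symbols="()[]=,"))
# ['f', '(', 'x', ')', '=', '[', '1', ',', '2', ']']
```

Each symbol costs one extra pass over the string, so this is probably best suited to a short, known list of symbols.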

Answer 2

One solution is to write a function that classifies characters and use it as the key function for itertools.groupby():

import itertools

WHITESPACE = 0
LETTERS = 1
DIGITS = 2
SYMBOLS = 3

def character_class(c):
    if c.isspace():
        return WHITESPACE
    if c.isalpha():
        return LETTERS
    if c.isdigit():
        return DIGITS
    return SYMBOLS

s = "Hello there {my name is jamie}"
tokens = [
    "".join(chars)
    for cls, chars in itertools.groupby(s, character_class)
    if cls != WHITESPACE
]
print(tokens)

This prints:

['Hello', 'there', '{', 'my', 'name', 'is', 'jamie', '}']
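
One thing to be aware of with this grouping: adjacent symbols fall into the same class, so an input like "{{my name}}" would give '{{' and '}}' as single tokens. If every symbol should be its own list item, one possible variation is to give each symbol a key that is unique to its position. This is only a sketch; the name `tokenize_each_symbol` and the tuple keys are assumptions, not part of the original answer:

```python
import itertools


def tokenize_each_symbol(s):
    # Hypothetical variant: letters and digits still group into runs,
    # but each symbol gets a position-specific key so it never merges
    # with a neighbouring symbol.
    def key(pair):
        index, c = pair
        if c.isspace():
            return ("space",)
        if c.isalpha():
            return ("letters",)
        if c.isdigit():
            return ("digits",)
        return ("symbol", index)

    return [
        "".join(c for _, c in group)
        for k, group in itertools.groupby(enumerate(s), key)
        if k[0] != "space"
    ]


print(tokenize_each_symbol("{{my name}}"))
# ['{', '{', 'my', 'name', '}', '}']
```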

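You clarified that you want to avoid regular expressions for performance reasons. The approach in this answer is almost certainly slower than a properly used regular expression. However, I don't think your project is at a stage where you need to worry about performance.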

Answer 3

The string you're using looks like a Python format string. If that's the case, you can use the Formatter class to parse it:

from string import Formatter


def solve(s):
    for f in Formatter().parse(s):
        yield from f[0].split()
        if f[1]:
            yield from ['{'] + f[1].split() + ['}']

Demo:

>>> list(solve("Hello there {my name is jamie}"))
['Hello', 'there', '{', 'my', 'name', 'is', 'jamie', '}']

>>> list(solve("Hello there {my name is jamie} {hello world} end."))
['Hello', 'there', '{', 'my', 'name', 'is', 'jamie', '}', '{', 'hello', 'world', '}', 'end.']
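
To see why solve() indexes f[0] and f[1], it may help to look at what Formatter().parse() yields: tuples of (literal_text, field_name, format_spec, conversion). The sketch below just prints the first two items for a sample string; the variable names `parsed` and `unbalanced_raises` are illustrative only:

```python
from string import Formatter

parsed = list(Formatter().parse("Hello there {my name is jamie} end."))
for literal_text, field_name, format_spec, conversion in parsed:
    # literal_text is the text before a field; field_name is the text
    # inside the braces, or None when no field follows.
    print(repr(literal_text), repr(field_name))
# 'Hello there ' 'my name is jamie'
# ' end.' None

# Because the input is parsed as a format string, unbalanced braces
# are rejected rather than tokenized.
try:
    list(Formatter().parse("Hello {"))
    unbalanced_raises = False
except ValueError:
    unbalanced_raises = True
print(unbalanced_raises)
```

So this approach is probably only a good fit when the input really does follow format-string rules.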

Answer 4

One way (not as clean as the other answers, but it works):

def tokenize(string):
    WHITESPACE = 0 #Borrowed from Sven's answer
    LETTERS = 1
    DIGITS = 2
    SYMBOLS = 3
    def character_class(c):
        if c.isspace():
            return WHITESPACE
        elif c.isalpha():
            return LETTERS
        elif c.isdigit():
            return DIGITS
        return SYMBOLS

    lastType = None #No chunk in progress yet; also avoids indexing into an empty string
    chunk = ""

    for char in string:
        charType = character_class(char)
        if charType == WHITESPACE:
            if chunk: #Only yield if non-empty
                yield chunk
            chunk = ""
            lastType = None #The next non-whitespace character starts a new chunk; no lookahead, so trailing whitespace is safe
            continue #Don't add to chunk
        elif charType != lastType: #Different type
            if chunk: #Only yield if non-empty
                yield chunk
            chunk = ""
            lastType = charType
        chunk += char
    if chunk:
        yield chunk
print(list(tokenize("Hello there {my name is jamie}")))

Sample output:

['Hello', 'there', '{', 'my', 'name', 'is', 'jamie', '}']

This is more or less doing by hand what itertools.groupby does.

Answer 5

Make one pass over the string, put spaces around every punctuation character, then split on whitespace.

>>> import string
>>> s = "Hello there {my name is jamie}"
>>> s = ''.join(c if c.isalnum() or c.isspace() else ' {} '.format(c) for c in s)
>>> s.split()
['Hello', 'there', '{', 'my', 'name', 'is', 'jamie', '}']
>>>

Expanding the third line a bit:

a = []
for c in s:
    if not c.isalnum() and not c.isspace():
        c = ' ' + c + ' '
    a.append(c)

s = ''.join(a)
s.split()
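
A possible alternative along the same lines is to build a translation table once and let str.translate() do the padding. This is a sketch under the assumption that only ASCII punctuation (string.punctuation) needs to be separated; the names `PAD_PUNCTUATION` and `split_punctuation` are hypothetical:

```python
import string

# Map every ASCII punctuation character to itself padded with spaces.
# Assumption: characters outside string.punctuation are left untouched.
PAD_PUNCTUATION = {ord(c): f" {c} " for c in string.punctuation}


def split_punctuation(text):
    # Hypothetical helper: pad punctuation in one pass, then split.
    return text.translate(PAD_PUNCTUATION).split()


print(split_punctuation("Hello there {my name is jamie}"))
# ['Hello', 'there', '{', 'my', 'name', 'is', 'jamie', '}']
```

Unlike the isalnum() check above, this version does not separate non-ASCII symbols, and it treats the underscore as punctuation.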

How do I split a string so that symbols become their own list items, without using regular expressions?
