Question

string.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

Answer 1

It is highly probable that re.finditer uses fairly minimal memory overhead.

import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

>>> list( split_iter("A programmer's RegEx test.") )
['A', "programmer's", 'RegEx', 'test']

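Edit: I have just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (1 GB or so), then iterated through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not result in a noticeable growth of memory (that is, if there was any growth, it was far less than the 1 GB string).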
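A rough way to repeat that kind of measurement on a smaller scale is the standard tracemalloc module. The sketch below is only an illustration under my own assumptions (the helper names peak_bytes, count_lazy and count_eager are made up, and the test string is far smaller than 1 GB); it compares the peak allocation while consuming the generator with the peak while building the list from str.split().

```python
import re
import tracemalloc

def split_iter(string):
    return (m.group(0) for m in re.finditer(r"[A-Za-z']+", string))

def peak_bytes(consume, text):
    """Return the peak number of bytes allocated while consume(text) runs."""
    tracemalloc.start()
    consume(text)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

def count_lazy(text):
    # Tokens are produced and discarded one at a time.
    return sum(1 for _ in split_iter(text))

def count_eager(text):
    # The whole list of tokens exists at once.
    return len(text.split())

text = "word " * 200_000  # allocated before tracing starts, so not counted
lazy_peak = peak_bytes(count_lazy, text)
eager_peak = peak_bytes(count_eager, text)
print("generator peak:", lazy_peak, "bytes")
print("str.split peak:", eager_peak, "bytes")
```

The absolute numbers depend on the interpreter and platform; only the relative difference is of interest.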

Answer 2

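The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids a lot of memory use, and avoids the overhead of a regexp when it is not needed.

[Edit 2016-8-2: updated this to optionally support regex separators]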

import re

def isplit(source, sep=None, regex=False):
    """
    generator version of str.split()

    :param source:
        source string (unicode or bytes)

    :param sep:
        separator to split on.

    :param regex:
        if True, will treat sep as regular expression.

    :returns:
        generator yielding elements of string.
    """
    if sep is None:
        # mimic default python behavior
        source = source.strip()
        sep = "\\s+"
        if isinstance(source, bytes):
            sep = sep.encode("ascii")
        regex = True
    if regex:
        # version using re.finditer()
        if not hasattr(sep, "finditer"):
            sep = re.compile(sep)
        start = 0
        for m in sep.finditer(source):
            idx = m.start()
            assert idx >= start
            yield source[start:idx]
            start = m.end()
        yield source[start:]
    else:
        # version using str.find(), less overhead than re.finditer()
        sepsize = len(sep)
        start = 0
        while True:
            idx = source.find(sep, start)
            if idx == -1:
                yield source[start:]
                return
            yield source[start:idx]
            start = idx + sepsize

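This can be used like you want...

>>> print(list(isplit("abcb", "b")))
['a', 'c', '']

While there is a little bit of cost seeking within the string each time find() or slicing is performed, this should be minimal since strings are represented as contiguous arrays in memory.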
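One practical consequence of the offset-based approach is that a caller can stop early and never pay for the rest of the string. A minimal sketch, using a stripped-down find-based splitter (find_split is a hypothetical name for illustration, not the isplit above, and it assumes a non-empty separator) together with itertools.islice:

```python
from itertools import islice

def find_split(source, sep):
    """Simplified stand-in for the str.find() branch (non-empty sep assumed)."""
    start = 0
    while True:
        idx = source.find(sep, start)   # search resumes at the offset, no copying
        if idx == -1:
            yield source[start:]
            return
        yield source[start:idx]
        start = idx + len(sep)

big = ",".join(str(n) for n in range(1_000_000))

# Only the first three fields are ever located and sliced out.
first_three = list(islice(find_split(big, ","), 3))
print(first_three)  # ['0', '1', '2']
```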

Answer 3

Here is a generator version of split(), implemented via re.search(), that does not have the problem of allocating too many substrings.

import re

def itersplit(s, sep=None):
    exp = re.compile(r'\s+' if sep is None else re.escape(sep))
    pos = 0
    while True:
        m = exp.search(s, pos)
        if not m:
            if pos < len(s) or sep is not None:
                yield s[pos:]
            break
        if pos < m.start() or sep is not None:
            yield s[pos:m.start()]
        pos = m.end()


sample1 = "Good evening, world!"
sample2 = " Good evening, world! "
sample3 = "brackets][all][][over][here"
sample4 = "][brackets][all][][over][here]["

assert list(itersplit(sample1)) == sample1.split()
assert list(itersplit(sample2)) == sample2.split()
assert list(itersplit(sample3, '][')) == sample3.split('][')
assert list(itersplit(sample4, '][')) == sample4.split('][')

Edit: corrected the handling of surrounding whitespace when no separator is given.

Answer 4

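I did some performance testing on the various methods proposed (I won't repeat them here). Some results:

str.split (default) = 0.3461570239996945
manual search (by character) (one of Dave Webb's answers) = 0.8260340550004912
re.finditer (ninjagecko's answer) = 0.698872097000276
str.find (one of Eli Collins's answers) = 0.7230395330007013
itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
str.split(..., maxsplit=1) recursion = N/A†

† The recursion answers (string.split with maxsplit = 1) fail to complete in a reasonable time. Given string.split's speed they may work better on shorter strings, but then I can't see the use case for short strings, where memory isn't an issue anyway.

Tested using timeit on: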

the_text = "100 " * 9999 + "100"

def test_function( method ):
    def fn( ):
        total = 0

        for x in method( the_text ):
            total += int( x )

        return total

    return fn

This raises another question: why is string.split so much faster despite its memory usage?

Answer 5

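Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I'll just copy the docstring of the main str_split function: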

str_split(s, *delims, empty=None)

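Split the string s by the rest of the arguments, possibly omitting empty parts (the empty keyword argument is responsible for that). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. empty is then True by default.

str_split('[]aaa[][]bb[c', '[]')
    -> '', 'aaa', '', 'bb[c'
str_split('[]aaa[][]bb[c', '[]', empty=False)
    -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.

str_split('aaa, bb : c;', ' ', ',', ':', ';')
    -> 'aaa', 'bb', 'c'
str_split('aaa, bb : c;', *' ,:;', empty=True)
    -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except this function is a generator.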

str_split('aaa\\t  bb c \\n')
    -> 'aaa', 'bb', 'c'

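This function works in Python 3, and an easy, though quite ugly, fix can be applied to make it work in both 2 and 3 versions: the first lines of str_split have to change, because Python 2 does not accept a keyword-only argument such as empty after *delims. The implementation for Python 3: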

import string

def _str_split_chars(s, delims):
    "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
    start = 0
    for i, c in enumerate(s):
        if c in delims:
            yield s[start:i]
            start = i+1
    yield s[start:]

def _str_split_chars_ne(s, delims):
    "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
    start = 0
    in_s = False
    for i, c in enumerate(s):
        if c in delims:
            if in_s:
                yield s[start:i]
                in_s = False
        else:
            if not in_s:
                in_s = True
                start = i
    if in_s:
        yield s[start:]


def _str_split_word(s, delim):
    "Split the string `s` by the string `delim`"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    yield s[start:]

def _str_split_word_ne(s, delim):
    "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
    dlen = len(delim)
    start = 0
    try:
        while True:
            i = s.index(delim, start)
            if start!=i:
                yield s[start:i]
            start = i+dlen
    except ValueError:
        pass
    if start<len(s):
        yield s[start:]


def str_split(s, *delims, empty=None):
    """\
Split the string `s` by the rest of the arguments, possibly omitting
empty parts (`empty` keyword argument is responsible for that).
This is a generator function.

When only one delimiter is supplied, the string is simply split by it.
`empty` is then `True` by default.
    str_split('[]aaa[][]bb[c', '[]')
        -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False)
        -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by longest
possible sequences of those delimiters by default, or, if `empty` is set to
`True`, empty strings between the delimiters are also included. Note that
the delimiters in this case may only be single characters.
    str_split('aaa, bb : c;', ' ', ',', ':', ';')
        -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True)
        -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, `string.whitespace` is used, so the effect
is the same as `str.split()`, except this function is a generator.
    str_split('aaa\\t  bb c \\n')
        -> 'aaa', 'bb', 'c'
"""
    if len(delims)==1:
        f = _str_split_word if empty is None or empty else _str_split_word_ne
        return f(s, delims[0])
    if len(delims)==0:
        delims = string.whitespace
    # validate before joining: once joined into one string, every item has length 1
    if any(len(d)>1 for d in delims):
        raise ValueError("Only 1-character multiple delimiters are supported")
    delims = set(delims) if len(delims)>=4 else ''.join(delims)
    f = _str_split_chars if empty else _str_split_chars_ne
    return f(s, delims)

Answer 6

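~~I don't see any obvious benefit to a generator version of split(). The generator object is going to have to contain the whole string to iterate over, so you're not going to save any memory by having a generator.~~

If you wanted to write one, it would be fairly easy though: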

import string

def gsplit(s,sep=string.whitespace):
    word = []

    for c in s:
        if c in sep:
            if word:
                yield "".join(word)
                word = []
        else:
            word.append(c)

    if word:
        yield "".join(word)

Answer 7

If you would also like to be able to read an iterator (as well as return one), try this:

import itertools as it

def iter_split(string, sep=None):
    sep = sep or ' '
    groups = it.groupby(string, lambda s: s != sep)
    return (''.join(g) for k, g in groups if k)

Usage

>>> list(iter_split(iter("Good evening, world!")))
['Good', 'evening,', 'world!']
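Two things are worth knowing about the groupby approach, as far as I can tell: it compares one element at a time, so only single-character separators work, and a run of separators collapses into one break, which differs from str.split(sep). A small sketch showing both on a character stream (char_stream is a made-up source for illustration):

```python
import itertools as it

def iter_split(chars, sep=None):
    sep = sep or ' '
    groups = it.groupby(chars, lambda c: c != sep)
    return (''.join(g) for k, g in groups if k)

def char_stream():
    # Hypothetical source that hands out characters one at a time,
    # as a socket or file reader might.
    for chunk in ("a,,b", ",c"):
        for c in chunk:
            yield c

print(list(iter_split(char_stream(), ",")))  # ['a', 'b', 'c']
print("a,,b,c".split(","))                   # ['a', '', 'b', 'c']
```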

Answer 8

I wrote a version of @ninjagecko's answer that behaves more like string.split (i.e. whitespace delimited by default, and you can specify a delimiter).

import re

def isplit(string, delimiter = None):
    """Like string.split but returns an iterator (lazy)

    Multiple character delimiters are not handled.
    """

    if delimiter is None:
        # Whitespace delimited by default
        delim = r"\s"

    elif len(delimiter) != 1:
        raise ValueError("Can only handle single character delimiters",
                        delimiter)

    else:
        # Escape, in case it's "\", "*" etc.
        delim = re.escape(delimiter)

    return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both Python 3 and Python 2):

# Wrapper to make it a list
def helper(*args,  **kwargs):
    return list(isplit(*args, **kwargs))

# Normal delimiters
assert helper("1,2,3", ",") == ["1", "2", "3"]
assert helper("1;2;3,", ";") == ["1", "2", "3,"]
assert helper("1;2 ;3,  ", ";") == ["1", "2 ", "3,  "]

# Whitespace
assert helper("1 2 3") == ["1", "2", "3"]
assert helper("1\t2\t3") == ["1", "2", "3"]
assert helper("1\t2 \t3") == ["1", "2", "3"]
assert helper("1\n2\n3") == ["1", "2", "3"]

# Surrounding whitespace dropped
assert helper(" 1 2  3  ") == ["1", "2", "3"]

# Regex special characters
assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
assert helper(r"1*2*3", "*") == ["1", "2", "3"]

# No multi-char delimiters allowed
try:
    helper(r"1,.2,.3", ",.")
    assert False
except ValueError:
    pass

Python's regex module says that it does "the right thing" for unicode whitespace, but I haven't actually tested it.

Also available as a gist.
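The Unicode whitespace remark can be checked quickly. The sketch below assumes Python 3 and str input (bytes patterns or the re.ASCII flag would behave differently) and uses a few non-ASCII space characters that I picked for illustration:

```python
import re

# em space, no-break space, ideographic space
text = "1\u20032\u00a03\u30004"

tokens = [m.group(0) for m in re.finditer(r"[^\s]+", text)]
print(tokens)                  # ['1', '2', '3', '4']
print(tokens == text.split())  # True
```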

Answer 9

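No, but it should be easy enough to write one using itertools.takewhile().

EDIT:

Very simple, half-broken implementation:

import itertools
import string

def isplitwords(s):
    i = iter(s)
    while True:
        r = []
        for c in itertools.takewhile(lambda x: x not in string.whitespace, i):
            r.append(c)
        else:
            if r:
                yield ''.join(r)
                continue
            else:
                # a bare return ends the generator; raising StopIteration
                # here becomes a RuntimeError on Python 3.7+ (PEP 479)
                return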

Answer 10

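I wanted to show how to use the finditer solution to get a generator of matches for the given delimiters, and then use the pairwise recipe from itertools to build a previous/next iteration that yields the actual words, as the original split method does.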

from more_itertools import pairwise
import re

string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
delimiter = " "
# split according to the given delimiter including segments beginning at the beginning and ending at the end
for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
    print(string[prev.end(): curr.start()])

Notes:

I use prev & curr instead of prev & next because overriding the built-in next in Python is a very bad idea
This is quite efficient
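For anyone who prefers to avoid the third-party import, the same idea can be wrapped in a generator using only the standard library. This is a sketch, not the code above: split_between_matches is a hypothetical name, pairwise is the classic tee-based recipe, and I added re.escape so that delimiter characters such as "]" or "^" are safe inside the character class.

```python
import re
from itertools import tee

def pairwise(iterable):
    """s -> (s0, s1), (s1, s2), ... (the itertools recipe)."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def split_between_matches(string, delimiter):
    # The empty matches at ^ and $ act as sentinels for the first and last word.
    pattern = "^|[{0}]+|$".format(re.escape(delimiter))
    for prev, curr in pairwise(re.finditer(pattern, string)):
        yield string[prev.end():curr.start()]

words = list(split_between_matches("dasdha hasud hasuid", " "))
print(words)  # ['dasdha', 'hasud', 'hasuid']
```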

Answer 11

more_itertools.split_at offers an analog to str.split for iterators.

>>> import more_itertools as mit


>>> list(mit.split_at("abcdcba", lambda x: x == "b"))
[['a'], ['c', 'd', 'c'], ['a']]

>>> "abcdcba".split("b")
['a', 'cdc', 'a']

more_itertools is a third-party package.
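If installing a package is not an option, the behaviour shown above is easy to approximate. The function below is my own stand-in (split_at_sketch is a made-up name and this is not the library's implementation); it yields lists, so joining is needed to get strings back.

```python
def split_at_sketch(iterable, pred):
    """Yield lists of items, breaking (and dropping the item) wherever pred(item) is true."""
    buf = []
    for item in iterable:
        if pred(item):
            yield buf
            buf = []
        else:
            buf.append(item)
    yield buf

chunks = split_at_sketch(iter("abcdcba"), lambda x: x == "b")
print(["".join(chunk) for chunk in chunks])  # ['a', 'cdc', 'a']
```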

Answer 12

The dumbest method, without regex / itertools:

def isplit(text, split='\n'):
    while text != '':
        end = text.find(split)

        if end == -1:
            yield text
            text = ''
        else:
            yield text[:end]
            # skip the whole separator, not just one character
            text = text[end + len(split):]

Answer 13

Very old question, but here is my humble contribution with an efficient algorithm:

from typing import Iterable

def str_split(text: str, separator: str) -> Iterable[str]:
    i = 0
    n = len(text)
    while i <= n:
        j = text.find(separator, i)
        if j == -1:
            j = n
        yield text[i:j]
        # advance past the whole separator so multi-character separators work
        i = j + len(separator)

Answer 14

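def split_generator(f, s):
    """
    f is a string, s is the substring we split on.
    This produces a generator rather than a possibly
    memory intensive list.
    """
    if not s:
        raise ValueError("empty separator")
    i = 0
    j = 0
    while j < len(f):
        if i >= len(f):
            yield f[j:]
            j = i
        elif not f.startswith(s, i):
            i = i + 1
        else:
            # yield the piece itself, not a one-element list
            yield f[j:i]
            j = i + len(s)
            i = i + len(s)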

Answer 15

import re

def isplit(text, sep=None, maxsplit=-1):
    if not isinstance(text, (str, bytes)):
        raise TypeError(f"requires 'str' or 'bytes' but received a '{type(text).__name__}'")
    if sep in ('', b''):
        raise ValueError('empty separator')

    if maxsplit == 0 or not text:
        yield text
        return

    regex = (
        re.escape(sep) if sep is not None
        else [br'\s+', r'\s+'][isinstance(text, str)]
    )
    # note: re.split() builds the complete list before anything is yielded
    yield from re.split(regex, text, maxsplit=max(0, maxsplit))
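Since re.split() returns a full list, the function above has a generator interface but not generator memory behaviour. A lazy variant that still honours maxsplit could look like the sketch below. It is only a sketch under assumptions: isplit_lazy is a hypothetical name, and it handles only an explicit, non-empty str separator (not the sep=None whitespace case).

```python
import re

def isplit_lazy(text, sep, maxsplit=-1):
    """Lazy split on an explicit separator; maxsplit < 0 means no limit."""
    if not sep:
        raise ValueError("empty separator")
    start = 0
    splits = 0
    for m in re.finditer(re.escape(sep), text):
        if 0 <= maxsplit <= splits:
            break                      # limit reached: the rest is one piece
        yield text[start:m.start()]
        start = m.end()
        splits += 1
    yield text[start:]

print(list(isplit_lazy("a,b,c", ",")))              # ['a', 'b', 'c']
print(list(isplit_lazy("a,b,c", ",", maxsplit=1)))  # ['a', 'b,c']
```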

Answer 16

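For me, at least, this is needed for files used as generators.

This is the version I wrote in preparation for some huge files with blocks of text separated by empty lines (it would need thorough testing of corner cases if you use it in a production system):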

from __future__ import print_function

def isplit(iterable, sep=None):
    r = ''
    for c in iterable:
        r += c
        if sep is None:
            if not c.strip():
                r = r[:-1]
                if r:
                    yield r
                    r = ''                    
        elif r.endswith(sep):
            r=r[:-len(sep)]
            yield r
            r = ''
    if r:
        yield r


def read_blocks(filename):
    """read a file as a sequence of blocks separated by empty line"""
    with open(filename) as ifh:
        for block in isplit(ifh, '\n\n'):
            yield block.splitlines()           

if __name__ == "__main__":
    for lineno, block in enumerate(read_blocks("logfile.txt"), 1):
        print(lineno,':')
        print('\n'.join(block))
        print('-'*40)

    print('Testing skip with None.')
    for word in isplit('\tTony   \t  Jarkko \n  Veijalainen\n'):
        print(word)
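The block-reading part can be tried without a file on disk by feeding it an in-memory stream. The sketch below is a simplified alternative, not the isplit above: iter_blocks is a hypothetical helper that groups lines directly instead of concatenating them and testing endswith, and it treats any whitespace-only line as a block separator.

```python
import io

def iter_blocks(lines):
    """Yield lists of lines, one list per block separated by blank lines."""
    block = []
    for line in lines:
        if line.strip():
            block.append(line.rstrip("\n"))
        elif block:
            yield block
            block = []
    if block:
        yield block

sample = io.StringIO("a\nb\n\nc\n\n\nd\n")
blocks = list(iter_blocks(sample))
print(blocks)  # [['a', 'b'], ['c'], ['d']]
```

Any iterable of lines works the same way, including an open file object.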

Answer 17

Here is a simple answer:

tshark

Is there a generator version of `string.split()` in Python?

17 answers: