Question

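Does Python have a built-in way (meaning in the standard library) to split a string that produces an iterator rather than a list? I have in mind working on very long strings without needing to consume most of the string.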

Answer 1

Not by splitting strings directly, but the re module has re.finditer() (and a corresponding finditer() method on any compiled regular expression).

@Zero asked for an example:

>>> import re
>>> s = "The quick    brown\nfox"
>>> for m in re.finditer(r'\S+', s):
...     print(m.span(), m.group(0))
... 
(0, 3) The
(4, 9) quick
(13, 18) brown
(19, 22) fox
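
Because re.finditer() yields matches one at a time, it can be stopped early. The following is a minimal sketch of that idea, not part of the original answer: the helper name first_words is hypothetical, and itertools.islice is used to cut the iteration off after n tokens.

```python
import re
from itertools import islice

def first_words(text, n):
    """Return the first n whitespace-separated tokens of text (hypothetical helper)."""
    # re.finditer yields match objects lazily; islice stops after n of them,
    # so the remainder of the string is never scanned.
    matches = re.finditer(r'\S+', text)
    return [m.group(0) for m in islice(matches, n)]

long_text = "alpha beta gamma " * 100_000
print(first_words(long_text, 3))
```

Only as much of long_text is scanned as is needed to find the first three tokens.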

Answer 2

Like S.Lott, I'm not quite sure what you want. Here is some code that may help:

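s = "This is a string."
# Iterating over a string yields one character at a time.
for character in s:
    print(character)
# Note that split() builds the whole list before the loop starts.
for word in s.split(' '):
    print(word)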

There are also s.index() and s.find() for finding the next occurrence of a character.

Later: OK, something like this.

>>> def tokenizer(s, c):
...     i = 0
...     while True:
...         try:
...             j = s.index(c, i)
...         except ValueError:
...             yield s[i:]
...             return
...         yield s[i:j]
...         i = j + 1
... 
>>> for w in tokenizer(s, ' '):
...     print(w)
... 
This
is
a
string.
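
The tokenizer above advances by one character past each hit (i = j + 1), so it assumes a single-character separator. As a sketch under that observation, here is a variant that skips the full separator length; the name iter_split is hypothetical, and str.find() is used in place of index() to avoid the try/except.

```python
def iter_split(text, sep):
    """Lazily yield the pieces of text separated by sep, like text.split(sep)."""
    if not sep:
        raise ValueError("empty separator")
    start = 0
    while True:
        end = text.find(sep, start)  # -1 when no further separator exists
        if end == -1:
            yield text[start:]       # the final piece, possibly empty
            return
        yield text[start:end]
        start = end + len(sep)       # skip the whole separator, not just one character

sample = "a--b----c"
print(list(iter_split(sample, "--")) == sample.split("--"))
```

Like str.split() with an explicit separator, this keeps empty pieces between adjacent separators.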

Answer 3

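If you don't need to consume the whole string, that's because you're looking for something specific, right? Then just look for that, with re or .find(), instead of splitting. That way you can find the part of the string you're interested in and split only that.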
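
A rough illustration of that approach, assuming the interesting part is introduced by a known marker and ends at the next newline. The helper fields_after, the "values:" marker and the sample data are all made up for this sketch.

```python
def fields_after(text, marker, sep=","):
    """Find marker in text and split only the rest of that line (hypothetical helper)."""
    pos = text.find(marker)
    if pos == -1:
        return None                  # marker not present
    start = pos + len(marker)
    end = text.find("\n", start)     # the interesting part ends at the next newline
    if end == -1:
        end = len(text)
    # Only this small slice is split, not the whole string.
    return text[start:end].split(sep)

log = "noise\n" * 1000 + "values:1,2,3\n" + "more noise\n" * 1000
print(fields_after(log, "values:"))
```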

Answer 4

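You could use something like SPARK (which has been absorbed into the Python distribution itself, though it is not importable from the standard library), but ultimately it uses regular expressions as well, so Duncan's answer would probably serve you just as well if the task is as simple as "splitting on whitespace".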

The other, far more arduous option is to write your own Python module in C, if you really want speed, but that is of course a much larger time investment.

Answer 5

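Look at itertools. It contains things like takewhile, islice and groupby that let you slice an iterable (and a string is iterable) into another iterable, based on either indexes or a boolean condition of sorts.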
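
One way this could look, as a sketch rather than a definitive recipe: groupby clusters consecutive characters by whether they are whitespace, and islice takes only the first few words. The function name iter_words is an assumption made for this example.

```python
from itertools import groupby, islice

def iter_words(chars):
    """Lazily yield whitespace-separated words from any iterable of characters."""
    # groupby starts a new group each time str.isspace changes value.
    for is_space, group in groupby(chars, key=str.isspace):
        if not is_space:
            yield "".join(group)

text = "The quick    brown\nfox"
# Take just the first two words; the rest of the string is not examined.
print(list(islice(iter_words(text), 2)))
```

This works character by character in pure Python, so it is lazy but likely slower than a regex-based approach on large inputs.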

Answer 6

There is no built-in iterator-based analog of str.split. Depending on your needs, you could make a list iterator:

iterator = iter("abcdcba".split("b"))
iterator
# <list_iterator at 0x49159b0>
next(iterator)
# 'a'

However, a tool from this third-party library likely offers what you want: more_itertools.split_at. See also this post for an example.

Answer 7

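Here is an isplit function, which behaves much like split. You can turn off the regex syntax with the regex argument. It uses the re.finditer function and returns the strings in between the matches.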

import re

def isplit(s, splitter=r'\s+', regex=True):
    if not regex:
        splitter = re.escape(splitter)

    start = 0

    for m in re.finditer(splitter, s):
        begin, end = m.span()
        if begin != start:
            yield s[start:begin]
        start = end

    if s[start:]:
        yield s[start:]


_examples = ['', 'a', 'a b', ' a  b c ', '\na\tb ']

def test_isplit():
    for example in _examples:
        assert list(isplit(example)) == example.split(), 'Wrong for {!r}: {} != {}'.format(
            example, list(isplit(example)), example.split()
        )

Split a string into an iterator
