Question

我有一个像这样的字符串：

this is "a test"

我正在尝试用Python编写一些东西，用空格分隔它，同时忽略引号内的空格。我正在寻找的结果是：

['this','is','a test']

PS。我知道你会问“如果报价中有引号会发生什么，那么，在我的申请中，这将永远不会发生。

Answer 1

您希望从shlex模块中拆分。

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

这应该完全符合您的要求。

Answer 2

查看shlex模块，尤其是shlex.split。

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']

Answer 3

我看到这里的正则表达式看起来很复杂和/或错误。这让我感到惊讶，因为正则表达式语法可以很容易地描述“空白或者被引用的东西包围”，并且大多数正则表达式引擎（包括Python）可以在正则表达式上分割。所以如果你要使用正则表达式，为什么不直接说出你的意思呢？：

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

说明：

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

shlex可能会提供更多功能。

Answer 4

根据您的使用情况，您可能还想查看csv模块：

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print row

输出：

['this', 'is', 'a string']
['and', 'more', 'stuff']

Answer 5

我使用shlex.split来处理70,000,000行鱿鱼日志，它太慢了。所以我改用了。

如果你有shlex的性能问题，请试试这个。

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)

Answer 6

由于此问题标有正则表达式，我决定尝试使用正则表达式方法。我首先用\ x00替换引号部分中的所有空格，然后用空格分割，然后将\ x00替换回每个部分中的空格。

两个版本都做同样的事情，但拆分器比splitter2更具可读性。

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print splitter2(s)

Answer 7

由于性能原因，self.assertTrue('ArgumentsException' in str(e.stderr))似乎更快。这是我使用最少贪心运算符保留外引号的解决方案：

re

结果：

re.findall("(?:\".*?\"|\S)+", s)

它将诸如['this', 'is', '"a test"']之类的结构放在一起，因为这些标记之间没有空格。如果字符串包含转义字符，则可以这样进行匹配：

aaa"bla blub"bbb

请注意，它也通过模式的>>> a = "She said \"He said, \\\"My name is Mark.\\\"\"" >>> a 'She said "He said, \\"My name is Mark.\\""' >>> for i in re.findall("(?:\".*?[^\\\\]\"|\S)+", a): print(i) ... She said "He said, \"My name is Mark.\""部分与空字符串""匹配。

Answer 8

要保留引号，请使用此功能：

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args

Answer 9

为了解决一些Python 2版本中的unicode问题，我建议：

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]

Answer 10

不同答案的速度测试：

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop

Answer 11

已接受的shlex方法的主要问题在于，它不会忽略引用的子字符串之外的转义字符，并且在某些特殊情况下会产生一些意想不到的结果。

我有以下用例，其中我需要一个分割函数，该函数对输入字符串进行分割，以便保留单引号或双引号的子字符串，并能够在这样的子字符串中转义引号。无引号的字符串中的引号不应与其他任何字符区别对待。带有预期输出的一些示例测试用例：

 input string        | expected output
===============================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]

我最终得到了以下函数来拆分字符串，以便所有输入字符串的预期输出结果：

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

以下测试应用程序将检查其他方法（目前为shlex和csv）和自定义拆分实现的结果：

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

输出：

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

shlex: 0.281ms per iteration
csv: 0.030ms per iteration
re: 0.049ms per iteration

因此，性能要比shlex好得多，并且可以通过预编译正则表达式来进一步提高性能，在这种情况下，它的性能将优于csv方法。

Answer 12

嗯，似乎无法找到“回复”按钮...无论如何，这个答案是基于Kate的方法，但正确地将字符串拆分为包含转义引号的子字符串，并删除了引号的开头和结尾引号子：

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

这适用于像'This is " a \\\"test\\\"\\\'s substring"'这样的字符串（不幸的是，疯狂的标记是阻止Python删除转义的必要条件）。

如果不想在返回列表中的字符串中生成转义符，则可以使用该函数稍微更改的版本：

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

Answer 13

上面讨论的shlex的unicode问题（最佳答案）似乎在2.7.2+中间接解决（间接） http://bugs.python.org/issue6988#msg146200

（单独回答，因为我无法发表评论）

Answer 14

作为一种选择，尝试tssplit：

CustomScrollView(
  slivers: <Widget>[
    SliverAppBar(
      title: Text('Test'),
    ),
    SliverList(
      delegate: SliverChildBuilderDelegate(
        (BuildContext context, int index) {
          return Container(
            child: Container(
              child: Column(
                children: <Widget>[
                  Container(
                    child: Text(
                      'Index is $index'.toUpperCase(),
                    ),
                    alignment: Alignment.centerLeft,
                    padding: EdgeInsets.only(bottom: 10.0),
                  ),
                  Container(height: 200.0)
                ],
              ),
              constraints: BoxConstraints.tightForFinite(width: 200),
              decoration: BoxDecoration(
                color:
                    index % 2 == 0 ? Color(0XFF45766E) : Color(0XFFECB141),
                borderRadius: BorderRadius.only(
                  topLeft: Radius.circular(40.0),
                  topRight: Radius.circular(40.0),
                ),
              ),
              padding: EdgeInsets.only(
                left: 20.0,
                top: 10.0,
              ),
            ),
            decoration: BoxDecoration(
              color: index % 2 == 0 ? Color(0XFFECB141) : Color(0XFF45766E),
            ),
          );
        },
      ),
    ),
  ],
);

Answer 15

如果你不关心子字符串而不是简单的

>>> 'a short sized string with spaces '.split()

性能：

>>> s = " ('a short sized string with spaces '*100).split() "
>>> t = timeit.Timer(stmt=s)
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
171.39 usec/pass

或字符串模块

>>> from string import split as stringsplit; 
>>> stringsplit('a short sized string with spaces '*100)

性能：字符串模块似乎比字符串方法表现更好

>>> s = "stringsplit('a short sized string with spaces '*100)"
>>> t = timeit.Timer(s, "from string import split as stringsplit")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
154.88 usec/pass

或者您可以使用RE引擎

>>> from re import split as resplit
>>> regex = '\s+'
>>> medstring = 'a short sized string with spaces '*100
>>> resplit(regex, medstring)

性能

>>> s = "resplit(regex, medstring)"
>>> t = timeit.Timer(s, "from re import split as resplit; regex='\s+'; medstring='a short sized string with spaces '*100")
>>> print "%.2f usec/pass" % (1000000 * t.timeit(number=100000)/100000)
540.21 usec/pass

对于很长的字符串，你不应该将整个字符串加载到内存中，而是分割行或使用迭代循环

Answer 16

试试这个：

  def adamsplit(s):
    result = []
    inquotes = False
    for substring in s.split('"'):
      if not inquotes:
        result.extend(substring.split())
      else:
        result.append(substring)
      inquotes = not inquotes
    return result

一些测试字符串：

'This is "a test"' -> ['This', 'is', 'a test']
'"This is \'a test\'"' -> ["This is 'a test'"]

在Python中用空格分割字符串 - 保留引用的子字符串

16 个答案: