Split on spaces, except between certain characters

Time: 2012-03-10 07:37:46

Tags: python string-parsing

I am parsing a file that contains lines such as
type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")

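I want to split it into separate fields.

In my example, there are four fields: type, title, pages, and comments.

The desired result after splitting is

['type("book")', 'title("golden apples")', 'pages(10-35 70 200-234)', 'comments("good read")']

Obviously, a simple string split will not work, because it splits on every space. I want to split on spaces, but keep anything between parentheses or quotes together.

How can I split this?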

4 answers:

Answer 0: (score: 12)

This regex should work for you: \s+(?=[^()]*(?:\(|$))

import re
result = re.split(r"\s+(?=[^()]*(?:\(|$))", subject)  # subject: the line to split

Explanation

r"""
\s             # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
   +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?=            # Assert that the regex below can be matched, starting at this position (positive lookahead)
   [^()]          # Match a single character NOT present in the list “()”
      *              # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   (?:              # Match the regular expression below
                     # Match either the regular expression below (attempting the next alternative only if this one fails)
         \(             # Match the character “(” literally
      |              # Or match regular expression number 2 below (the entire group fails if this one fails to match)
         $              # Assert position at the end of a line (at the end of the string or before a line break character)
   )
)
"""
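For reference, here is a minimal runnable sketch of that regex applied to the sample line from the question. The names `line`, `pattern` and `fields` are just illustrative, and it assumes the parentheses are not nested and never appear inside the quoted text.

```python
import re

# Sample line from the question; variable names are illustrative only.
line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'

# Split on whitespace only when the next parenthesis ahead is an opening one
# (or no parenthesis is left), i.e. when we are not inside a (...) group.
pattern = re.compile(r"\s+(?=[^()]*(?:\(|$))")
fields = pattern.split(line)

for field in fields:
    print(field)
```

With nested groups such as a(b (c) d) the lookahead sees an opening parenthesis first and would split inside the outer group, so this is only suitable for flat input like the example.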

Answer 1: (score: 2)

Split on ") ", then add the ) back to every element except the last one.
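A rough sketch of that idea, in case it helps. `split_fields` is a hypothetical helper name, and it assumes every field ends with ")" and that the sequence ") " never occurs inside a field.

```python
def split_fields(line):
    """Split on ') ' and restore the ')' that the split consumed."""
    pieces = line.split(") ")
    # Every piece except the last lost its closing parenthesis to the separator.
    return [piece + ")" for piece in pieces[:-1]] + pieces[-1:]


line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'
for field in split_fields(line):
    print(field)
```

This avoids regular expressions entirely, but it only copes with exactly one space between fields.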

Answer 2: (score: 1)

I would try using a positive lookbehind assertion.

r'(?<=\))\s+'

Example:

>>> import re
>>> result = re.split(r'(?<=\))\s+', 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")')
>>> result
['type("book")', 'title("golden apples")', 'pages(10-35 70 200-234)', 'comments("good read")']

Answer 3: (score: 1)

Let me add a non-regex solution:

line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'

count = 0 # Bracket counter
last_break = 0 # Index of the last break
parts = []
for j,char in enumerate(line):
    if char == '(': count += 1
    elif char == ')': count -= 1
    elif char == ' ' and count == 0:
        parts.append(line[last_break:(j)])
        last_break = j+1
parts.append(line[last_break:]) # Add last element
parts = tuple(p for p in parts if p) # Convert to tuple and remove empty

for p in parts:
    print(p)

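In general, there are certain things you cannot do with regular expressions, and they can carry a serious performance penalty (especially with lookahead and lookbehind), so they may not be the best solution for a given problem.

Also, I thought I'd mention the pyparsing module, which can be used to build custom text parsers.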