Question

I'm using Python and regex, and I'm trying to convert a string like this:

(1694439,805577453641105408,'\"@Bessemerband not reverse gear  simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked.\"',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'\"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( \"',2911510,NULL,NULL,NULL),

into a list like this:

[
    [1694439, 805577453641105408, '\"@Bessemerband not reverse gear  simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked.\"', 2887640, NULL, NULL, NULL],
    [1649240, 805577446758158336, '\"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :(\"', 2911510, NULL, NULL, NULL]
]

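The main problem is that, as you can see, the text itself contains parentheses that I don't want to split on. I have tried things like \([^)]+\), but obviously that stops at the first ) it finds.

Any clues on how to solve this?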

Answer 1

Is this the output you're looking for?

big = """(1694439,805577453641105408,'\"@Bessemerband not reverse gear  simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked.\"',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'\"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( \"',2911510,NULL,NULL,NULL),"""
small = big.split('),')
print(small)

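What I'm doing is splitting on ), and then looping over the pieces and splitting each one on commas as usual. Here is a basic approach that can be optimized: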

new_list = []

for x in small:
    new_list.append(x.split(','))
print(new_list)

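The downside is that the last entry is a list holding only an empty string (['']), but you can drop it afterwards. Note also that the first field of each row keeps its opening parenthesis, and that this approach breaks as soon as the text itself contains a comma or the sequence ),.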
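The cleanup mentioned above could look like the following. This is only a sketch on a shortened, made-up sample in the same shape as the question's data; `rows` is a hypothetical name, and it still assumes that no field contains a comma or the sequence ),.

```python
# Shortened, hypothetical sample shaped like the data in the question.
big = "(1,2,'riot if (Brexit) is blocked.',3,NULL),(4,5,'brexit :( ',6,NULL),"

rows = []
for chunk in big.split('),'):
    if not chunk:
        # Skip the empty string left over after the final '),'.
        continue
    # Drop the row's opening parenthesis, then split the fields on commas.
    rows.append(chunk.lstrip('(').split(','))

print(rows)
```

Every field stays a string here, including the quoted text with its surrounding quotes.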

Answer 2

Here is a simple regex solution that captures each comma-separated value in a separate group:

\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*)

Usage:

input_string = r"""(1694439,805577453641105408,'\"@Bessemerband not reverse gear  simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked.\"',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'\"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( \"',2911510,NULL,NULL,NULL),"""

import re
# findall returns one 7-tuple of strings per row
result = re.findall(r"\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*)", input_string)
print(result)
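If typed values are wanted rather than tuples of strings, the matches can be post-processed. The sketch below reuses the same pattern on a shortened, made-up sample; the `convert` helper and the choice to map NULL to None are assumptions for illustration, not part of the original answer.

```python
import re

# Same pattern as in the answer; the sample rows are shortened and hypothetical.
pattern = re.compile(
    r"\(([^,]*),([^,]*),'((?:\\.|[^'])*)',([^,]*),([^,]*),([^,]*),([^)]*)"
)
sample = r"(1,2,'I didn\'t say (Brexit), ok',3,NULL,NULL,NULL),(4,5,'brexit :( ',6,NULL,NULL,NULL),"

def convert(field):
    """Map NULL to None and digit strings to int; leave other text unchanged."""
    if field == "NULL":
        return None
    if field.isdigit():
        return int(field)
    return field

# One list per row instead of one tuple of strings per row.
rows = [[convert(field) for field in match] for match in pattern.findall(sample)]
print(rows)
```

Because the quoted field is matched as a unit, commas and parentheses inside the text do not disturb the split. The backslash escapes inside the text are kept as they appear in the input.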

Answer 3

The nested parentheses aren't a problem here, because they are enclosed in quotes. All you have to do is match the quoted parts separately:

import re

pat = re.compile(r"[^()',]+|'[^'\\]*(?:\\.[^'\\]*)*'|(\()|(\))", re.DOTALL)

s = r'''(1694439,805577453641105408,'\"@Bessemerband not reverse gear  simply pointing out that I didn\'t say what you claim I said. I will absolutely riot if (Brexit) is blocked.\"',2887640,NULL,NULL,NULL),(1649240,805577446758158336,'\"Ugh FFS the people you use to look up to fail to use critical thinking. Smh. He did the same thing with brexit :( \"',2911510,NULL,NULL,NULL),'''

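result = []

for m in pat.finditer(s):
    if m.group(1):      # opening parenthesis outside quotes: start a new row
        tmplst = []
    elif m.group(2):    # closing parenthesis outside quotes: the row is complete
        result.append(tmplst)
    else:               # unquoted value or quoted string: add it to the current row
        tmplst.append(m.group(0))

print(result)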

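If your string may also contain parentheses that are not enclosed in quotes, you can solve the problem with a recursive pattern from the regex module (combining it with the csv module is a good idea) or by building a state machine.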
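As a rough illustration of the state-machine option, here is a minimal sketch that tracks whether it is inside a quoted string and how deeply parentheses are nested. The function name `split_rows` and the sample input are hypothetical, and the sketch assumes single-quoted strings with backslash escapes and balanced parentheses outside quotes.

```python
def split_rows(text):
    """Split "(a,'b',c),(d,...)" into lists of raw field strings."""
    rows, fields, buf = [], [], []
    depth = 0          # nesting depth of parentheses outside quotes
    in_quote = False   # True while inside a '...' literal
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quote:
            buf.append(ch)
            if ch == "\\" and i + 1 < len(text):
                buf.append(text[i + 1])   # keep the escaped character verbatim
                i += 1
            elif ch == "'":
                in_quote = False          # closing quote ends the literal
        elif ch == "'":
            in_quote = True
            buf.append(ch)
        elif ch == "(":
            depth += 1
            if depth > 1:
                buf.append(ch)            # nested parenthesis belongs to the field
        elif ch == ")":
            depth -= 1
            if depth == 0:                # end of a row
                fields.append("".join(buf))
                rows.append(fields)
                fields, buf = [], []
            else:
                buf.append(ch)
        elif ch == "," and depth == 1:    # field separator at row level
            fields.append("".join(buf))
            buf = []
        elif depth >= 1:
            buf.append(ch)                # ordinary character inside a row
        i += 1
    return rows

# Hypothetical input with an unquoted nested parenthesis in the third field.
sample = r"(1,'a\'b (x), y',f(2,3),NULL),(4,'z :( ',5,NULL),"
result = split_rows(sample)
print(result)
```

Commas between rows are ignored because they occur at depth 0, and commas inside quotes or nested parentheses stay part of their field.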
