Question

Suppose I have this string:

[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text

I would like to extract key/value pairs from it, for example as a pandas DataFrame like this:

Key      Value
aaa      some text here  
bbbb3    some other text here  
cc       more text

Answer 1

Try this regex, which captures your key and value in named capture groups.

\[\s*(?P<key>\w+)+\s*]\s*(?P<value>[^[]*\s*)

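Explanation:

\[ -> [ has a special meaning (it opens a character class), so it must be escaped to match a literal [
\s* -> consumes any whitespace before the key, which should not be part of the key
(?P<key>\w+)+ -> forms a named group key that captures one or more word characters [a-zA-Z0-9_]. I used \w to keep it simple, because the OP's keys contain only alphanumeric characters; otherwise a [^]] character class should be used to capture everything inside the square brackets as the key. (The + after the group is redundant, since \w+ already matches one or more characters.)
\s* -> consumes the whitespace after the key, which should not be part of the key
] -> matches a literal ], which does not need escaping
\s* -> consumes any unwanted whitespace before the value
(?P<value>[^[]*\s*) -> forms a named group value that captures any character except [, at which point capturing stops and the captured text is stored in the named group value.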

Python code:

import re
s = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

arr = re.findall(r'\[\s*(?P<key>\w+)+\s*]\s*(?P<value>[^[]*\s*)', s)
print(arr)

Output:

[('aaa', 'some text here '), ('bbbb3', 'some other text here '), ('cc', 'more text')]
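
The named groups can also be read by name instead of by tuple position. A minimal sketch, assuming `re.finditer` and `Match.group(name)` suit your case; the pattern here drops the redundant outer `+` on the key group, and the variable names are only illustrative:

```python
import re

s = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

# Same idea as the pattern above, without the redundant outer "+" on the key group
pattern = re.compile(r'\[\s*(?P<key>\w+)\s*]\s*(?P<value>[^[]*)')

# Look each group up by its name and strip the trailing space from the value
pairs = {m.group('key'): m.group('value').strip() for m in pattern.finditer(s)}
print(pairs)
```

Stripping the value here removes the trailing space that shows up in the `findall` output above.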

Answer 2

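Use re.findall, then extract the regions of interest into columns. You can strip the whitespace afterwards as needed.

Since you mentioned you are open to reading this into a DataFrame, you can leave that job to pandas.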

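import re
import pandas as pd

text = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'
matches = re.findall(r'\[(.*?)\](.*?)(?=\[|$)', text)

# strip surrounding whitespace from every column
df = (pd.DataFrame(matches, columns=['Key', 'Value'])
        .apply(lambda x: x.str.strip()))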

df
     Key                 Value
0    aaa        some text here
1  bbbb3  some other text here
2     cc             more text

Or (re: the edit to the question),

df = (pd.DataFrame(matches, columns=['Key', 'Value'])
        .apply(lambda x: x.str.strip())
        .set_index('Key')
        .transpose())

Key               aaa                 bbbb3         cc
Value  some text here  some other text here  more text

The pattern matches the text inside the brackets, followed by the text outside them up to the next opening bracket.

\[      # Opening square brace 
(.*?)   # First capture group
\]      # Closing brace
(.*?)   # Second capture group
(?=     # Look-ahead 
   \[   # Next brace,
   |    # Or,
   $    # End of string
)
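
The annotated pattern above can be used nearly as written by compiling it with `re.VERBOSE`. This is a sketch, assuming the same sample string as in the question; the stripping step is done in plain Python here instead of pandas:

```python
import re

text = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

# re.VERBOSE ignores whitespace in the pattern and allows "#" comments
pattern = re.compile(r"""
    \[      # opening square bracket
    (.*?)   # first capture group: the key
    \]      # closing square bracket
    (.*?)   # second capture group: the value
    (?=     # look-ahead, does not consume input
       \[   # next opening bracket,
       |    # or
       $    # end of the string
    )
""", re.VERBOSE)

matches = [(k.strip(), v.strip()) for k, v in pattern.findall(text)]
print(matches)
```

Keeping the comments inside the pattern means the explanation and the regex cannot drift apart.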

Answer 3

You can use re.split() to minimise the regex needed and output to a dictionary. For example:

import re

text = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

# split text on "[" or "]" and slice off the first empty list item
items = re.split(r'[\[\]]', text)[1:]

# loop over consecutive pairs in the list to create a dict
d = {items[i].strip(): items[i+1].strip() for i in range(0, len(items) - 1, 2)}

print(d)
# {'aaa': 'some text here', 'bbbb3': 'some other text here', 'cc': 'more text'}
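
The pairing step can also be written without index arithmetic. A small sketch of the same split, assuming the input always alternates key and value as in the sample string; it pairs even and odd positions with `zip`:

```python
import re

text = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

# split on "[" or "]", drop the empty first item, and strip each piece
items = [part.strip() for part in re.split(r'[\[\]]', text)[1:]]

# even positions hold keys, odd positions hold values
d = dict(zip(items[::2], items[1::2]))
print(d)
```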

Answer 4

No regex is actually needed here; simple string splitting does the job:

s = "[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text"    

parts = s.split("[")  # parts looks like: ['', 
                      #                    'aaa   ] some text here ',
                      #                    'bbbb3 ] some other text here ', 
                      #                    'cc    ] more text'] 
d = {}
# split parts further
for p in parts:
    if p.strip():
        key,value = p.split("]")            # split each part at ] and strip spaces
        d[key.strip()] = value.strip()      # put into dict

# Output:
form = "{:10} {}"
print(form.format("Key", "Value"))

for i in d.items():
    print(form.format(*i))

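Output (the order shown comes from an older Python where dicts were unordered; Python 3.7+ prints the keys in insertion order):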

Key        Value
cc         more text
aaa        some text here
bbbb3      some other text here

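For the formatting used above, see the Python documentation on str.format and the format specification mini-language.

As a near one-liner: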

d = {hh[0].strip():hh[1].strip() for hh in (k.split("]") for k in s.split("[") if k)}
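
One caveat with the loop version: `p.split("]")` is unpacked into exactly two names, so it raises `ValueError` if a value itself contains `]`. A minimal sketch using `str.partition`, which splits only at the first `]`; the sample string with the extra bracket is made up for illustration:

```python
# Hypothetical input: the second value contains a stray "]"
s = "[aaa   ] some text here [bbbb3 ] value with ] inside [cc    ] more text"

d = {}
for part in s.split("["):
    if not part.strip():
        continue  # skip the empty piece before the first "["
    # partition splits at the first "]" only and always returns three items
    key, _, value = part.partition("]")
    d[key.strip()] = value.strip()

print(d)
```

A value that contains `[` would still be split in the wrong place, so this only covers the closing-bracket case.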

Answer 5

You can use finditer:

import re

s = '[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text'

pattern = re.compile(r'\[(\S+?)\s+\]([\s\w]+)')  # raw string avoids invalid-escape warnings
result = [(match.group(1).strip(), match.group(2).strip()) for match in pattern.finditer(s)]
print(result)

Output:

[('aaa', 'some text here'), ('bbbb3', 'some other text here'), ('cc', 'more text')]

Answer 6

Using a regex, you can find the key/value pairs, store them in a dictionary, and print them:

import re

mystr = "[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text"

a = dict(re.findall(r"\[([A-Za-z0-9_\s]+)\]([A-Za-z0-9_\s]+(?=\[|$))", mystr))

for key, value in a.items():
    print(key, value)

# OUTPUT (keys and values keep their surrounding whitespace):
# aaa     some text here 
# bbbb3   some other text here 
# cc      more text

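The regex matches 2 groups:
the first group is all the letters, digits and spaces enclosed in square brackets; the second group is all the letters, digits and spaces that come after the closing square bracket and are followed by an opening square bracket or the end of the line.

First group: \[([A-Za-z0-9_\s]+)\]
Second group: ([A-Za-z0-9_\s]+(?=\[|$))

Note that the second group contains a positive lookahead: (?=\[|$). If the opening square bracket were matched without a lookahead, it would be consumed, and the next match would not find its opening square bracket.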
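
To see what the lookahead buys you, compare it with a variant that consumes the bracket. This is only an illustrative sketch; the second pattern is a hypothetical rewrite that swaps the lookahead for a non-capturing group and is not part of the answer's code:

```python
import re

mystr = "[aaa   ] some text here [bbbb3 ] some other text here [cc    ] more text"

# Lookahead: the next "[" is checked but not consumed
with_lookahead = re.findall(
    r"\[([A-Za-z0-9_\s]+)\]([A-Za-z0-9_\s]+(?=\[|$))", mystr)

# Hypothetical variant: the non-capturing group consumes the "[" instead
without_lookahead = re.findall(
    r"\[([A-Za-z0-9_\s]+)\]([A-Za-z0-9_\s]+)(?:\[|$)", mystr)

print([k.strip() for k, _ in with_lookahead])     # all three keys
print([k.strip() for k, _ in without_lookahead])  # the middle key is skipped
```

In the second case the `[` that opens `bbbb3` is swallowed by the first match, so the search resumes after it and only picks up again at `[cc    ]`.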

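findall returns a list of tuples: [(key1,value1), (key2,value2), (key3,value3),...].
A list of tuples can be converted directly into a dictionary: dict(my_tuple_list).

Once you have the dict, you can work with the key/value pairs :)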

Extract key/value pairs from text containing square brackets (log file)
