Question

I have used re.search to extract the uniqueID string from a larger string. For example:

import re

string = 'example string with this uniqueID: 300-350'
combination = r'(\d+)[-](\d+)'  # raw string so the backslashes reach the regex engine unchanged
m = re.search(combination, string)
print(m.group(0))

Out: '300-350'

I created a data frame with uniqueID and combinations as columns.

    uniqueID    combinations
0   300-350     (\d+)[-](\d+)
1   off-250     (\w+)[-](\d+)
2   on-stab     (\w+)[-](\w+)

I also have a dictionary, meaning_combination, that maps each combination to the meaning of the variables it represents:

meaning_combination={'(\\d+)[-](\\d+)': 'A-B',
 '(\\w+)[-](\\d+)': 'C-A',
 '(\\w+)[-](\\w+)': 'C-D'}

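I want to create a new column for each variable (A, B, C, D) and fill it with the corresponding values.

The final result should look like this: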

    uniqueID    combinations   A    B   C     D
0   300-350     (\d+)[-](\d+)  300  350 
1   off-250     (\w+)[-](\d+)       250       off
2   on-stab     (\w+)[-](\w+)           stab  on

Answer 1

I would change your regular expressions to:

meaning_combination={r'(\d+-\d+)': 'A-B',
 r'([^0-9\W]+\-\d+)': 'C-A',
 r'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}

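This captures the whole ID in a single group instead of producing three groups.

i.e. (300-350, 300, 350) -> (300-350)

You don't need the two extra capture groups: once a particular pattern matches, you already know where the word or digit characters sit (by your own definition of the pattern), and you can split on '-' to access them separately.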

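For example:

text = 'example string with this uniqueID: 300-350'
values = re.findall(r'(\d+-\d+)', text)
# values is ['300-350']

# first digit part:
values[0].split('-')[0]
# '300'

With this approach, you can loop over the dictionary keys and the list of strings and test whether each pattern matches the string. If it does (len(re.findall(pattern, string)) != 0), take the dictionary value for that key, split it, split the match as well, and assign dictionary_value.split('-')[0] : match[0].split('-')[0] and dictionary_value.split('-')[1] : match[0].split('-')[1] in a new dictionary created inside the loop. Also assign the full match to Unique ID and the matched pattern to Combination. Then use pandas to build a data frame.

All together: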

import re
import pandas as pd

stri= ['example string with this uniqueID: 300-350', 'example string with this uniqueID: off-250', 'example string with this uniqueID: on-stab']

meaning_combination={r'(\d+-\d+)': 'A-B',
 r'([^0-9\W]+\-\d+)': 'C-A',
 r'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}

values = [{'Unique ID': re.findall(x, st)[0], 'Combination': x, y.split('-')[0] : re.findall(x, st)[0].split('-')[0], y.split('-')[1] : re.findall(x, st)[0].split('-')[1]} for st in stri for x, y in meaning_combination.items() if len(re.findall(x, st)) != 0]


df = pd.DataFrame.from_dict(values)

#just to sort it in order since default is alphabetical 
col_val = ['Unique ID', 'Combination', 'A', 'B', 'C', 'D']

df = df.reindex(sorted(df.columns, key=lambda x: col_val.index(x) ), axis=1)
print(df)
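The one-line comprehension above can be hard to follow, so here is a minimal sketch of the same steps as an explicit loop. The helper name build_rows and the variable names are assumptions made for illustration, not part of the original answer, and re.search is swapped in for the repeated re.findall calls; it returns the same first match.

```python
import re

strings = [
    'example string with this uniqueID: 300-350',
    'example string with this uniqueID: off-250',
    'example string with this uniqueID: on-stab',
]

meaning_combination = {
    r'(\d+-\d+)': 'A-B',
    r'([^0-9\W]+\-\d+)': 'C-A',
    r'([^0-9\W]+\-[^0-9\W]+)': 'C-D',
}


def build_rows(strings, patterns):
    """Hypothetical helper: one dict per (string, pattern) pair that matches."""
    rows = []
    for text in strings:
        for pattern, meaning in patterns.items():
            match = re.search(pattern, text)
            if match is None:
                continue  # this pattern does not occur in this string
            unique_id = match.group(1)                     # e.g. 'off-250'
            first_name, second_name = meaning.split('-')   # e.g. 'C', 'A'
            first_value, second_value = unique_id.split('-')
            rows.append({
                'Unique ID': unique_id,
                'Combination': pattern,
                first_name: first_value,
                second_name: second_value,
            })
    return rows


rows = build_rows(strings, meaning_combination)
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame and reordering the columns as in the code above should give the same data frame, with NaN in the columns a row does not fill.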

Output:

  Unique ID             Combination    A    B    C     D
0   300-350               (\d+-\d+)  300  350  NaN   NaN
1   off-250        ([^0-9\W]+\-\d+)  250  NaN  off   NaN
2   on-stab  ([^0-9\W]+\-[^0-9\W]+)  NaN  NaN   on  stab

Also, note that I think there is a typo in your expected output. You have:

'(\\w+)[-](\\d+)': 'C-A'

which matches off-250, but your final result shows:

    uniqueID    combinations   A    B   C     D
1   off-250     (\w+)[-](\d+)       250       off

According to the key, these values should be in C and A.
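If the data frame from the question already exists, a variant that keeps the asker's original two-group patterns may be more direct. The sketch below is an assumption-laden alternative rather than the answer's method: the helper split_unique_id is hypothetical, and it relies on each row's pattern matching the whole uniqueID so that re.fullmatch can pair the captured groups with the letters in meaning_combination.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'uniqueID': ['300-350', 'off-250', 'on-stab'],
    'combinations': [r'(\d+)[-](\d+)', r'(\w+)[-](\d+)', r'(\w+)[-](\w+)'],
})

meaning_combination = {
    r'(\d+)[-](\d+)': 'A-B',
    r'(\w+)[-](\d+)': 'C-A',
    r'(\w+)[-](\w+)': 'C-D',
}


def split_unique_id(unique_id, pattern):
    """Hypothetical helper: map each captured group to its variable name."""
    names = meaning_combination[pattern].split('-')   # e.g. ['C', 'A']
    match = re.fullmatch(pattern, unique_id)
    if match is None:
        return {}  # pattern does not describe this ID; leave all columns empty
    return dict(zip(names, match.groups()))


parts = [
    split_unique_id(uid, pat)
    for uid, pat in zip(df['uniqueID'], df['combinations'])
]

# Missing variables become NaN; columns are fixed so the order is A, B, C, D.
parts_df = pd.DataFrame(parts, index=df.index, columns=['A', 'B', 'C', 'D'])
result = df.join(parts_df)
print(result)
```

This keeps uniqueID and combinations untouched and only appends the four new columns, which is closer to the layout requested in the question.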
