Question

I'm working on a text-pattern problem. I have the following input:

term = 'CG-14/0,2-L-0_2'

I need to remove every punctuation character (delimiter) from the input term. Essentially, I need the following output from it:

'CG1402L02'

Before removing the delimiters, I also need to store each delimiter and its position (in any format: object, dict, tuple, etc.).

Example output (if returned as tuples):

((-,2), (/,5), (,,7), (-,9), (-,11), (_,13))

I can get the cleaned output with the following Python code:

re.sub(r'[^\w]', '', term.replace('_', ''))

But how do I store the delimiters and their positions (in the most efficient way) before removing them?

Answer 1

You can do something like the following, adding any other delimiters you need to the list delims:

term = 'CG-14/0,2-L-0_2'
delims = ['-', '/', ',', '_']
locations = []
for pos, c in enumerate(term):  # iterate over the characters together with their positions
    if c in delims:
        locations.append([c, pos])  # store the character and its original position

Then run the re.sub call from the question to remove them.
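The two steps can also be sketched with the re module alone. This is a minimal sketch, not part of the original answer: it assumes the delimiters are exactly the ones listed in delims, and the names pattern and cleaned are illustrative. It swaps the explicit loop for re.finditer, which yields each match together with its start index.

```python
import re

term = 'CG-14/0,2-L-0_2'
delims = ['-', '/', ',', '_']

# Build a character class from the delimiter list; re.escape guards
# characters such as '-' that are special inside [...].
pattern = re.compile('[' + re.escape(''.join(delims)) + ']')

# Record (delimiter, original position) before anything is removed.
locations = [(m.group(), m.start()) for m in pattern.finditer(term)]

# Remove the delimiters in a second pass over the string.
cleaned = pattern.sub('', term)

print(locations)
print(cleaned)
```

This still scans the string twice (once to record, once to strip), so it is a convenience rather than a speed-up over the loop.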

Answer 2

You only need to walk through term once, collecting all the necessary information along the way:

from string import ascii_letters,digits

term = 'CG-14/0,2-L-0_2'

# set of allowed characters: a-z, A-Z, 0-9
# set lookup is O(1) on average - fast
ok = set(digits + ascii_letters)

specials = {}
clean = []
for i,c in enumerate(term):
    if c in ok:
        clean.append(c)
    else:
        specials.setdefault(c,[])
        specials[c].append(i)

cleaned = ''.join(clean)

print(clean)
print(cleaned)
print(specials)

Output:

['C', 'G', '1', '4', '0', '2', 'L', '0', '2']     # list of characters in set ok 
CG1402L02                                         # the ''.join()ed list 

{'-': [2, 9, 11], '/': [5], ',': [7], '_': [13]}  # dict of characters/positions not in ok

Alternatively, you can use

specials = []

and inside the loop:

else:
    specials.append((c,i))

to get a list of tuples instead of a dict:

[('-', 2), ('/', 5), (',', 7), ('-', 9), ('-', 11), ('_', 13)]
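Because the stored indices refer to positions in the original string, the list of tuples should be enough to rebuild term from the cleaned text. The following is a minimal sketch of that sanity check; the helper name restore is hypothetical and not part of the original answer.

```python
def restore(cleaned, specials):
    """Reinsert (char, original_index) pairs into the cleaned string."""
    chars = list(cleaned)
    # Insert in ascending index order so that each original index is
    # already valid in the partially rebuilt list when it is used.
    for c, i in sorted(specials, key=lambda pair: pair[1]):
        chars.insert(i, c)
    return ''.join(chars)


specials = [('-', 2), ('/', 5), (',', 7), ('-', 9), ('-', 11), ('_', 13)]
rebuilt = restore('CG1402L02', specials)
print(rebuilt)
```

If rebuilt equals the original term, the stored delimiters and positions are complete.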
