Regex, pandas, and flagging rows

Asked: 2021-04-26 19:45:58

Tags: python regex pandas dataframe

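I am trying to flag any record that contains a user-defined "incorrect" character. In this case, record two (2) should be returned as a non-valid record, but I seem to be capturing record 1 or 3 instead, both of which should be considered "correct". Any suggestions as to why these records are being flagged rather than the "bad" record?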

import pandas as pd
import numpy as np
import re

data = {'HOME1': ['123 Main St', '567\ Country Road', 'PO Box 900']}
dft = pd.DataFrame(data)

from itertools import chain
chars =[]
acceptable = [x for x in chain(range(48,58),range(32,33), range(65,91), range(97,123))]
for ch in acceptable:
    chars.append(chr(ch))

reg_list = map(re.compile,chars)

for x in dft['HOME1']:
    print(x)
    if any(re.match(x) for re in reg_list):
        conditions = [dft['HOME1'].apply(lambda x: x)!=x, dft['HOME1'].apply(lambda x: x)==x]
        choices = [0,1]
        dft['NonValidHOME1'] = np.select(conditions,choices,default=0)

try:
    print(dft.groupby(['NonValidHOME1'])[['HOME1']].filter(lambda x: len(x) ==1).agg(lambda x: x.tolist()))
except:
    print("no invalid Home1")


1 Answer:

Answer 0 (score: 0)

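# Flag a record when any of its characters is outside the acceptable set.
for x in dft['HOME1']:
    for c in x:
        if c not in chars:
            print(c, x)
            # Note: this rebuilds the whole column on every hit, so only the
            # most recently found offending record stays flagged.
            conditions = [dft['HOME1'] == x, dft['HOME1'] != x]
            choices = [1, 0]
            dft['NonValidHOME1'] = np.select(conditions, choices, default=0)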

#[print(c) for x in dft['HOME1'] for c in x if c not in chars]
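
For context, here is a small sketch of why the loop in the question flags the "correct" records. Each compiled pattern there is a single acceptable character, and `match` only tests the start of the string, so the check succeeds whenever a record begins with an acceptable character and never looks for an unacceptable one. In Python 3, `map` also returns a one-shot iterator, so each record resumes from wherever the previous one stopped. The `allowed` pattern below is an assumption reconstructed from the `acceptable` ranges in the question, and `re.fullmatch` is just one possible way to test the whole string.

```python
import re

records = ['123 Main St', '567\\ Country Road', 'PO Box 900']

# One single-character pattern, like the entries of reg_list in the question.
one_char = re.compile('5')
# match() anchors at the start only, so the record with the backslash still matches.
starts_ok = one_char.match(records[1]) is not None

# Assumed equivalent of the acceptable set: digits, space, ASCII letters.
allowed = re.compile(r'[0-9A-Za-z ]*')
# fullmatch() succeeds only if every character belongs to the allowed class.
valid = [allowed.fullmatch(r) is not None for r in records]

print(starts_ok, valid)
```

Under these assumptions only the second record fails the whole-string check, which is the behaviour the question was after.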

        

Thanks for the comments. They put me on a "better" path, or at least on one that led me to an answer.
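
As a possible alternative to the nested loops, the same flag could be computed in one pass with a negated character class and `Series.str.contains`. This is a minimal sketch, not the approach used in the answer: the `bad_char` pattern is an assumption that mirrors the question's acceptable ranges (digits, ASCII letters, space).

```python
import pandas as pd

dft = pd.DataFrame({'HOME1': ['123 Main St', '567\\ Country Road', 'PO Box 900']})

# Negated class: any character that is NOT a digit, ASCII letter, or space.
# Assumption: this matches the intent of the question's acceptable list.
bad_char = r'[^0-9A-Za-z ]'

# str.contains searches the whole string; every row is evaluated independently,
# so several invalid records can be flagged at once.
dft['NonValidHOME1'] = dft['HOME1'].str.contains(bad_char, regex=True).astype(int)

invalid = dft.loc[dft['NonValidHOME1'] == 1, 'HOME1'].tolist()
print(invalid if invalid else "no invalid Home1")
```

Because the flag column is built once from a boolean mask, it does not depend on loop order and it exists even when no record is invalid.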