Finding a regex match within a range of a Python dataframe

Asked: 2019-11-28 10:46:49

Tags: python regex pandas

I have run into a problem.

I have a dataframe called "employer" that looks like:

employer
------------
wings brand activation i pvt ltd
hofincons infotech &industrial services pvt .ltd
bharat fritz werner bangalore
kludi rak indpvt ltd.

I have another dataframe (called pincode) that maps employer names to categories and looks like:

Index   Name                                    FINAL_CATEGORY
68781   central board of excise and customs     cat b
68782   c a g hotels pvt ltd                    cat b
68783   avaneetha textiles pvt ltd              cat a
68784   trendy wheels pvt ltd                   cat a+
68785   wings brand activations india pvt ltd   cat b

Now I want to simulate something like this:

pincode[pincode['Name'].str.contains('wings brand activation i pvt ltd')]

Name    FINAL_CATEGORY
____________________________________

pincode[pincode['Name'].str.contains('wings brand activation i pvt')]

Name    FINAL_CATEGORY
____________________________________

pincode[pincode['Name'].str.contains('wings brand activation i')]

Name    FINAL_CATEGORY
____________________________________

pincode[pincode['Name'].str.contains('wings brand activation')]

        Name                                    FINAL_CATEGORY
68785   wings brand activations india pvt ltd   cat b

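As you can see, on each attempt I shorten the search string by dropping everything from its end back to the last space, and then search again.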
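
To make the shortening step explicit, the candidate search strings for one name could be generated as follows. This is only a minimal sketch: shrinking_prefixes is a hypothetical helper name, not something from pandas or from the original post.

```python
def shrinking_prefixes(name):
    """Yield the full name, then the name with trailing words dropped one at a time."""
    words = name.split()
    for length in range(len(words), 0, -1):
        yield " ".join(words[:length])

# Candidates are ordered from the longest (most specific) to the shortest
candidates = list(shrinking_prefixes("wings brand activation i pvt ltd"))
for candidate in candidates:
    print(candidate)
```

Searching with these candidates in order and stopping at the first hit reproduces the manual sequence of lookups shown above.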

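The steps above need to be combined (with a regex, I think) so that, for each entry in the employer table, the whole pincode dataframe is searched and the closest match is found. If there is no match, it should return NaN.

Thanks in advance. The problem is hard to put into words, so please ask for any clarification.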

2 Answers:

Answer 0: (score: 1)

You can use the following iterative approach:

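import re

def find_substr(employer, pincode):
    # Index by employer name so the category can be written back with .loc
    employer = employer.set_index("employer")
    for name in employer.index:
        words = name.split()
        # Drop trailing words one at a time until some pincode name contains the prefix
        for length in range(len(words), 0, -1):
            # re.escape guards against regex metacharacters such as '(' or '.'
            substr = re.escape(' '.join(words[:length]))
            mask = pincode['Name'].str.contains(substr)
            if mask.any():
                # Use the category of the first matching row
                employer.loc[name, 'cat'] = pincode.loc[mask, 'FINAL_CATEGORY'].values[0]
                break
    # Rows that never matched keep NaN in 'cat' (the column is created on the first match)
    return employer.reset_index()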

employer = find_substr(employer, pincode)
print(employer)
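
For a quick end-to-end check, the same idea can be run on a couple of rows rebuilt by hand from the tables in the question. This is a sketch under assumptions: closest_category is a hypothetical helper name, the two small frames are illustrative stand-ins for the real data, and regex=False is used so the prefix is treated as a literal substring rather than escaped.

```python
import numpy as np
import pandas as pd

# Small stand-ins for the frames shown in the question
employer = pd.DataFrame({"employer": [
    "wings brand activation i pvt ltd",
    "bharat fritz werner bangalore",
]})
pincode = pd.DataFrame({
    "Name": ["trendy wheels pvt ltd", "wings brand activations india pvt ltd"],
    "FINAL_CATEGORY": ["cat a+", "cat b"],
})

def closest_category(name, pincode):
    """Return the category of the first row matching the longest word prefix, else NaN."""
    words = name.split()
    for length in range(len(words), 0, -1):
        prefix = " ".join(words[:length])
        # Literal substring test, so characters like '(' or '.' need no escaping
        mask = pincode["Name"].str.contains(prefix, regex=False)
        if mask.any():
            return pincode.loc[mask, "FINAL_CATEGORY"].iloc[0]
    return np.nan

employer["cat"] = employer["employer"].apply(lambda name: closest_category(name, pincode))
print(employer)
```

Here the first row resolves to cat b through the prefix "wings brand activation", and the second row gets NaN. One caveat: a one-word prefix that is common across company names can match an unrelated row, so a minimum prefix length may be worth adding for real data.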

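Answer 1: (score: 0)

Here is one approach.

First convert the pins dataframe into a dictionary that maps each name to its category. Then use a nested list comprehension to build the cat column of the employer dataframe, recording every category whose name contains the employer's name: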

import pandas as pd

# Example dataframes
employer = pd.DataFrame({"employer":["wings brand activation i pvt ltd", "bharat fritz werner bangalore"]})
pins = pd.DataFrame({"Name":["trendy wheels pvt ltd", "wings brand activation i pvt ltd"], "FINAL_CATEGORY":["cat a+", "cat b"]})

dict_pins = dict(zip(pins['Name'], pins['FINAL_CATEGORY']))
employer['cat'] = [[dict_pins[key] for key in dict_pins.keys() if x in key] for x in employer['employer']]

Output:

                           employer      cat
0  wings brand activation i pvt ltd  [cat b]
1     bharat fritz werner bangalore       []
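
The question asks for NaN when nothing matches, whereas the list comprehension leaves an empty list. One possible follow-up, assuming the first match is an acceptable answer, is to collapse each list to a scalar. The frame below is rebuilt by hand from the output above so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# The result of the list comprehension, reconstructed from the printed output
employer = pd.DataFrame({
    "employer": ["wings brand activation i pvt ltd", "bharat fritz werner bangalore"],
    "cat": [["cat b"], []],
})

# Keep the first matching category, or NaN when the list is empty
employer["cat"] = employer["cat"].apply(lambda cats: cats[0] if cats else np.nan)
print(employer)
```

Note that this approach tests the full employer name only, so it does not shorten the name word by word as described in the question.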