Question

I've run into a problem.

I have a dataframe named employer that looks like this:

employer
------------
wings brand activation i pvt ltd
hofincons infotech &industrial services pvt .ltd
bharat fritz werner bangalore
kludi rak indpvt ltd.

and another dataframe (called pincode) that maps employer names to categories and looks like this:

Index   Name                                    FINAL_CATEGORY
68781   central board of excise and customs     cat b
68782   c a g hotels pvt ltd                    cat b
68783   avaneetha textiles pvt ltd              cat a
68784   trendy wheels pvt ltd                   cat a+
68785   wings brand activations india pvt ltd   cat b

Now I want to emulate something like this:

pincode[pincode['Name'].str.contains('wings brand activation i pvt ltd')]

Name    FINAL_CATEGORY
____________________________________

pincode[pincode['Name'].str.contains('wings brand activation i pvt')]

Name    FINAL_CATEGORY
____________________________________

pincode[pincode['Name'].str.contains('wings brand activation i')]

Name    FINAL_CATEGORY
____________________________________

pincode[pincode['Name'].str.contains('wings brand activation')]

        Name                                    FINAL_CATEGORY
68785   wings brand activations india pvt ltd   cat b

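As you can see, for each string I keep shortening it from the end back to the last space and then search again.

The steps above need to be put together (with a regex, I think) so that, for each entry in the employer table, the whole pincode dataframe is searched and the closest match is found. If there is no match, it should return NaN.

Thanks in advance. The problem is hard to put into words, so please ask for any clarification.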

Answer 1

You can use the following iterative approach:

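import re

def find_substr(employer, pincode):
    # Index by employer name so each result can be written back with .loc
    employer = employer.set_index("employer")
    for name in employer.index:
        words = name.split()
        length = len(words)
        found = False
        # Drop one trailing word at a time until some pincode name contains the prefix
        while length > 0 and not found:
            # re.escape keeps characters such as '(' or '.' from being read as regex syntax
            substr = re.escape(' '.join(words[:length]))
            mask = pincode.Name.str.contains(substr)
            if mask.any():
                # Use the category of the first matching row
                employer.loc[name, 'cat'] = pincode.loc[mask, 'FINAL_CATEGORY'].values[0]
                found = True
            length -= 1
    employer = employer.reset_index()
    return employer

employer = find_substr(employer, pincode)
print(employer)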
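
For anyone who wants to try the idea end to end, here is a minimal sketch of the same prefix-shortening search written as a per-row function, using the sample rows from the question. The helper name closest_category is made up for this illustration, and returning NaN when nothing matches is an assumption based on what the question asks for.

```python
import re

import numpy as np
import pandas as pd

# Sample data copied from the question.
employer = pd.DataFrame({"employer": [
    "wings brand activation i pvt ltd",
    "hofincons infotech &industrial services pvt .ltd",
    "bharat fritz werner bangalore",
    "kludi rak indpvt ltd.",
]})
pincode = pd.DataFrame({
    "Name": [
        "central board of excise and customs",
        "c a g hotels pvt ltd",
        "avaneetha textiles pvt ltd",
        "trendy wheels pvt ltd",
        "wings brand activations india pvt ltd",
    ],
    "FINAL_CATEGORY": ["cat b", "cat b", "cat a", "cat a+", "cat b"],
})

def closest_category(name, pincode):
    """Hypothetical helper: category for the longest word prefix of `name`
    that occurs inside some pincode.Name, or NaN if no prefix matches."""
    words = name.split()
    # Try the full name first, then drop one trailing word per iteration.
    for length in range(len(words), 0, -1):
        pattern = re.escape(" ".join(words[:length]))
        mask = pincode["Name"].str.contains(pattern)
        if mask.any():
            return pincode.loc[mask, "FINAL_CATEGORY"].iloc[0]
    return np.nan

employer["cat"] = employer["employer"].apply(lambda name: closest_category(name, pincode))
print(employer)
```

With this data only the first employer finds a match (at the prefix "wings brand activation"); the other three get NaN. Keep in mind that very short prefixes, such as a single common word, can match unrelated names, so you may want to stop shortening before the prefix gets down to one word.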

Answer 2

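Here is one way to do it.

First convert the pins dataframe into a dictionary that maps each name to its category. Then use a nested list comprehension to build the cat column of the employer dataframe, recording every category whose name contains the employer's name: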

import pandas as pd

# Example df
employer = pd.DataFrame({"employer":["wings brand activation i pvt ltd", "bharat fritz werner bangalore"]})
pins = pd.DataFrame({"Name":["trendy wheels pvt ltd", "wings brand activation i pvt ltd"], "FINAL_CATEGORY":["cat a+", "cat b"]})

# Map each name in pins to its category
dict_pins = dict(zip(pins['Name'], pins['FINAL_CATEGORY']))
# For each employer, collect the categories of every pins name that contains it
employer['cat'] = [[cat for name, cat in dict_pins.items() if x in name] for x in employer['employer']]

Output:

                           employer      cat
0  wings brand activation i pvt ltd  [cat b]
1     bharat fritz werner bangalore       []
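
If you would rather have a single value per row, with NaN when nothing matches (as the question asks), the list can be collapsed to its first element. This is only a sketch on the same example data; the helper name first_category is made up here.

```python
import numpy as np
import pandas as pd

# Same example data as above.
employer = pd.DataFrame({"employer": ["wings brand activation i pvt ltd",
                                      "bharat fritz werner bangalore"]})
pins = pd.DataFrame({"Name": ["trendy wheels pvt ltd", "wings brand activation i pvt ltd"],
                     "FINAL_CATEGORY": ["cat a+", "cat b"]})

dict_pins = dict(zip(pins["Name"], pins["FINAL_CATEGORY"]))

def first_category(name):
    """Hypothetical helper: category of the first pins name containing `name`, else NaN."""
    return next((cat for key, cat in dict_pins.items() if name in key), np.nan)

employer["cat"] = employer["employer"].map(first_category)
print(employer)
```

Note that this only checks whether the whole employer string is contained in a pins name. It does not shorten the string word by word, so with the question's real data ("wings brand activations india pvt ltd") the first row would not match either.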

Finding a regular expression within a range in a Python dataframe