I've run into a problem.
I have a DataFrame called employer that looks like this:
employer
------------
wings brand activation i pvt ltd
hofincons infotech &industrial services pvt .ltd
bharat fritz werner bangalore
kludi rak indpvt ltd.
And another DataFrame, called pincode, that maps employer names to categories and looks like this:
Index Name FINAL_CATEGORY
68781 central board of excise and customs cat b
68782 c a g hotels pvt ltd cat b
68783 avaneetha textiles pvt ltd cat a
68784 trendy wheels pvt ltd cat a+
68785 wings brand activations india pvt ltd cat b
Now I would like to emulate something like the following:
pincode[pincode['Compnay Name'].str.contains('wings brand activation i pvt ltd')]
Compnay Name FINAL_CATEGORY
____________________________________
pincode[pincode['Compnay Name'].str.contains('wings brand activation i pvt')]
Compnay Name FINAL_CATEGORY
____________________________________
pincode[pincode['Compnay Name'].str.contains('wings brand activation i')]
Compnay Name FINAL_CATEGORY
____________________________________
pincode[pincode['Compnay Name'].str.contains('wings brand activation')]
Name FINAL_CATEGORY
68785 wings brand activations india pvt ltd cat b
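As you can see, for each string I cut it back from the end to the last space, one word at a time, and then search again.
The steps above need to be combined (with a regular expression, I think) so that, for every entry in the employer table, the whole pincode table is searched and the closest match is returned. If nothing matches, the result should be NaN.
Thanks in advance. The problem is hard to put into words, so please ask for any clarification.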
Answer 0 (score: 1)
You can use the following iterative approach:
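def find_substr(employer, pincode):
    # Index by employer name so each result can be written back by label.
    employer = employer.set_index("employer")
    for name in employer.index:
        words = name.split()
        length = len(words)
        found = False
        # Drop one trailing word at a time until some pincode name contains the prefix.
        while length > 0 and not found:
            substr = ' '.join(words[:length])
            # regex=False matches the text literally, so '(' or '.' need no escaping.
            mask = pincode.Name.str.contains(substr, regex=False)
            if mask.any():
                # Use the category of the first matching row.
                employer.loc[name, 'cat'] = pincode.loc[mask, 'FINAL_CATEGORY'].values[0]
                found = True
            length -= 1
    employer = employer.reset_index()
    return employer

employer = find_substr(employer, pincode)
print(employer)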
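To make the trimming loop easier to try out, here is a minimal self-contained sketch that runs the same idea on two rows taken from the question. The helper name match_category is hypothetical, the pincode column is assumed to be called Name as in the table above, and literal matching via regex=False is my choice rather than something the question asks for:

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question.
employer = pd.DataFrame({"employer": [
    "wings brand activation i pvt ltd",
    "bharat fritz werner bangalore",
]})
pincode = pd.DataFrame({
    "Name": ["trendy wheels pvt ltd", "wings brand activations india pvt ltd"],
    "FINAL_CATEGORY": ["cat a+", "cat b"],
})


def match_category(name, pincode):
    """Hypothetical helper: drop trailing words until some Name contains the prefix."""
    words = name.split()
    for length in range(len(words), 0, -1):
        prefix = " ".join(words[:length])
        # regex=False treats the prefix as literal text, not as a pattern.
        mask = pincode["Name"].str.contains(prefix, regex=False)
        if mask.any():
            # Several rows may match; this sketch simply takes the first one.
            return pincode.loc[mask, "FINAL_CATEGORY"].iloc[0]
    # No prefix of any length matched.
    return np.nan


employer["cat"] = employer["employer"].map(lambda name: match_category(name, pincode))
print(employer)
```

With these rows the first employer only matches once it has been cut back to "wings brand activation", and the second one never matches, so it stays NaN. Note that very short prefixes (a single common word) can match unrelated companies, so a minimum prefix length may be worth adding for real data.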
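Answer 1 (score: 0)
Here is one approach.
First convert the pins DataFrame into a dictionary that maps each name string to its category. Then use a nested list comprehension to build the cat column of the employer DataFrame, recording every category whose name contains the employer's name: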
import pandas as pd

# Example DataFrames
employer = pd.DataFrame({"employer":["wings brand activation i pvt ltd", "bharat fritz werner bangalore"]})
pins = pd.DataFrame({"Name":["trendy wheels pvt ltd", "wings brand activation i pvt ltd"], "FINAL_CATEGORY":["cat a+", "cat b"]})
# Map each company name to its category
dict_pins = dict(zip(pins['Name'], pins['FINAL_CATEGORY']))
# For each employer, collect the categories of all names that contain it
employer['cat'] = [[cat for key, cat in dict_pins.items() if x in key] for x in employer['employer']]
Output:
employer cat
0 wings brand activation i pvt ltd [cat b]
1 bharat fritz werner bangalore []
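The question asks for NaN when nothing matches, whereas the list comprehension leaves an empty list. A small variation, sketched here under the assumption that the first match is good enough, returns a single value instead; first_category is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

employer = pd.DataFrame({"employer": ["wings brand activation i pvt ltd",
                                      "bharat fritz werner bangalore"]})
pins = pd.DataFrame({"Name": ["trendy wheels pvt ltd", "wings brand activation i pvt ltd"],
                     "FINAL_CATEGORY": ["cat a+", "cat b"]})

dict_pins = dict(zip(pins["Name"], pins["FINAL_CATEGORY"]))


def first_category(name):
    """Hypothetical helper: first category whose key contains name, else NaN."""
    matches = [cat for key, cat in dict_pins.items() if name in key]
    return matches[0] if matches else np.nan


employer["cat"] = employer["employer"].map(first_category)
print(employer)
```

Like the original answer, this only checks whether the full employer string is a substring of a name; it does not trim trailing words.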