Question

Problem: Is it possible to vectorize string matching between two DataFrames / Series?

Concept: I have two DataFrames (df_address, df_world_city):

df_address: contains a column with address data (e.g. "Sherlock Str.; Paris;")
df_world_city: contains columns with city names and the corresponding country ("FRA", "Paris")

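I go through each address and try to match it against all cities to find out which city is mentioned in the address, then add the corresponding country. Matched cities are saved in a list, which is the value of a dictionary keyed by country ({'FRA': ['Paris']}).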

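At the moment I mainly use for loops to iterate over the addresses and cities to match them. With multiprocessing (48 processes) and a large amount of data (df_address: 160,000 rows; df_world_city: 2,200,000 rows), this takes about 4-5 days.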

import re


def regex_city_matching(target, location):
    """Return True if `target` occurs in `location` as a whole word (case-insensitive)."""
    if not isinstance(target, str) or not isinstance(location, str) or len(target) <= 3:
        # Skip NaN and too-short cities
        return False
    # Match city only as full word, not a substring of another word
    pattern = re.compile(r'(^|\W)' + re.escape(target) + r'($|\W)', re.IGNORECASE)
    return pattern.search(location) is not None


import pandas as pd


# Written as a plain function (no `self`), since regex_city_matching above is one too.
def city_matching_no_country_multi_dict_simple(df_world_city, df_address):
    col_names = ['node_id', 'name', 'city_iso']
    df_matched_city_no_country = pd.DataFrame(columns=col_names)

    for index_city in df_world_city.index:
        # Iterate over each city
        w_city = df_world_city.at[index_city, 'city']
        if not isinstance(w_city, str) or len(w_city) <= 3:
            # Skip NaN and too-short cities
            continue

        w_country = df_world_city.at[index_city, 'iso']

        for ind_address in df_address.index:
            if regex_city_matching(w_city, df_address.at[ind_address, 'name']):
                node_id = df_address.at[ind_address, 'node_id']
                address = df_address.at[ind_address, 'name']
                if (df_matched_city_no_country['node_id'] == node_id).any():
                    # Address already matched: append new city / country
                    ind_append_address = df_matched_city_no_country.loc[
                        df_matched_city_no_country.node_id == node_id].index[0]
                    city_iso = df_matched_city_no_country.at[ind_append_address, 'city_iso']
                    if w_country in city_iso:
                        # Country in dictionary
                        city_iso[w_country].append(w_city)
                    else:
                        # Country not in dictionary
                        city_iso[w_country] = [w_city]
                else:
                    # Add new address with city / country
                    # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
                    new_row = pd.DataFrame(
                        [{'node_id': node_id, 'name': address, 'city_iso': {w_country: [w_city]}}])
                    df_matched_city_no_country = pd.concat(
                        [df_matched_city_no_country, new_row], ignore_index=True)

    return df_matched_city_no_country

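Edit: Thanks @lenik! Matching against a set of cities is much more efficient and very fast.

However, it is not fully implemented yet, because tests showed a high number of false positives.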

Answer 1

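You should build a reverse dictionary of the form { 'city' : 'COUNTRY', }. That way you do not have to iterate; you just access the entry directly in constant (O(1)) time.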
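A minimal sketch of such a reverse dictionary, under a few assumptions that are not in the original answer: the function name `build_city_index` and the sample data are hypothetical, names are lower-cased for case-insensitive lookup, and each city maps to a set of countries because the same name can exist in more than one country. It takes plain (city, iso) pairs so it runs without pandas.

```python
from collections import defaultdict


def build_city_index(city_iso_pairs, min_len=4):
    """Map lower-cased city name -> set of ISO country codes.

    `city_iso_pairs` is any iterable of (city, iso) pairs. With the
    DataFrame from the question this could be
    zip(df_world_city['city'], df_world_city['iso']).
    """
    index = defaultdict(set)
    for city, iso in city_iso_pairs:
        # Skip NaN and too-short names, like the question's code (len <= 3).
        if isinstance(city, str) and len(city) >= min_len:
            index[city.lower()].add(iso)
    return dict(index)


# Hypothetical sample data standing in for df_world_city.
pairs = [
    ("Paris", "FRA"),
    ("Paris", "USA"),
    ("London", "GBR"),
    ("Ulm", "DEU"),          # too short, skipped
    (float("nan"), "XXX"),   # NaN, skipped
]
city_index = build_city_index(pairs)
```

Building the index is a single pass over the city table; afterwards `city_index.get(name)` is a constant-time lookup on average.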

Beyond that, I would make a set() of known cities, so there is nothing to iterate over: a single quick lookup tells me whether the city is known or not.

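Finally, I would simplify the address parsing and avoid the very expensive regular expressions: convert everything to upper or lower case, replace non-alphabetic characters with spaces, and just .split() to get a list of words, instead of what you are doing now.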
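The parsing and lookup steps could be sketched as follows. This is an illustration rather than the answerer's code: `tokenize`, `match_cities` and the small `city_index` literal are hypothetical names and data, and matched cities are reported in lower case because the index keys are lower-cased.

```python
def tokenize(address):
    """Lower-case, turn every non-letter into a space, split into words."""
    if not isinstance(address, str):
        return []  # NaN or other non-string values yield no words
    cleaned = "".join(ch if ch.isalpha() else " " for ch in address.lower())
    return cleaned.split()


def match_cities(address, city_index):
    """Return {iso: [city, ...]} for each word of the address that is a known city."""
    matches = {}
    # dict.fromkeys de-duplicates words while keeping their order.
    for word in dict.fromkeys(tokenize(address)):
        # Hash lookup instead of a loop over all cities.
        for iso in sorted(city_index.get(word, ())):
            matches.setdefault(iso, []).append(word)
    return matches


# Hypothetical index: lower-cased city name -> set of ISO country codes.
city_index = {"paris": {"FRA", "USA"}, "london": {"GBR"}}
result = match_cities("Sherlock Str.; Paris;", city_index)
```

Each address costs one pass over its words plus one hash lookup per word, independent of how many cities are known. A single-word lookup like this cannot find multi-word names such as "New York"; checking consecutive word pairs against the same index is one possible extension.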

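With all these changes, processing 160,000 addresses against 2 million known cities will probably take 10 to 15 seconds.

Please let me know if you need a code example.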
