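I am trying to use pandas to replicate the (very cool) data-matching approach described here. The goal is to take the component parts (tokens) of a record and use them to match against another df.
I have been struggling to figure out how to retain the source ID and associate it with the individual tokens. I am hoping someone has a clever suggestion for how I could do this. I searched Stack but could not find a similar question.
Here is some sample data and the core code to illustrate. It takes a DataFrame, tokenizes selected columns, and generates the token, the token type, and the ID (but the ID part does not work):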
import pandas as pd
d = {'Id': [3,6], 'Org_Name': ['Acme Co Inc.', 'Buy Cats Here LLC'], 'Address': ['123 Hammond Lane', 'Washington, DC 20456']}
df = pd.DataFrame(data=d)
def tokenize_name(name):
    if isinstance(name, basestring) is True:
        clean_name = ''.join(c if c.isalnum() else ' ' for c in name)
        return clean_name.lower().split()
    else:
        return name
def tokenize_address(address):
    if isinstance(address, basestring) is True:
        clean_name = ''.join(c if c.isalnum() else ' ' for c in address)
        return clean_name.lower().split()
    else:
        return address
left_tokenizers = [
    ('Org_Name', 'name_tokens', tokenize_name),
    ('Address', 'address_tokens', tokenize_address)
]
#this works except for ID references
def prepare_join_keys(df, tokenizers):
    for source_column, key_name, tokenizer in tokenizers:
        for index in df.index:
            if source_column in df.columns:
                for record in df[source_column]:
                    if isinstance(record, float) is False:
                        for token in tokenizer(record):
                            yield (token, key_name, df.iloc[index]['Id'])
for item in prepare_join_keys(df, left_tokenizers):
    print item
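This code produces the correct tokens, but it yields every ID value for each token rather than only the corresponding ID. I know I have something wrong here, but I can't think of a way to do this with a generator function. The desired output would be: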
acme, name_tokens, 3
co, name_tokens, 3
inc, name_tokens, 3
buy, name_tokens, 6
cats, name_tokens, 6
here, name_tokens, 6
llc, name_tokens, 6
123, address_tokens, 3
hammond, address_tokens, 3
etc.
Answer 0 (score: 0)
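The index used to look up Id should not advance in its own dedicated for loop. It needs to advance at the same time as you move to a new record. I would suggest something like this: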
def prepare_join_keys(df, tokenizers):
    for source_column, key_name, tokenizer in tokenizers:
        # for index in df.index:
        if source_column in df.columns:
            for index, record in enumerate(df[source_column]):
                if isinstance(record, float) is False:
                    for token in tokenizer(record):
                        yield (token, key_name, df.iloc[index]['Id'])
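
For anyone running this on Python 3, here is a minimal self-contained sketch of the same idea. It makes a few assumptions that are not in the code above: `basestring` is replaced by `str`, a single hypothetical `tokenize` helper stands in for the two tokenizers from the question (their bodies are identical), any non-string value is skipped rather than only floats, and `zip` pairs each record with its Id instead of looking it up through `df.iloc`.

```python
import pandas as pd


def tokenize(value):
    # Hypothetical helper standing in for tokenize_name / tokenize_address:
    # replace non-alphanumeric characters with spaces, lower-case, split.
    clean = ''.join(c if c.isalnum() else ' ' for c in value)
    return clean.lower().split()


def prepare_join_keys(df, tokenizers):
    for source_column, key_name, tokenizer in tokenizers:
        if source_column not in df.columns:
            continue
        # zip walks the Id column and the source column in lockstep,
        # so each record stays paired with its own Id.
        for record_id, record in zip(df['Id'], df[source_column]):
            if not isinstance(record, str):
                continue  # assumption: skip NaN and other non-string values
            for token in tokenizer(record):
                yield (token, key_name, record_id)


d = {'Id': [3, 6],
     'Org_Name': ['Acme Co Inc.', 'Buy Cats Here LLC'],
     'Address': ['123 Hammond Lane', 'Washington, DC 20456']}
df = pd.DataFrame(data=d)

left_tokenizers = [
    ('Org_Name', 'name_tokens', tokenize),
    ('Address', 'address_tokens', tokenize),
]

keys = list(prepare_join_keys(df, left_tokenizers))
for item in keys:
    print(item)
```

Pairing the columns with `zip` also avoids relying on row positions, so it should behave the same if the DataFrame's index is not the default 0..n-1 range.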