I'm building a data-matching script that joins two datasets on tokens. The code works, but with a large number of records and tokenized fields it takes a very long time to complete. I'm looking for advice on how to make the computation more efficient.
I'll point out the poorly performing areas below, but first some context:
# example df
import pandas as pd

d = {'id': [3, 6],
     'Org_Name': ['Acme Co Inc.', 'Buy Cats Here Inc'],
     'Address': ['123 Hammond Lane, Washington, DC', 'Washington, DC 20456']}
left_df = pd.DataFrame(data=d)
# example tokenizer
def tokenize_name(name):
    if isinstance(name, str):  # use basestring on Python 2
        clean_name = ''.join(c if c.isalnum() else ' ' for c in name)
        return clean_name.lower().split()
    else:
        return name
# tokenizers assigned to columns
left_tokenizers = [
    ('Org_Name', tokenize_name),
    ('Address', tokenize_name),
]
# example token dictionary
tokens_dct = {
    'acme': 1,
    'co': 1,
    'inc': 0,
    'buy': 1,
    'cats': 1,
    'here': 1,
    '123': 1,
    'hammond': 1,
    'lane': 0,
    'washington': 1,
    'dc': 1,
    '20456': 1,
}
# this is the generator function used to create token/ID pairs
import numbers

def prepare_join_keys(df, tokenizers):
    for source_column, tokenizer in tokenizers:
        if source_column in df.columns:
            for index, record in enumerate(df[source_column]):
                if not isinstance(record, numbers.Integral):  # control for longs
                    if not isinstance(record, float):  # control for NaNs
                        for token in tokenizer(record):
                            if tokens_dct[token] == 1:  # keep only tokens present in the dictionary with value 1
                                yield (token, df.iloc[index]['id'])
# THIS CODE TAKES A LONG TIME TO RUN
left_keyed = pd.DataFrame(columns=('token', 'id'))
for item in prepare_join_keys(left_df, left_tokenizers):
    left_keyed.loc[len(left_keyed)] = item
left_keyed
A dictionary is used to prune common tokens (LLC, Corp, www, etc.), but with many tokens this is still computationally expensive. I'm wondering whether the way I insert the generated token/ID pairs into the DataFrame is inefficient. Is there a better way (see the sketch below for one idea)? I'm also wondering whether I've made a computational mistake by using if rather than elif.
Thanks.
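One alternative I'm considering is to collect the pairs in a list and build the DataFrame in a single call, instead of growing it row by row. A minimal sketch, assuming the same generator and columns as above:
pairs = list(prepare_join_keys(left_df, left_tokenizers))
left_keyed = pd.DataFrame(pairs, columns=('token', 'id'))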
Answer 0 (score: 0)
No real reason to be doing this in pandas. It's much more efficient to use a prebuilt tokenizer. This should do what you want.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd
# since you have a predefined vocabulary, you can fix it here
vocabulary = np.array([w for w, b in tokens_dct.items() if b])
cv = CountVectorizer(vocabulary=vocabulary)
frame_list = []

for colname in ['Org_Name', 'Address']:
    tokenmapping = cv.fit_transform(left_df[colname])
    df_row, token_id = tokenmapping.nonzero()
    frame_list.append(pd.DataFrame(
        np.vstack([vocabulary[token_id], left_df['id'].values[df_row]]).T,
        columns=['token', 'id']))

left_keyed = pd.concat(frame_list)
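The actual token join is then a plain merge. A minimal sketch, assuming the second dataset has been keyed the same way into a frame called right_keyed (an illustrative name, not from the original code):
# assumes right_keyed was produced from the right-hand dataset with the same CountVectorizer steps
matches = left_keyed.merge(right_keyed, on='token', suffixes=('_left', '_right'))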