I'm working on a project that uses RecordLinkage in RStudio to fuzzy-merge two very large data sets.
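My two data sets, set1 and set2, have different numbers of variables, but I need to link the data based on two columns named "Address" and "housetring". However, I have about 20 variables in total that are either in both data sets or in set1 but not in set2.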
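Previous questions on this site suggest creating new sets that contain only the columns I want to match on. However, I cannot lose the other variables in the process.
Here is my code: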
RLBigDataLinkage(set1, set2, identity1 = NA, identity2 = NA,
  # columns to leave out of the comparison, given as a character vector
  exclude = c("zillow_id", "comment", "housenumber", "unit", "city",
              "postalcode", "district", "state", "id", "random", "Street",
              "City", "housestreet", "fulladdress", "parcelid", "propertyid",
              "usecode", "latitude", "longitude", "housenumberfraction",
              "streetdirectionprefix", "streetname", "streetsuffix",
              "streetdirectionsuffix", "unitprefix", "zipplusfour", "street"),
  # string comparison on all remaining columns, using Jaro-Winkler
  strcmp = TRUE, strcmpfun = "jarowinkler")
Even though I have excluded every column that does not match, I still get an error saying that set1 and set2 have a different number of columns.
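For context, here is a minimal sketch of the subset-based workaround from the earlier questions, arranged so that the other variables are not lost. It is not a confirmed fix. It assumes that both sets share a matching column with the same name (here `Address`) and that the linkage result identifies records by row position. The toy data, the helper columns `row1` and `row2`, and the hard-coded `links` table are hypothetical stand-ins, not actual RecordLinkage output.

```r
# Toy stand-ins for set1 and set2 (hypothetical data; the real sets are far larger).
set1 <- data.frame(Address = c("12 Main St", "40 Oak Ave"),
                   zillow_id = c(101, 102),
                   comment = c("a", "b"),
                   stringsAsFactors = FALSE)
set2 <- data.frame(Address = c("12 Main Street", "7 Pine Rd"),
                   parcelid = c("p1", "p2"),
                   stringsAsFactors = FALSE)

# Row keys, so every linked pair can be traced back to the full original rows.
set1$row1 <- seq_len(nrow(set1))
set2$row2 <- seq_len(nrow(set2))

# Linkage inputs: only the matching column(s), same names and order in both sets.
match_cols <- "Address"
link1 <- set1[, match_cols, drop = FALSE]
link2 <- set2[, match_cols, drop = FALSE]
stopifnot(identical(names(link1), names(link2)))

# link1 and link2 now have the same format and could be passed to the linkage
# function in place of set1 and set2. Suppose that step reports that row 1 of
# set1 matches row 1 of set2 (hard-coded here, purely for illustration):
links <- data.frame(row1 = 1L, row2 = 1L)

# Re-attach all original variables from both sets via the row keys.
merged <- merge(merge(links, set1, by = "row1"), set2, by = "row2",
                suffixes = c(".set1", ".set2"))
print(merged)
```

The idea is that the subsets only feed the comparison step, while set1 and set2 stay intact and are joined back afterwards, so no variable is dropped from the final result.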
Any suggestions would be greatly appreciated!