Matching abbreviated names to their full names

Time: 2017-02-23 03:15:06

Tags: r

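I have recently been working on a project that requires combining two data sets by company name. However, the company names in one data set (call it A) are abbreviations of the names in the other (call it B). For example, "ATT" or "AM TEL& TEL" in data set A corresponds to "AMERICAN TELEPHONE& TELEG CO" in B.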

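My first attempt was to split the names in both data sets on spaces, take the first letter of each piece, and match on those initials. It failed because I could not find a way to split the strings on spaces.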
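
A minimal sketch of that first-letter idea in base R, for reference. The helper name `initials` is hypothetical, and treating "&" as a separator alongside whitespace is an assumption based on the example names above. It uses only `strsplit`, `substr`, and `paste`.

```r
# Hypothetical helper: build an acronym from the first letter of each word.
initials <- function(x) {
  x <- toupper(trimws(as.character(x)))
  # Split on runs of whitespace and/or "&" (assumed separators).
  words <- strsplit(x, "[[:space:]&]+")
  out <- vapply(
    words,
    function(w) paste(substr(w[nzchar(w)], 1, 1), collapse = ""),
    character(1)
  )
  # Keep missing names missing instead of turning them into the string "NA".
  out[is.na(x)] <- NA_character_
  out
}

initials(c("AM TEL& TEL", "AMERICAN TELEPHONE& TELEG CO"))
# "ATT" "ATTC"
```

The initials of the full name pick up a trailing "C" from "CO", so exact equality on initials would probably miss this pair. Dropping common suffixes such as CO, CORP, or INC before comparing, or checking whether one set of initials is a prefix of the other, may be needed.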

I also tried grepl and grep, but they only seem to work for strings without spaces, and a pattern has to be supplied.

This can probably be done with some regular expression technique, but at the time of writing I still have not found a way to accomplish it.
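
One possible regular expression approach, sketched under assumptions: build the pattern from the abbreviated name itself, so that each abbreviated token must be a prefix of the corresponding leading word of the full name. The helper name `prefix_match` is hypothetical, and the rule "each token is a word prefix" is an assumption that holds for "AM TEL& TEL" but not for every row in the data.

```r
# Hypothetical helper: TRUE where each token of `abbr` is a prefix of the
# corresponding leading word of `full` (case-insensitive).
prefix_match <- function(abbr, full) {
  # Split the abbreviation into alphanumeric tokens; this also removes "&"
  # and spaces, so the tokens contain no regex metacharacters.
  tokens <- strsplit(toupper(abbr), "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return(rep(FALSE, length(full)))
  # e.g. "AM TEL& TEL" becomes
  # "^AM[[:alnum:]]*[^[:alnum:]]+TEL[[:alnum:]]*[^[:alnum:]]+TEL[[:alnum:]]*"
  pattern <- paste0(
    "^",
    paste0(tokens, "[[:alnum:]]*", collapse = "[^[:alnum:]]+")
  )
  grepl(pattern, toupper(as.character(full)))
}

prefix_match("AM TEL& TEL", "AMERICAN TELEPHONE& TELEG CO")  # TRUE
prefix_match("ATT", "AMERICAN TELEPHONE& TELEG CO")          # FALSE
```

This does not cover acronyms such as "ATT" or abbreviations that drop inner letters such as "CENTRY" for "CENTURY". Those cases would likely need the initials comparison or approximate matching, for example base R's `agrepl` or `adist`.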

Can this task be done in R? If so, how? Below is some of the data from my data sets.

structure(list(abbreviated = structure(c(1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 26L, 27L, 27L, 28L, 29L, 
30L, 31L, 32L, 49L, 60L, 51L, 52L, 33L, 34L, 35L, 36L, 37L, 38L, 
39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 50L, 53L, 54L, 
56L, 57L, 58L, 55L, 59L, 61L), .Label = c("20 20 SPORT", "20TH CENTRY", 
"20th Century Fox", "20TH CENTY", "21ST CENTY TELECOM GROUP INC", 
"238 Telecom Limited", "24 7", "24 7 MEDIA INC", "24 7 Real Media Inc", 
"247Media Inc", "360 COMMUN", "360NETWORKS INC", "3C COMM INTL", 
"3COM", "3Com Corp", "3COM Corp", "3COM CORP", "3D COMMUN", "3D Industrial Electronics PTE", 
"3Dfx", "3m", "3M", "3M Co", "3M CO", "3M Corporation", "3M Unitek", 
"3M UNITEK", "3SBio Inc", "7 Eleven  Inc ", "7 Eleven Inc", "7 ELEVEN INC", 
"A   C WHSL", "A 1 International Inc", "A 1 INTL INC", "A 1 LEASING", 
"A 2 Z STORES", "A A FOODS", "A A R P", "A B DICK", "A B DRACO", 
"A C COIN SLOT", "A D", "A D  KRAUTH", "A D I", "a e", "A E TELEVISION NTWK", 
"A G A BURDOX", "A G I P S P A", "A G INC", "A H ROBINS", "A Lassonde Inc ", 
"A P", "A S Dampskibsselskabet Torm", "A STURM   SON", "A T Clayton   Co", 
"A T T", "A T T Corp", "A T T CORP", "A T T TECH", "A W RESTRNT", 
"A123 Systems"), class = "factor"), full = structure(c(1L, 2L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 
17L, 18L, 19L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), .Label = c("20TH CENTURY ENERGY CORP", "20TH CENTURY INDUSTRIES", 
"20TH CENTURY INDUSTRIES CA", "21ST CENTURY DISTRIBUTION CORP", 
"21ST CENTURY FILMS CORP", "21ST CENTURY HOLDING CO", "21ST CENTURY INSURANCE GROUP", 
"21ST CENTURY ROBOTICS", "24 7 MEDIA INC", "24 7 REAL MEDIA INC", 
"360 COMMUNICATIONS CO", "360NETWORKS INC", "3COM CORP", "3DFX INTERACTIVE INC", 
"3M CO", "3SBIO INC", "7 ELEVEN INC", "A   A FOODS LTD", "ROBINS A H INC"
), class = "factor")), .Names = c("abbreviated", "full"), row.names = c(NA, 
63L), class = "data.frame")

Any suggestions would be greatly appreciated. Thanks in advance.

0 answers:

No answers