Match a column against ID strings and assign a new value in a new column

Time: 2018-09-27 13:24:10

Tags: r

I have this data:

USDfirms <- c("GOOG", "BABA", "0071.TW")
TWRfirms <- c("3231.TW")
JPYfirms <- c("7752.T")

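I am trying to use the grepl function to create a new column. If the ticker in df matches a firm in one of the three string vectors above, the row should get the corresponding value: for example, a ticker matching firm 3231.TW gets TWRmatch, a ticker matching firm GOOG gets USDmatch, and so on.

The ticker values do not always match exactly, i.e. ticker 3231 is not an exact match for 3231.TW, which is why I want to use grepl so that the .TW suffix is ignored when matching.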

df <- structure(list(symbol = c("3231.TW", "3231.TW", "3231.TW", "3231.TW", 
"7752.T", "7752.T", "7752.T", "7752.T", "GOOG", "GOOG", "GOOG", 
"GOOG", "BABA", "BABA", "BABA", "BABA"), ticker = c("3231", "3231", 
"3231", "3231", "7752", "7752", "7752", "7752", "GOOG", "GOOG", 
"GOOG", "GOOG", "BABA", "BABA", "BABA", "BABA"), country = c("TW", 
"TW", "TW", "TW", "T", "T", "T", "T", NA, NA, NA, NA, NA, NA, 
NA, NA), year = c(2017L, 2016L, 2015L, 2014L, 2018L, 2017L, 2016L, 
2015L, 2017L, 2016L, 2015L, 2014L, 2018L, 2017L, 2016L, 2015L
)), .Names = c("symbol", "ticker", "country", "year"), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 123L, 124L, 125L, 126L, 127L, 128L, 
129L, 130L), class = "data.frame")

Edit:

This does not seem to work:

ifelse(grepl(USDfirms, df$ticker), "yes", "no")

I have also tried:

df$match <- ifelse(USDfirms %in% x$ticker, "yes", "no")

Neither of these works.

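1 answer:

Answer 0: (score: 1)

Not a perfect solution, but a brute-force approach is a nested lapply/sapply. This is a double loop: for every ticker we go over every element of firm_list, check whether the ticker occurs in any of that element's firm IDs, and if it does, extract the name of that list element.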

df$firms <- unlist(lapply(df$ticker, function(x)
        unlist(sapply(seq_along(firm_list), function(y) {
           if (any(grepl(x, unlist(firm_list[y])))) 
               names(firm_list[y])
})))) 

df

#     symbol ticker country year    firms
#1   3231.TW   3231      TW 2017 TWRfirms
#2   3231.TW   3231      TW 2016 TWRfirms
#3   3231.TW   3231      TW 2015 TWRfirms
#4   3231.TW   3231      TW 2014 TWRfirms
#5    7752.T   7752       T 2018 JPYfirms
#6    7752.T   7752       T 2017 JPYfirms
#7    7752.T   7752       T 2016 JPYfirms
#8    7752.T   7752       T 2015 JPYfirms
#123    GOOG   GOOG    <NA> 2017 USDfirms
#124    GOOG   GOOG    <NA> 2016 USDfirms
#125    GOOG   GOOG    <NA> 2015 USDfirms
#126    GOOG   GOOG    <NA> 2014 USDfirms
#127    BABA   BABA    <NA> 2018 USDfirms
#128    BABA   BABA    <NA> 2017 USDfirms
#129    BABA   BABA    <NA> 2016 USDfirms
#130    BABA   BABA    <NA> 2015 USDfirms

The firm_list used above collects all the firms in one named list so they are easier to check:

firm_list <- list(USDfirms = c("GOOG", "BABA", "0071.TW"), 
                  TWRfirms = c("3231.TW"), 
                  JPYfirms = c("7752.T"))
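
To make the inner check concrete, here is a minimal sketch of a single step of that double loop for one ticker. The variable names `ticker` and `hits` are only illustrative, and `fixed = TRUE` is an added assumption that the ticker should be treated as a literal substring rather than as a regular expression.

```r
firm_list <- list(USDfirms = c("GOOG", "BABA", "0071.TW"),
                  TWRfirms = c("3231.TW"),
                  JPYfirms = c("7752.T"))

# One ticker from df$ticker (illustrative value)
ticker <- "3231"

# For each group, ask: does the ticker occur inside any of its firm IDs?
# fixed = TRUE treats the ticker as a literal string, not a regex (assumption).
hits <- vapply(firm_list,
               function(firms) any(grepl(ticker, firms, fixed = TRUE)),
               logical(1))

# Name of the group (or groups) that contain the ticker
matched_group <- names(firm_list)[hits]
matched_group
```

Note the direction of the match: the ticker is the pattern and the firm IDs are the text being searched, because "3231" is contained in "3231.TW" and not the other way around.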

Or, actually, it is more convenient and faster to create a lookup data frame, match against it, and extract the group name from it.

ref_df <- data.frame(firms = unlist(firm_list), 
           names = rep(names(firm_list), lengths(firm_list)))

df$firms <- ref_df$names[sapply(df$ticker, function(x) grep(x, ref_df$firms))]


df
#     symbol ticker country year    firms
#1   3231.TW   3231      TW 2017 TWRfirms
#2   3231.TW   3231      TW 2016 TWRfirms
#3   3231.TW   3231      TW 2015 TWRfirms
#4   3231.TW   3231      TW 2014 TWRfirms
#5    7752.T   7752       T 2018 JPYfirms
#6    7752.T   7752       T 2017 JPYfirms
#7    7752.T   7752       T 2016 JPYfirms
#8    7752.T   7752       T 2015 JPYfirms
#123    GOOG   GOOG    <NA> 2017 USDfirms
#124    GOOG   GOOG    <NA> 2016 USDfirms
#125    GOOG   GOOG    <NA> 2015 USDfirms
#126    GOOG   GOOG    <NA> 2014 USDfirms
#127    BABA   BABA    <NA> 2018 USDfirms
#128    BABA   BABA    <NA> 2017 USDfirms
#129    BABA   BABA    <NA> 2016 USDfirms
#130    BABA   BABA    <NA> 2015 USDfirms
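
The grep-based lookup relies on every ticker matching exactly one firm; a ticker with no match or several matches would make sapply return a list, which cannot be used as an index. As a hedged alternative, the sketch below strips the suffix from the firm IDs and uses an exact match() instead. It assumes that the suffix is everything from the first dot onward, and the small `df` with the extra ticker "XXXX" is made up for illustration.

```r
firm_list <- list(USDfirms = c("GOOG", "BABA", "0071.TW"),
                  TWRfirms = c("3231.TW"),
                  JPYfirms = c("7752.T"))

# Small illustrative data frame; "XXXX" is a made-up ticker with no match
df <- data.frame(ticker = c("3231", "7752", "GOOG", "BABA", "XXXX"),
                 stringsAsFactors = FALSE)

# Flatten the list into two parallel vectors: firm ID and group name
firm_ids   <- unlist(firm_list, use.names = FALSE)
firm_group <- rep(names(firm_list), lengths(firm_list))

# Drop everything from the first dot onward, e.g. "3231.TW" -> "3231"
# (assumption: the exchange suffix always starts at the first dot)
firm_base <- sub("\\..*$", "", firm_ids)

# Exact, vectorized lookup; tickers without a match get NA
df$firms <- firm_group[match(df$ticker, firm_base)]
df
```

Because match() returns NA for tickers that are not found, unmatched rows end up with NA in the new column instead of raising an error.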