Check whether a string in one column matches the abbreviated form of the string in another column

Time: 2018-02-06 22:53:27

Tags: r string dataframe

I have a large data frame "df" with 2 columns:

**column1**                             **column2**
The City of New York                     TCNY
The Land of the Free                     TLF
Stellar Stars Basketball Program         SSBP
Center for Life Sciences                 CLS
Children's Hospital of Los Angeles       CHLA
New York Yankees                         NY
etc                                      etc

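I did some research and saw that mapply can apply a function across two columns at once, but I am not sure what function to write. I am thinking of a function that picks out all the capital letters in the column1 string and checks whether those capital letters are present in column2, but I am not really sure how to do that. Any help would be great. Thank you very much!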

3 answers:

Answer 0: (score: 0)

Here is an example of what I think you may be trying to achieve (on a subset of the rows shown in your question):

df <- data.frame(
  col_1 = c("The City of New York", "The Land of the Free", "New York Yankees"),
  col_2 = c("TCNY", "TLF", "NY")
)

> df
                 col_1 col_2
1 The City of New York  TCNY
2 The Land of the Free   TLF
3     New York Yankees    NY

# Add a third column indicating whether the capitalised letters of the first
# column are equal to the strings in the second
df$col_3 <- unlist(apply(df, 1, function(x) gsub("[^A-Z]", "", x[1]) == x[2]))

> df
                 col_1 col_2 col_3
1 The City of New York  TCNY  TRUE
2 The Land of the Free   TLF  TRUE
3     New York Yankees    NY FALSE

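Above, I use gsub to remove every character that is not a capital letter from the first column's value, and then compare the result with the second column. This happens inside an apply call, which operates on each row of the data frame. I then wrap the result in unlist so that it is a plain vector that can be stored in the third column of the data frame df (apply already returns a vector here, so unlist is only a safeguard).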
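Since gsub is vectorised, the row-wise apply is not strictly needed. Below is a minimal sketch of the same comparison done on whole columns, followed by the mapply form mentioned in the question. The helper name `caps_match` and the extra column `col_4` are introduced here for illustration only, and `stringsAsFactors = FALSE` is set so that both columns are plain character vectors.

```r
df <- data.frame(
  col_1 = c("The City of New York", "The Land of the Free", "New York Yankees"),
  col_2 = c("TCNY", "TLF", "NY"),
  stringsAsFactors = FALSE
)

# gsub() works on the whole column at once: keep only the capital letters
caps <- gsub("[^A-Z]", "", df$col_1)
df$col_3 <- caps == df$col_2

# The same check with mapply(), which walks both columns in parallel.
# caps_match is a hypothetical helper, not part of the original answer.
caps_match <- function(full, abbr) gsub("[^A-Z]", "", full) == abbr
df$col_4 <- mapply(caps_match, df$col_1, df$col_2, USE.NAMES = FALSE)

print(df)
```

Both columns should agree; the vectorised form avoids a function call per row, which may matter on a large data frame.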

Answer 1: (score: 0)

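Using base R (here dat is assumed to be the data frame from the question, with columns column1 and column2):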

   transform(dat, correctABBV = x <- gsub("[^A-Z]", "", column1), check = x == column2)
                             column1 column2 correctABBV check
1               The City of New York    TCNY        TCNY  TRUE
2               The Land of the Free     TLF         TLF  TRUE
3   Stellar Stars Basketball Program    SSBP        SSBP  TRUE
4           Center for Life Sciences     CLS         CLS  TRUE
5 Children's Hospital of Los Angeles    CHLA        CHLA  TRUE
6                   New York Yankees      NY         NYY FALSE

Answer 2: (score: 0)

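Here is one way. I am not sure whether you want etc to count as an abbreviation; for now, I treat it as one. First, I create abbreviations from the first column. I use stri_count() to check how many words each string contains. When the logical condition (more than one word) is TRUE, I use gsub() to extract the capital letters. When it is FALSE, I put the element of mycol1 into abb unchanged. Finally, I check whether the elements of abb and mycol2 are identical, and create check.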

mydf <- data.frame(mycol1 = c("The City of New York", "The Land of the Free", "Stellar Stars Basketball Program",
                              "Center for Life Sciences", "Children's Hospital of Los Angeles", "New York Yankees", "etc"),
                   mycol2 = c("TCNY", "TLF", "SSBP", "CLS", "CHLA", "NY", "etc"),
                   stringsAsFactors = FALSE)    

library(dplyr)
library(stringi)

mutate(mydf,
       abb = if_else(stri_count(mycol1, regex = "\\w+") > 1,
                     gsub(x = mycol1, pattern = "[^A-Z]",replacement = ""),
                     mycol1),
       check = abb == mycol2)

                              mycol1 mycol2  abb check
1               The City of New York   TCNY TCNY  TRUE
2               The Land of the Free    TLF  TLF  TRUE
3   Stellar Stars Basketball Program   SSBP SSBP  TRUE
4           Center for Life Sciences    CLS  CLS  TRUE
5 Children's Hospital of Los Angeles   CHLA CHLA  TRUE
6                   New York Yankees     NY  NYY FALSE
7                                etc    etc  etc  TRUE
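
The same logic can be sketched in base R without dplyr or stringi. This is only an illustration on three of the rows. It counts words by splitting on whitespace, which is an assumption and not identical to the `\\w+` match count used above (for example, an apostrophe does not split a word here). The name `n_words` is introduced for this sketch.

```r
mydf <- data.frame(
  mycol1 = c("The City of New York", "New York Yankees", "etc"),
  mycol2 = c("TCNY", "NY", "etc"),
  stringsAsFactors = FALSE
)

# Number of whitespace-separated words in each string (assumed word definition)
n_words <- lengths(strsplit(mydf$mycol1, "\\s+"))

# Multi-word strings: keep only the capital letters.
# Single-word strings: use the string itself as its abbreviation.
mydf$abb <- ifelse(n_words > 1, gsub("[^A-Z]", "", mydf$mycol1), mydf$mycol1)
mydf$check <- mydf$abb == mydf$mycol2

print(mydf)
```

Whether a single-word entry such as etc should count as its own abbreviation is a design choice carried over from the answer above; dropping the ifelse branch would make that row fail the check instead.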