How can I use gsub to match strings and remove the non-matching strings at the same time?

Time: 2019-06-03 19:15:26

Tags: r lapply gsub

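I have a data frame with a column of strings that I want to label with the following categories: city, country, and continent. I am using gsub to replace every city with "City", every country with "Country", and every continent with "Continent".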

#This is what I have
dataframe
Color     Letter     Words
red       A          Paris,Asia,parrot,Antarctica,North America,cat,lizard
blue      A          Panama,New York,Africa,dog,Tokyo,Washington DC,fish
red       B          Copenhagen,bird,USA,Japan,Chicago,Mexico,insect
blue      B          Israel,Antarctica,horse,South America,North America,turtle,Brazil

#This is what I want
dataframe
Color     Letter     New
red       A          City,Continent
blue      A          Country,City,Continent
red       B          City,Country
blue      B          Country,Continent


#This is the code I have so far
dataframe$New <- NA

#groups all the cities
dataframe$New <- lapply(dataframe$Words, function(x) {
   gsub("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", "City", x)})

#groups all the countries
dataframe$New <- lapply(dataframe$Words, function(x) {
   gsub("Panama|USA|Japan|Mexico|Israel|Brazil", "Country", x)})

#groups all the continents
dataframe$New <- lapply(dataframe$Words, function(x) {
   gsub("Asia|Antarctica|Africa|North America|South America", "Continent", x)})

dataframe$Words <- NULL

How do I avoid overwriting dataframe$New each time, and how do I remove the extra words (e.g. fish, horse, cat)?

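The data above is an example based on a very large dataset. In that dataset, the Words column contains many repeated values. Here are some example rows from dataframe$Words: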

Words
Panama,Paris
Panama,Israel,cat
Panama,Paris,horse,
Panama,Asia
Panama
Panama,Chicago
Israel,Chicago
Israel,lizard,Paris
Israel,Panama,horse,Africa

2 Answers:

Answer 0: (score: 1)

Consider pasting together several ifelse calls, each of which checks for a specific set of strings:

dataframe$New <- paste(ifelse(grepl("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", dataframe$Words), "City", "N/A"), 
                       ifelse(grepl("Panama|USA|Japan|Mexico|Israel|Brazil", dataframe$Words), "Country", "N/A"),
                       ifelse(grepl("Asia|Antarctica|Africa|North America|South America", dataframe$Words), "Continent", "N/A"),
                       sep=",")

dataframe$New <- gsub("N/A,|,N/A", "", dataframe$New)

dataframe

#   Color Letter                                                             Words                    New
# 1   red      A             Paris,Asia,parrot,Antarctica,North America,cat,lizard         City,Continent
# 2  blue      A               Panama,New York,Africa,dog,Tokyo,Washington DC,fish City,Country,Continent
# 3   red      B                   Copenhagen,bird,USA,Japan,Chicago,Mexico,insect           City,Country
# 4  blue      B Israel,Antarctica,horse,South America,North America,turtle,Brazil      Country,Continent

Or a DRY-er version with do.call + lapply (here the data frame is named df):

strs <- list(c("Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago", "City"),
             c("Panama|USA|Japan|Mexico|Israel|Brazil", "Country"),
             c("Asia|Antarctica|Africa|North America|South America", "Continent"))

df$New2 <- do.call(paste,
                   c(lapply(strs, function(s) ifelse(grepl(s[1], df$Words), s[2], "N/A")), 
                     list(sep=",")))
df$New2 <- gsub("N/A,|,N/A", "", df$New2)
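
One thing to watch with the "N/A" placeholder: a row that matches no category at all ends up as "N/A" rather than an empty string, because the final gsub only removes placeholders that have a comma next to them. A possible variant, sketched below, skips the placeholder and builds the label directly from a logical matrix of matches. The names patterns, words, hits, and new are hypothetical, and the sample strings are made up for illustration. Like the code above, it relies on grepl, so it matches substrings rather than whole comma-separated words.

```r
# Hypothetical category patterns; the names become the labels
patterns <- c(
  City      = "Paris|New York|Tokyo|Washington DC|Copenhagen|Chicago",
  Country   = "Panama|USA|Japan|Mexico|Israel|Brazil",
  Continent = "Asia|Antarctica|Africa|North America|South America"
)

# Small made-up sample, including a row with no matches
words <- c("Paris,Asia,parrot", "cat,dog", "Panama")

# One logical column per category, one row per element of words
hits <- do.call(cbind, lapply(patterns, grepl, x = words))

# For each row, keep only the category names that matched
new <- apply(hits, 1, function(h) paste(colnames(hits)[h], collapse = ","))

print(new)
```

With this layout a row without any match yields an empty string, which can be turned into NA afterwards if that is preferred.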

Answer 1: (score: 0)

It is better to create a list of key/value pairs and then, after the replacement, extract the elements by matching the keys:

library(gsubfn)
# key/value list
lst1 <- list(Paris = "City", `New York` = "City", Tokyo = "City",
             `Washington DC` = "City", Copenhagen = "City", Chicago = "City",
             Panama = "Country", USA = "Country", Japan = "Country",
             Mexico = "Country", Israel = "Country", Brazil = "Country",
             Asia = "Continent", Antarctica = "Continent", Africa = "Continent",
             `North America` = "Continent", `South America` = "Continent")

strapply extracts the values that match the keys into a list. sapply then loops over that list and pastes together the unique strings "City", "Continent", or "Country" for each row.
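
As a rough illustration of the same key/value idea, here is a sketch that swaps strapply for base R: it splits each string on commas and looks every word up in a named vector. This is not the gsubfn code from the answer. The names lookup, words, and label_words are hypothetical, and the sketch assumes the words are separated by plain commas.

```r
# Hypothetical lookup table: names are the words to match, values are the labels
lookup <- c(Paris = "City", `New York` = "City", Tokyo = "City",
            `Washington DC` = "City", Copenhagen = "City", Chicago = "City",
            Panama = "Country", USA = "Country", Japan = "Country",
            Mexico = "Country", Israel = "Country", Brazil = "Country",
            Asia = "Continent", Antarctica = "Continent", Africa = "Continent",
            `North America` = "Continent", `South America` = "Continent")

# First two rows of the example data from the question
words <- c("Paris,Asia,parrot,Antarctica,North America,cat,lizard",
           "Panama,New York,Africa,dog,Tokyo,Washington DC,fish")

label_words <- function(x, lookup) {
  tokens <- strsplit(x, ",", fixed = TRUE)
  vapply(tokens, function(tok) {
    labels <- lookup[trimws(tok)]             # NA for words not in the table
    labels <- unique(labels[!is.na(labels)])  # drop misses, keep first-seen order
    paste(labels, collapse = ",")
  }, character(1), USE.NAMES = FALSE)
}

new <- label_words(words, lookup)
print(new)
```

Because it matches whole words, this version drops the extra words (cat, fish, and so on) without needing a pattern for them, and the labels appear in the order in which each category first occurs in the row.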