How can I use ifelse and grepl to create a new column based on substrings of a column that holds long strings?

Asked: 2017-06-06 15:59:56

Tags: r if-statement grepl

First, here are the rows of the column ac$Summary:
    1
    during a demonstration flight, a u.s. army flyer flown by orville wright nose-dived into the ground from a height of approximately 75 feet, killing lt. thomas e. selfridge who was a passenger. this was the first recorded airplane fatality in history. one of two propellers separated in flight, tearing loose the wires bracing the rudder and causing the loss of control of the aircraft. orville wright suffered broken ribs, pelvis and a leg. selfridge suffered a crushed skull and died a short time later.
    2
    first u.s. dirigible akron exploded just offshore at an altitude of 1,000 ft. during a test flight.
    3
    the first fatal airplane accident in canada occurred when american barnstormer, john m. bryant, california aviator was killed.
    4
    the airship flew into a thunderstorm and encountered a severe downdraft crashing 20 miles north of helgoland island into the sea. the ship broke in two and the control car immediately sank drowning its occupants.
    5
    hydrogen gas which was being vented was sucked into the forward engine and ignited causing the airship to explode and burn at 3,000 ft..
    6
    crashed into trees while attempting to land after being shot down by british and french aircraft.
    7
    exploded and burned near neuwerk island, when hydrogen gas, being vented, was ignited by lightning.
    8
    crashed near the black sea, cause unknown.
    9
    shot down by british aircraft crashing in flames.
    10
    shot down in flames by the british 39th home defence squadron.
    11
    crashed in a storm.
    12
    shot down by british anti-aircraft fire and aircraft and crashed into the north sea.
    13
    caught fire and crashed. 

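I want to create a column ac$sumnew based on ac$Summary.

I wrote the following code, but it does not return the desired output. I tried both & and |. With |, the results are inconsistent: sometimes right, sometimes wrong.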

    ac$sumnew = ifelse(grepl("missing & crashed",ac$Summary),"missing and crashed",
        ifelse(grepl("shot | crashed",ac$Summary),"shot down and crashed",
        ifelse(grepl("struck | lightening",ac$Summary),"struck by lightening and crashed",
         ifelse(grepl("struck | bird & crashed",ac$Summary),"struck by bird and crashed",
         ifelse(grepl("exploded | crashed",ac$Summary),"exploded and crashed",
         ifelse(grepl("engine | failure",ac$Summary),"engine failure",
         ifelse(grepl("fog | crashed",ac$Summary),"crashed due to heavy fog",
         ifelse(grepl("fire | crashed",ac$Summary),"caught fire and crashed",
         ifelse(grepl("shot",ac$Summary),"shot down",             
         ifelse(grepl("crashed",ac$Summary),"Crashed",
         ifelse(grepl("shot",ac$Summary),"Shot down",
         ifelse(grepl("disappeared",ac$Summary),"Disappeared",
         ifelse(grepl("struck | obstacle | crashed ",ac$Summary),"struck by obstacle and Crashed",
         ifelse(grepl("crashed",ac$Summary),"crashed",
         ifelse(grepl("exploded",ac$Summary),"exploded",
         ifelse(grepl("fire",ac$Summary),"caught fire","others"))))))))))))))))

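For example, if the aircraft was shot down, it should return "shot down".

If it simply crashed, the output should be "crashed".

If it was both missing and crashed, it should return "missing and crashed".

I still cannot get & and | to work correctly.

The output I get is shown below: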

1
others
2
exploded and crashed
3
others
4
others
5
engine failure
6
shot down and crashed
7
exploded and crashed
8
Crashed
9
shot down and crashed
10
shot down and crashed
11
Crashed
12
missing and crashed
13
missing and crashed
14
missing and crashed
15
Crashed
16
shot down and crashed
17
shot down and crashed

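1 Answer:

Answer 0: (score: 1)

I think you have a hierarchy problem. R tests these conditions in order, so you have to arrange them so that the more specific combinations come before the single words. Here is a helpful link: https://www.programiz.com/r-programming/if-else-statement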
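To see why the order matters, here is a minimal sketch with a made-up three-element vector `x` (not the asker's data). A nested `ifelse` chain returns the label of the first condition that is TRUE, so a broad test placed early hides the narrower tests that come after it.

```r
# Hypothetical toy data, for illustration only
x <- c("shot down and crashed", "crashed in a storm", "shot down in flames")

# Broad test first: "crash" wins before the combined test is ever reached
broad_first <- ifelse(grepl("crash", x), "crashed",
               ifelse(grepl("shot", x) & grepl("crash", x), "shot down and crashed",
               ifelse(grepl("shot", x), "shot down", "others")))

# Specific test first: the combined case is caught before the single-word cases
specific_first <- ifelse(grepl("shot", x) & grepl("crash", x), "shot down and crashed",
                  ifelse(grepl("shot", x), "shot down",
                  ifelse(grepl("crash", x), "crashed", "others")))

print(broad_first)
print(specific_first)
```

In the first chain the combined label can never be produced, because any string containing "crash" is already labelled by the first test.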

# Each test is TRUE only when ALL of the words in c() occur in the text.
# s$s is the text column used in this answer (ac$Summary in the question).
# Note: the sample summaries spell it "lightning", which the pattern "lightening" does not match.
ac$new <- ifelse(apply(sapply(c("struck","bird","crash"), grepl, as.character(s$s)), 1, all), "struck by bird and crashed",
          ifelse(apply(sapply(c("struck","obstacle","crash"), grepl, as.character(s$s)), 1, all), "struck by obstacle and Crashed",
          ifelse(apply(sapply(c("miss","crash"), grepl, as.character(s$s)), 1, all), "missing and crashed",
          ifelse(apply(sapply(c("shot","crash"), grepl, as.character(s$s)), 1, all), "shot down and crashed",
          ifelse(apply(sapply(c("struck","lightening"), grepl, as.character(s$s)), 1, all), "struck by lightening and crashed",
          ifelse(apply(sapply(c("explode","crash"), grepl, as.character(s$s)), 1, all), "exploded and crashed",
          # single pattern: matches when EITHER "engine" or "failure" is present
          ifelse(apply(sapply(c("engine|failure"), grepl, as.character(s$s)), 1, all), "engine failure",
          ifelse(apply(sapply(c("fog","crash"), grepl, as.character(s$s)), 1, all), "crashed due to heavy fog",
          ifelse(apply(sapply(c("fire","crash"), grepl, as.character(s$s)), 1, all), "caught fire and crashed",
          # single-word fallbacks come last
          ifelse(apply(sapply("shot", grepl, as.character(s$s)), 1, all), "shot down",
          ifelse(apply(sapply("crash", grepl, as.character(s$s)), 1, all), "crashed",
          ifelse(apply(sapply("explode", grepl, as.character(s$s)), 1, all), "exploded",
          ifelse(apply(sapply("fire", grepl, as.character(s$s)), 1, all), "caught fire",
          ifelse(apply(sapply("disappear", grepl, as.character(s$s)), 1, all), "Disappeared", "others"))))))))))))))

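This works by checking that every word in c() is present and, if so, assigning the corresponding label to ac$new. The exception is engine|failure, which is a single pattern and matches when either word is present. Also, because we are matching on words, use the shortest stem so that all variants are caught: for example, use "miss" rather than "missing".

I get: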
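The difference between "all words" and "either word" can be sketched as follows. Inside a pattern, `|` means alternation and spaces are literal characters, while `&` has no special meaning in a regular expression, so an "and" condition needs separate `grepl` calls combined with R's `&` operator. The vector `x` is made up for this example.

```r
# Hypothetical toy data, for illustration only
x <- c("shot down by british aircraft crashing in flames.",
       "crashed in a storm.",
       "shot down in flames.")

# "shot | crashed" matches "shot " (trailing space) OR " crashed" (leading space)
either_with_spaces <- grepl("shot | crashed", x)

# Without the spaces, | is a plain "either word" test on the stems
either <- grepl("shot|crash", x)

# "Both words" needs two grepl calls joined with the vectorised & operator
both <- grepl("shot", x) & grepl("crash", x)

print(either_with_spaces)
print(either)
print(both)
```

The second string is missed by the spaced pattern because "crashed" sits at the start of the string with no space in front of it, which is one way such patterns can look right on some rows and wrong on others.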

1                   others
2                 exploded
3                   others
4                  crashed
5           engine failure
6    shot down and crashed
7                 exploded
8                  crashed
9    shot down and crashed
10               shot down
11                 crashed
12   shot down and crashed
13 caught fire and crashed

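Some of the combinations above have no match, because I check that all of the words are present. I check all of the words because you already identify the single words in the second half of the ifelse chain. I did eyeball the results, and I think my tests based on checking all words are correct.

By the way, this is tedious, especially if you want to extend the list. You may want to use something like this: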
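library(dplyr)  # provides %>%, mutate, filter, group_by, summarise, full_join, arrange

# t is the character vector of summaries; word.que is the row index
ac <- data.frame(s = as.character(t), word.que = seq_along(t), stringsAsFactors = FALSE)

# Split each summary into words once, then count the words per summary
words <- strsplit(ac$s, split = " ")
ac$word.count <- lengths(words)

# Long format: one row per word, keyed by the index of its summary
new.mat <- data.frame(word.que = rep.int(ac$word.que, ac$word.count), word = unlist(words), stringsAsFactors = FALSE)
# A single regex; | means "any of these stems"
words.of.interest <- "struck|bird|crash|obstacle|miss|shot|lightening|explode|engine|failure|fog|fire|disappear"
new.mats <- new.mat %>%
           mutate(word = gsub("\\,", "", gsub("\\.", "", word))) %>%   # strip periods and commas
           mutate(word.interest = ifelse(grepl(words.of.interest, as.character(word)), 1, 0)) %>%
           filter(word.interest == 1) %>%
           group_by(word.que) %>%
           summarise(word.list = paste0(unique(word), collapse = "; ")) %>%
           full_join(ac, by = "word.que") %>%   # bring back summaries with no keyword
           arrange(word.que) %>%
           mutate(word.list = ifelse(is.na(word.list), 'other', word.list))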

This builds a more efficient search list for you. The result is

   word.que           word.list
1         1               other
2         2            exploded
3         3               other
4         4            crashing
5         5     engine; explode
6         6       crashed; shot
7         7            exploded
8         8             crashed
9         9      shot; crashing
10       10                shot
11       11             crashed
12       12 shot; fire; crashed
13       13       fire; crashed

along with your text variable and word.count. In the long run, this is probably more efficient.
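Another way to keep the rule list easy to extend might be to replace the nested chain with a rule table that is applied in priority order. This is only a sketch: the helper `classify`, the `rules` list, and the toy vector `x` are names made up here and are not part of the answer above.

```r
# Hypothetical rule table: label -> word stems that must ALL be present.
# Order matters: more specific rules come first.
rules <- list(
  "shot down and crashed"   = c("shot", "crash"),
  "caught fire and crashed" = c("fire", "crash"),
  "shot down"               = "shot",
  "crashed"                 = "crash",
  "exploded"                = "explode",
  "caught fire"             = "fire"
)

# Assign each string the label of the first rule whose stems all match
classify <- function(x, rules, default = "others") {
  out <- rep(default, length(x))
  unassigned <- rep(TRUE, length(x))
  for (label in names(rules)) {
    # TRUE where every stem of this rule occurs in the string
    hit <- Reduce(`&`, lapply(rules[[label]], grepl, x = x, fixed = TRUE))
    out[hit & unassigned] <- label
    unassigned <- unassigned & !hit
  }
  out
}

# Toy data, for illustration only
x <- c("shot down by british aircraft crashing in flames.",
       "caught fire and crashed.",
       "crashed in a storm.",
       "exploded and burned near neuwerk island.",
       "cause unknown.")
print(classify(x, rules))
```

Adding a new category then means adding one entry to `rules` at the right priority, rather than another level of nesting.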