I have a list of foods for which I need to create an overall category column. An example of my food sources is below:
FruitSources <- c("Apple Juice", "Apple Puree", "Apple Pieces", "Orange Juice", "Orange Pieces", "Banana Smoothie", "Banana Pieces",
"Apple & Blackcurrant Juice", "Mango & Banana Smoothie", "Watermelon, Apple & Orange Juice")
I want to create this category from only the first word of each entry in FruitSources, not from the whole line. For example, my expected output is:
Categories <- c("Apple", "Apple", "Apple", "Orange", "Orange", "Banana", "Banana", "Apple", "Other", "Other")
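Although the & symbol in some entries could arguably lead to Other, I would prefer a solution that uses only the first word. In the example above, any fruit other than Apple, Orange and Banana should produce "Other". A crude approach would be: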
Output <- ifelse(FruitSources=='Apple', 'Apple',
ifelse(FruitSources=='Banana', 'Banana',
ifelse(FruitSources=='Orange', 'Orange', 'Other')))
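However, the above does not check only the first word; == compares each whole string for exact equality, so nothing matches. This results in: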
Output
[1] "Other" "Other" "Other" "Other" "Other" "Other" "Other" "Other" "Other" "Other"
I have used nested ifelse statements before, but can they be combined with grep to accomplish the above?
Answer 0: (score: 3)
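Assuming that every string containing & or , should become "Other" and every other string should give its first word, use grepl to build a logical vector that flags & and , and then combine it with ifelse and word (from stringr): return the first word when there is no & or , and "Other" otherwise.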
library(stringr)
ifelse(grepl("[&,]", FruitSources), "Other", word(FruitSources, 1))
#[1] "Apple" "Apple" "Apple" "Orange" "Orange" "Banana"
#[7] "Banana" "Other" "Other" "Other"
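To make the two pieces that ifelse combines easier to see, here is a minimal base R sketch of the same idea. The names is_mixed and first_word are illustrative only, and sub(" .*$", "", x) is used as an assumed stand-in for stringr's word(x, 1) so that the snippet needs no packages.

```r
FruitSources <- c("Apple Juice", "Apple Puree", "Apple Pieces", "Orange Juice",
                  "Orange Pieces", "Banana Smoothie", "Banana Pieces",
                  "Apple & Blackcurrant Juice", "Mango & Banana Smoothie",
                  "Watermelon, Apple & Orange Juice")

# TRUE where the entry contains "&" or ",", i.e. names more than one fruit
is_mixed <- grepl("[&,]", FruitSources)

# Base R stand-in (assumption) for word(FruitSources, 1):
# drop everything from the first space onward
first_word <- sub(" .*$", "", FruitSources)

# Mixed entries become "Other"; all others keep their first word
result <- ifelse(is_mixed, "Other", first_word)
result
```

Because ifelse is vectorised, the logical vector selects element by element between "Other" and the first word.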
If the rule is instead based on a single fruit versus multiple fruits, then one option is to use str_count to generate the logical index:
ifelse(str_count(FruitSources, "\\b(Apple|Orange|Banana|Mango|Blackcurrant)\\b")==1,
word(FruitSources, 1), "Other")
#[1] "Apple" "Apple" "Apple" "Orange" "Orange" "Banana"
#[7] "Banana" "Other" "Other" "Other"
If the rule is based on the first word being 'Apple', 'Orange' or 'Banana':
ifelse(grepl("^(Apple|Orange|Banana)", FruitSources), word(FruitSources, 1), "Other")
#[1] "Apple" "Apple" "Apple" "Orange" "Orange" "Banana"
#[7] "Banana" "Apple" "Other" "Other"
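A possible base R variant of the first-word rule, sketched here as an alternative rather than as part of the answer above, is to extract the first word and test it with %in%. The names keep and first_word are hypothetical. Unlike the prefix match with ^(Apple|Orange|Banana), this compares the whole first word, so a hypothetical entry such as "Applesauce Cup" would fall into "Other".

```r
FruitSources <- c("Apple Juice", "Apple Puree", "Apple Pieces", "Orange Juice",
                  "Orange Pieces", "Banana Smoothie", "Banana Pieces",
                  "Apple & Blackcurrant Juice", "Mango & Banana Smoothie",
                  "Watermelon, Apple & Orange Juice")

# Categories to keep; everything else is mapped to "Other"
keep <- c("Apple", "Orange", "Banana")

# First word of each entry: drop everything from the first space onward
first_word <- sub(" .*$", "", FruitSources)

# Keep the first word only when it is exactly one of the wanted categories
Categories <- ifelse(first_word %in% keep, first_word, "Other")
Categories
```

Adding a new category then only requires extending keep, with no further nesting of ifelse calls.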
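Answer 1: (score: 0)
Here is a solution using regular expressions in base R.
It works in two steps. First, extract the keyword when it appears at the start of the string, and replace every other string with an empty string.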
tmp <- sub("^(?:(Apple|Orange|Banana)|.?).*", "\\1", FruitSources)
# [1] "Apple" "Apple" "Apple" "Orange" "Orange" "Banana" "Banana" "Apple" "" ""
Second, replace the empty strings with "Other".
sub("^$", "Other", tmp)
# [1] "Apple" "Apple" "Apple" "Orange" "Orange" "Banana" "Banana" "Apple" "Other" "Other"
In one line:
sub("^$", "Other", sub("^(?:(Apple|Orange|Banana)|.?).*", "\\1", FruitSources))
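If the keyword list is likely to change, the two steps could be wrapped in a small helper. This is only a sketch: the function name categorise_first_word and its arguments are hypothetical, and it assumes the keywords contain no regex metacharacters, since they are pasted directly into the pattern.

```r
# Hypothetical helper wrapping the two sub() steps shown above
categorise_first_word <- function(x, keywords, other = "Other") {
  # Build the same pattern as above, e.g. "^(?:(Apple|Orange|Banana)|.?).*"
  alternation <- paste(keywords, collapse = "|")
  pattern <- paste0("^(?:(", alternation, ")|.?).*")
  # Step 1: keep a leading keyword, turn everything else into ""
  tmp <- sub(pattern, "\\1", x)
  # Step 2: replace the empty strings with the fallback label
  sub("^$", other, tmp)
}

FruitSources <- c("Apple Juice", "Apple Puree", "Apple Pieces", "Orange Juice",
                  "Orange Pieces", "Banana Smoothie", "Banana Pieces",
                  "Apple & Blackcurrant Juice", "Mango & Banana Smoothie",
                  "Watermelon, Apple & Orange Juice")

categorise_first_word(FruitSources, c("Apple", "Orange", "Banana"))
```

With the three keywords from the question, the generated pattern is identical to the one used above, so the result should match the output shown there.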