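I have a data frame, 'df'. The data frame is very large. The data are very messy: they contain typos, follow no consistent pattern, and so on. See the example: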
structure(list(ABC = structure(c(1L, 3L, 4L, 6L, 8L, 9L, 5L,
11L, 2L, 7L, 10L), .Label = c("2-8-2010 14:42:00 (number not ok)",
"2-8-2010 18:42:00 (nuber is not oke)", "2-8-2010 18:42:00 (number is not ok)",
"2-9-2010 14:47:00 (? Not ok )", "23:59 missing &^%", "26-9-2010 23.24",
"26-9-2010 23.24 not (working)", "26-9-2010 23.28 note: shutdown number!)",
"26-9-2010 23.29 (missing brackets", "Im oke and working\n",
"number"), class = "factor")), .Names = "ABC", row.names = c(NA,
-11L), class = "data.frame")
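Question: how can I recode a string variable based on matches against target strings?
In my case, how do I recode the variable 'ABC' so that, when a string matches the phrases "not working" or "number is not ok", a new variable XYZ is labelled 'present', and so on? My goal is: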
structure(list(ABC = structure(c(2L, 4L, 5L, 7L, 9L, 10L, 6L,
1L, 12L, 3L, 8L, 11L), .Label = c("", "2-8-2010 14:42:00 (number not ok)",
"2-8-2010 18:42:00 (nuber is not oke)", "2-8-2010 18:42:00 (number is not ok)",
"2-9-2010 14:47:00 (? Not ok )", "23:59 missing &^%", "26-9-2010 23.24",
"26-9-2010 23.24 not (working)", "26-9-2010 23.28 note: shutdown number!)",
"26-9-2010 23.29 (missing brackets", "Im oke and working\tabsent\n",
"number"), class = "factor"), XYZ = structure(list(XYZ = structure(c(3L,
3L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 3L, 1L), .Label = c("absent",
"missing", "present"), class = "factor")), .Names = "XYZ", class = "data.frame", row.names = c(NA,
-12L))), .Names = c("ABC", "XYZ"), row.names = c(NA, -12L), class = "data.frame")
I know there are some similar-looking examples on Stack, but I could not get them to work. I hope someone can point me in the right direction.
Thanks
Answer 0 (score: 1)
> df$XYZ <- ifelse(grepl("not.*working|number.*[is]?.*not.*ok", df$ABC, ignore.case = TRUE), "present", "absent")
> df
    ABC                                      XYZ
 1  2-8-2010 14:42:00 (number not ok)        present
 2  2-8-2010 18:42:00 (number is not ok)     present
 3  2-9-2010 14:47:00 (? Not ok )            absent
 4  26-9-2010 23.24                          absent
 5  26-9-2010 23.28 note: shutdown number!)  absent
 6  26-9-2010 23.29 (missing brackets        absent
 7  23:59 missing &^%                        absent
 8  number                                   absent
 9  2-8-2010 18:42:00 (nuber is not oke)     absent
10  26-9-2010 23.24 not (working)            present
11  Im oke and working\n                     absent
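The goal stated in the question also has a third label, 'missing', for empty entries, which the two-way ifelse above does not produce. A minimal sketch of one way to add it, assuming that "missing" means an empty, whitespace-only, or NA string (that reading is an assumption, as is the small sample data frame used here):

```r
# Hypothetical sample data, modelled on the question's 'ABC' column
df <- data.frame(
  ABC = c("2-8-2010 14:42:00 (number not ok)",
          "26-9-2010 23.24 not (working)",
          "",
          "26-9-2010 23.24"),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer above
pattern <- "not.*working|number.*[is]?.*not.*ok"
hit <- grepl(pattern, df$ABC, ignore.case = TRUE)

# Assumption: empty, whitespace-only, or NA entries count as "missing"
empty <- is.na(df$ABC) | trimws(df$ABC) == ""

# Check for missing first, then fall back to present/absent
df$XYZ <- ifelse(empty, "missing", ifelse(hit, "present", "absent"))
df$XYZ <- factor(df$XYZ, levels = c("absent", "missing", "present"))
```

Testing for empty entries before testing the pattern keeps the three labels mutually exclusive.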
Answer 1 (score: 0)
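A different solution without grep. You can add as many clauses as you need.
regexpr('string_to_look_for', variable) returns the position of the first match in the string, or -1 if there is no match. So if it evaluates to a value greater than zero, the string was found.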
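A quick way to see what regexpr returns, as a small sketch on two made-up strings (the vector x is hypothetical and not part of the original data):

```r
# Hypothetical input: one string that contains "not", one that does not
x <- c("number is not ok", "all fine")

# regexpr gives the start position of the first match, or -1 for no match
pos <- as.integer(regexpr("not", x))

# Positions start at 1, so "greater than zero" means "found"
found <- pos > 0
```

Here pos holds 11 for the first string, where "not" starts at the 11th character, and -1 for the second.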
df$XYZ <- ifelse(regexpr('number is not ok',df$ABC)>0 |
regexpr('not working',df$ABC)>0 |
regexpr('not',df$ABC)>0,"present","absent")
    ABC                                      XYZ
 1  2-8-2010 14:42:00 (number not ok)        present
 2  2-8-2010 18:42:00 (number is not ok)     present
 3  2-9-2010 14:47:00 (? Not ok )            absent
 4  26-9-2010 23.24                          absent
 5  26-9-2010 23.28 note: shutdown number!)  present
 6  26-9-2010 23.29 (missing brackets        absent
 7  23:59 missing &^%                        absent
 8  number                                   absent
 9  2-8-2010 18:42:00 (nuber is not oke)     present
10  26-9-2010 23.24 not (working)            present
11  Im oke and working\n                     absent
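Note that the last clause, which looks for "not", actually matched inside "note". If you know exactly which strings you are looking for, you can hard-code them. @mlegge's code is more elegant, but harder to understand if you are a noob like me.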
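One possible way to avoid the accidental match inside "note" is to anchor the pattern with word boundaries. This is only a sketch of that idea, not part of either answer's code; the vector x is a hypothetical sample taken from the question's strings, and ignore.case = TRUE is an extra choice so that "Not" is caught as well:

```r
# Hypothetical sample strings, taken from the question's 'ABC' column
x <- c("26-9-2010 23.28 note: shutdown number!)",
       "26-9-2010 23.24 not (working)",
       "2-9-2010 14:47:00 (? Not ok )")

# Plain substring search as in the answer: "not" is also found inside "note"
loose <- as.integer(regexpr("not", x)) > 0

# \\b is a word boundary, so only the standalone word "not" matches;
# perl = TRUE selects PCRE, ignore.case = TRUE also accepts "Not"
strict <- grepl("\\bnot\\b", x, ignore.case = TRUE, perl = TRUE)

data.frame(x, loose, strict)
```

With the boundaries in place, the "note: shutdown" row is no longer flagged, while the rows containing the word "not" or "Not" still are.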