I have a dataset with 2 columns, and I want to clean it up using gsub, for example:
Data_edited_txt2 <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", Data_edited_txt2$text)
Data_edited_txt2 <- gsub("@\\w+", " ", Data_edited_txt2$text)
Data_edited_txt2 <- gsub("[[:punct:]]", "", Data_edited_txt2$text)
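On the second gsub call I get the error "$ operator is invalid for atomic vectors", and I noticed that the second column disappears after the first gsub runs.
How can I run all of the gsub calls while keeping the second column?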
structure(list(text = structure(c(1L, 3L, 7L, 4L, 2L, 5L, 6L), .Label = c("@airasia im searching job",
"@AirAsia no flight warning for cebu outbound?", "@shazzr1 @AirAsia never mind.. now everyone can fly.",
"@TigerAir confirmed as having far nastier policies and uncaring customer service than @airasia who I will now fly every time in preference.",
"@Wingmates Since your taxes is HIGHER than other airlines but your service is really BAD because always change and cancel the flight.",
"hai MASwings @Wingmates . Bilakah tempoh promosi anda? Saya ingin terbang ke Palawan dengan bajet yang agak rendah :3",
"One thing I \"like\" about @AirAsia is, DELAY."), class = "factor"),
created = structure(c(3L, 2L, 1L, 7L, 6L, 4L, 5L), .Label = c("2/2/2014 11:30",
"2/2/2014 11:32", "2/2/2014 12:18", "24/2/2014 4:03", "29/3/2014 8:21",
"30/1/2014 16:02", "31/1/2014 8:13"), class = "factor")), .Names = c("text",
"created"), class = "data.frame", row.names = c(NA, -7L))
Answer 0 (score: 0)
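You are overwriting the entire data frame instead of just one column. gsub returns a character vector, so assigning its result to Data_edited_txt2 replaces the data frame with that vector; this is why the created column vanishes and the next Data_edited_txt2$text fails. Assign the result back to the text column instead: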
Data_edited_txt2$text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", Data_edited_txt2$text)
Data_edited_txt2$text <- gsub("@\\w+", " ", Data_edited_txt2$text)
Data_edited_txt2$text <- gsub("[[:punct:]]", "", Data_edited_txt2$text)
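To see the difference in isolation, here is a minimal sketch. The data frame df is a hypothetical two-row stand-in for Data_edited_txt2 (rows borrowed from the question's data), and the name overwritten is introduced only for illustration.

```r
# Hypothetical stand-in for Data_edited_txt2, using two rows from the question
df <- data.frame(
  text = c("@airasia im searching job",
           "@shazzr1 @AirAsia never mind.. now everyone can fly."),
  created = c("2/2/2014 12:18", "2/2/2014 11:32"),
  stringsAsFactors = FALSE
)

# What the original code did: the result of gsub() is a plain character
# vector, so assigning it to the data frame's name would discard every column
overwritten <- gsub("@\\w+", " ", df$text)
is.data.frame(overwritten)  # FALSE: it is an atomic character vector

# Assigning back into the column keeps the data frame and its other columns
df$text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", df$text)
df$text <- gsub("@\\w+", " ", df$text)
df$text <- gsub("[[:punct:]]", "", df$text)

str(df)  # still a data frame with both text and created
```

If text is a factor, as in the dput output in the question, gsub coerces it to character first, so the column is a character vector after the first call.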