Running gsub on a data frame with 2 columns

Asked: 2014-04-16 10:32:20

Tags: regex r string gsub text-extraction

I have a data set with 2 columns, and I want to clean it using gsub, for example:

Data_edited_txt2 <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", Data_edited_txt2$text)
Data_edited_txt2 <- gsub("@\\w+", " ", Data_edited_txt2$text)
Data_edited_txt2 <- gsub("[[:punct:]]", "", Data_edited_txt2$text) 

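The second gsub call fails with the error "$ operator is invalid for atomic vectors", and I noticed that the second column disappears after the first gsub runs.

How can I run all of the gsub calls and still keep the second column?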

structure(list(text = structure(c(1L, 3L, 7L, 4L, 2L, 5L, 6L), .Label = c("@airasia im searching job", 
"@AirAsia no flight warning for cebu outbound?", "@shazzr1 @AirAsia never mind.. now everyone can fly.", 
"@TigerAir confirmed as having far nastier policies and uncaring customer service than @airasia who I will now fly every time in preference.", 
"@Wingmates Since your taxes is HIGHER than other airlines but your service is really BAD because always change and cancel the flight.", 
"hai MASwings @Wingmates . Bilakah tempoh promosi anda? Saya ingin terbang ke Palawan dengan bajet yang agak rendah :3", 
"One thing I \"like\" about @AirAsia is, DELAY."), class = "factor"), 
created = structure(c(3L, 2L, 1L, 7L, 6L, 4L, 5L), .Label = c("2/2/2014 11:30", 
"2/2/2014 11:32", "2/2/2014 12:18", "24/2/2014 4:03", "29/3/2014 8:21", 
"30/1/2014 16:02", "31/1/2014 8:13"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -7L))

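1 answer:

Answer 0 (score: 0)

You are overwriting the entire data frame instead of just one column. gsub returns a character vector, so after the first call Data_edited_txt2 is no longer a data frame and `$` stops working on it. Assign the result back to the text column instead: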

Data_edited_txt2$text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", Data_edited_txt2$text)
Data_edited_txt2$text <- gsub("@\\w+", " ", Data_edited_txt2$text)
Data_edited_txt2$text <- gsub("[[:punct:]]", "", Data_edited_txt2$text)
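
To see why the original code lost the second column, it may help to check what gsub returns. The following is a minimal sketch on a made-up two-row data frame; `df` and `cleaned` are hypothetical names used only for illustration, and the rows are not the asker's data.

```r
# Hypothetical sample data, used only to illustrate the behaviour
df <- data.frame(
  text = c("RT @someone: hello, world!", "@AirAsia no flight warning?"),
  created = c("2/2/2014 11:30", "2/2/2014 11:32"),
  stringsAsFactors = FALSE
)

# gsub() returns a plain character vector, not a data frame,
# so assigning its result to the data frame's name discards every column
cleaned <- gsub("@\\w+", " ", df$text)
is.data.frame(cleaned)  # FALSE

# Assigning to df$text replaces only that column and leaves the rest intact
df$text <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", df$text)
df$text <- gsub("@\\w+", " ", df$text)
df$text <- gsub("[[:punct:]]", "", df$text)
names(df)  # "text" "created"
```

Note that if the text column is a factor, as in the dput output above, gsub coerces it to character, so the column is a character vector after the assignment.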