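I have a dataset with a "Notes" column that I am trying to clean up using R. The notes look like this:
I want to remove every sentence that begins with "Collected", but not the sentences that follow it. The number of following sentences varies, anywhere from 0 to 4. I tried removing every combination of "Collected" + (the last word of the sentence), but there are too many combinations. Removing "Collected" + [.] deletes all of the subsequent sentences as well. Does anyone have any suggestions? Thanks in advance.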
Answer 0 (score: 5)
One option using `gsub` is:
gsub("^Collected[^.]*\\. ","",df$Notes)
# [1] "Cloudy, imminent storms."
# [2] "Rainy."
# [3] "Sunny."
Regex explanation:
- `^Collected` : the string starts with `Collected`
- `[^.]*` : followed by any run of characters other than `.`
- `\\. ` : ending with a literal `.` and a space
Any such match is replaced with "".
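To check what the pattern actually consumes, it can help to extract the match rather than delete it. The following is a minimal sketch using base R's `regexpr` and `regmatches`; the vector name `notes` and the variable `pattern` are illustrative and not part of the original answer.

```r
# Two of the sample notes from the question's data
notes <- c(
  "Collected for 2 man-hours total. Cloudy, imminent storms.",
  "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny."
)

# Same pattern as in the answer: leading "Collected", any non-period
# characters, then a literal period and a single space
pattern <- "^Collected[^.]*\\. "

# regexpr() locates the first match in each string;
# regmatches() returns the matched text itself
matched <- regmatches(notes, regexpr(pattern, notes))
print(matched)
# [1] "Collected for 2 man-hours total. "
# [2] "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. "
```

Only the first sentence is matched in each note, so replacing it with "" leaves the remaining sentences untouched.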
Data:
df<-read.table(text=
"Notes
'Collected for 2 man-hours total. Cloudy, imminent storms.'
'Collected for 2 man-hours total. Rainy.'
'Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.'",
header = TRUE, stringsAsFactors = FALSE)
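The question mentions that a note may have zero sentences after the "Collected" one. In that case there is no space after the period, so a pattern ending in `\\. ` would not match. A possible adjustment, sketched below, is to make the trailing whitespace optional with `\\s*` and to write the result back to the column. The second row here is a hypothetical note added for illustration; it is not in the original data.

```r
# Small data frame; the second note is a made-up case with no following sentence
df <- data.frame(
  Notes = c(
    "Collected for 2 man-hours total. Cloudy, imminent storms.",
    "Collected for 2 man-hours total.",
    "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny."
  ),
  stringsAsFactors = FALSE
)

# \\s* matches zero or more whitespace characters, so the pattern also
# matches when the "Collected" sentence is the whole note.
# sub() is enough because ^ anchors the pattern to one match per string.
df$Notes <- sub("^Collected[^.]*\\.\\s*", "", df$Notes)
print(df$Notes)
# [1] "Cloudy, imminent storms." ""                         "Sunny."
```

Notes that consisted only of the "Collected" sentence become empty strings, which can be converted to `NA` afterwards if that suits the analysis better.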
Answer 1 (score: 4)
a = "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny."
sub("^ ","",sub("Collected.*?\\.","",a))
> [1] "Sunny."
Alternatively, if you know the period will always be followed by a space:
sub("Collected.*?\\. ","",a)
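Both patterns above stop at the first period, so a note such as "Collected for 1.5 man-hours total." would be cut at the decimal point. One possible refinement, offered as a sketch rather than a complete solution, is to require that the closing period be followed by whitespace or the end of the string. The first input string is a hypothetical example, and `perl = TRUE` is an assumption made here so that the lazy quantifier and the alternation are handled by the PCRE engine.

```r
# First note is a made-up example containing a decimal; second is from the question
a <- c(
  "Collected for 1.5 man-hours total. Rainy.",
  "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny."
)

# ^Collected   : anchor to the start so "Collected" elsewhere is ignored
# .*?          : lazily consume as little as possible
# \\.(\\s+|$)  : stop at a period followed by whitespace or end of string
cleaned <- sub("^Collected.*?\\.(\\s+|$)", "", a, perl = TRUE)
print(cleaned)
# [1] "Rainy." "Sunny."
```

This still treats an abbreviation followed by a space (for example "approx. 2") as a sentence end, so it is worth spot-checking the results on real notes.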