Question

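I have a dataset with a "Notes" column that I'm trying to clean up using R. The notes look something like this:

Collected for 2 man-hours total. Cloudy, imminent storms.
Collected for 2 man-hours total. Rainy.
Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.
...and so on

I want to remove every sentence that starts with "Collected", but none of the sentences that follow it. The number of following sentences varies, anywhere from 0 to 4. I tried removing all combinations of "Collected" + (last word of the sentence), but there are too many combinations. Removing Collected + [.] removes all the sentences that follow as well. Does anyone have any suggestions? Thanks in advance.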

Answer 1

One option using gsub:

gsub("^Collected[^.]*\\. ","",df$Notes)

# [1] "Cloudy, imminent storms."
# [2] "Rainy."                  
# [3] "Sunny."

Regex explanation:

 - `^Collected`    : Starts with `Collected`
 - `[^.]*`         : Followed by anything other than `.`
 - `\\. `          : Ends with `.` and `space`.

Any such match is replaced with the empty string "".
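One caveat, since the question allows for zero following sentences: the pattern above requires a space after the period, so a note that consists of only the "Collected ..." sentence would be left untouched. Below is a minimal sketch of a variant that makes the trailing whitespace optional. The `notes` vector is made-up sample data for illustration, not taken from the original post.

```r
# Hypothetical sample data: includes a note with nothing after the
# "Collected ..." sentence and a note that does not start with "Collected".
notes <- c(
  "Collected for 2 man-hours total. Cloudy, imminent storms.",
  "Collected for 2 man-hours total.",
  "Sunny. No sampling today."
)

# \\s* makes the whitespace after the period optional, so a lone
# "Collected ..." sentence is removed as well.
cleaned <- sub("^Collected[^.]*\\.\\s*", "", notes)
print(cleaned)
```

Note that `[^.]*` stops at the first period it meets, so this approach assumes the "Collected" sentence contains no periods of its own (a value such as "1.5 man-hours" would cut the match short).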

Data:

df<-read.table(text=
"Notes
'Collected for 2 man-hours total. Cloudy, imminent storms.'
'Collected for 2 man-hours total. Rainy.'
'Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.'",
header = TRUE, stringsAsFactors = FALSE)

Answer 2

a = "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny."
sub("^ ","",sub("Collected.*?\\.","",a))

> [1] "Sunny."

Or, if you know the period will always be followed by a space:

 sub("Collected.*?\\. ","",a)
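The likely reason a pattern such as `Collected.*\\.` wipes out the following sentences too is that `.*` is greedy and runs to the last period in the string, whereas `.*?` stops at the first one. Here is a small sketch comparing the two. The sample string comes from the question's data; the variable names `greedy` and `lazy` are only for illustration.

```r
a <- "Collected for 2 man-hours total. Cloudy, imminent storms."

# Greedy: .* extends to the LAST period, so the whole string is removed.
greedy <- sub("Collected.*\\.", "", a)

# Lazy: .*? stops at the FIRST period, so only the first sentence is removed.
lazy <- sub("Collected.*?\\. ", "", a)

print(greedy)
print(lazy)
```

Since `sub` is vectorised, the same call should work on a whole column, e.g. `df$Notes <- sub("Collected.*?\\. ", "", df$Notes)`, assuming a data frame `df` with a `Notes` column like the one in Answer 1.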
