I have a text file that looks like this:
B.1 Blah blah blah
Random sentence.
B.2 Blah blah blah
Random sentence.
I would like to get this output:
B1 Blah blah blah
Random sentence.
B2 Blah blah blah
Random sentence.
I'm not sure how to remove only the period in B.1 and B.2. I don't want to remove any other periods. How can I do this? Thanks.
Answer 0 (score: 3)
lines <- c('B.1 Blah blah blah', 'Random sentence.', 'B.2 Blah blah blah', 'Random sentence.')
The first approach (the most literal) is to look for a period followed by a digit:
gsub("\\.([0-9])", "\\1", lines)
# [1] "B1 Blah blah blah" "Random sentence." "B2 Blah blah blah" "Random sentence."
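Since the question starts from a text file rather than a vector, here is a minimal sketch of the full round trip. The file names are hypothetical (tempfile() is used only to keep the example self-contained); readLines() and writeLines() are base R.

```r
# Hypothetical input and output files; replace with your own paths.
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c("B.1 Blah blah blah", "Random sentence.",
             "B.2 Blah blah blah", "Random sentence."), infile)

lines   <- readLines(infile)                 # one element per line of the file
cleaned <- gsub("\\.([0-9])", "\\1", lines)  # drop a period only when a digit follows
writeLines(cleaned, outfile)                 # write the cleaned lines back out

print(readLines(outfile))
```

Writing to a separate output file keeps the original intact in case the pattern removes more than intended.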
If what matters is that the period is not followed by whitespace (or the end of the line), then
gsub("\\.(\\S)", "\\1", lines)
# [1] "B1 Blah blah blah" "Random sentence." "B2 Blah blah blah" "Random sentence."
where \\S is the negation of whitespace, i.e. it matches any single non-whitespace character. (See ?regex for more information.)
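To make the difference between the two patterns concrete, here is a small comparison on a made-up vector (the strings in x are illustrative, not from the question):

```r
x <- c("B.1 Blah", "see file.txt", "End of line.")

digit_only <- gsub("\\.([0-9])", "\\1", x)  # period must be followed by a digit
non_space  <- gsub("\\.(\\S)", "\\1", x)    # period must be followed by any non-space

digit_only
# "B1 Blah"  "see file.txt"  "End of line."
non_space
# "B1 Blah"  "see filetxt"   "End of line."
```

In both cases the sentence-ending period survives, because nothing follows it for the capture group to match.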
The \\S version will of course fail if there are legitimate decimal numbers (and your locale uses a period as the decimal point):
lines <- c('B.1 Blah blah blah', 'Random sentence.', 'B.2 Blah blah blah', 'Random sentence.', 'pi is 3.14')
gsub("\\.(\\S)", "\\1", lines)
# [1] "B1 Blah blah blah" "Random sentence." "B2 Blah blah blah" "Random sentence." "pi is 314"
The fix for this is just a bit more regex: require that the character before the period is not a digit.
gsub("([^0-9])\\.(\\S)", "\\1\\2", lines)
# [1] "B1 Blah blah blah" "Random sentence." "B2 Blah blah blah" "Random sentence." "pi is 3.14"
Though this now fails to catch a leading dot, since there is no character before it to match [^0-9]:
lines <- c('B.1 Blah blah blah', 'Random sentence.', 'B.2 Blah blah blah', 'Random sentence.',
'pi is 3.14', '.leading dots are bad.')
gsub("([^0-9])\\.(\\S)", "\\1\\2", lines)
# [1] "B1 Blah blah blah" "Random sentence." "B2 Blah blah blah" "Random sentence."
# [5] "pi is 3.14" ".leading dots are bad."
So we have to make things a little more complex, also allowing the start of the line before the period:
gsub("(^|[^0-9])\\.(\\S)", "\\1\\2", lines)
# [1] "B1 Blah blah blah" "Random sentence." "B2 Blah blah blah" "Random sentence." "pi is 3.14"
# [6] "leading dots are bad."
For fear of XKCD/1171 (Perl Problems), this is about as complex as I want to get here.
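If the labels always sit at the very start of a line, as in the question's sample, a narrower pattern may be enough. This is only a sketch under that assumption: it is not part of the answer above, and unlike the last pattern it deliberately leaves leading dots and mid-line labels alone.

```r
lines <- c("B.1 Blah blah blah", "Random sentence.", "pi is 3.14",
           ".leading dots are bad.", "see B.3 later")

# ^           start of the line
# ([A-Za-z]+) the label's letters (group 1)
# \\.         the period to drop
# ([0-9])     the first digit after it (group 2)
anchored <- sub("^([A-Za-z]+)\\.([0-9])", "\\1\\2", lines)

anchored
# "B1 Blah blah blah"  "Random sentence."  "pi is 3.14"
# ".leading dots are bad."  "see B.3 later"
```

sub() is sufficient here because an anchored pattern can match at most once per line.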
Answer 1 (score: 1)
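Although @r2evans has covered almost everything, I'd still consider adding one more option, which checks that the . comes after letters and is followed by digits, and only then removes the ..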
#Data
lines <- c("B.1 Blah blah blah", "Random sentence.",
"B.2 Blah blah blah", "Random sentence.")
gsub("(.*[[:alpha:]]+)[.]([[:digit:]]+.*)","\\1\\2",lines)
#[1] "B1 Blah blah blah" "Random sentence." "B2 Blah blah blah" "Random sentence."
Regex explanation
(.*[[:alpha:]]+) : Group 1 (placeholder): anything, ending in at least one letter
[.] : a literal .
([[:digit:]]+.*) : Group 2: at least one digit, then anything that follows it
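One thing to keep in mind: because the pattern above starts and ends with .*, a single match consumes the whole line, so at most one period is removed per line. If a line may contain several labels, a variant without the surrounding .* might be preferable. The following is a sketch with made-up input, not part of the original answer; the second form uses lookarounds and therefore needs perl = TRUE.

```r
x <- c("B.1 Blah blah blah", "A.1 and B.2", "pi is 3.14", "Random sentence.")

# Capture just the letter before and the digit after the period.
by_groups <- gsub("([[:alpha:]])[.]([[:digit:]])", "\\1\\2", x)

# Same idea with lookbehind/lookahead: only the period itself is matched.
by_lookaround <- gsub("(?<=[[:alpha:]])[.](?=[[:digit:]])", "", x, perl = TRUE)

by_groups
# "B1 Blah blah blah"  "A1 and B2"  "pi is 3.14"  "Random sentence."
identical(by_groups, by_lookaround)
# TRUE
```

Both variants leave 3.14 untouched, since the character before that period is a digit rather than a letter.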