Remove punctuation from certain parts of a text file

Time: 2018-07-10 19:16:37

Tags: r

I have a text file that looks like this:

B.1 Blah blah blah
Random sentence.
B.2 Blah blah blah
Random sentence.

I would like to get this output:

B1 Blah blah blah
Random sentence.
B2 Blah blah blah
Random sentence.

I'm not sure how to remove only the periods in B.1 and B.2. I don't want to remove any other periods. How can I do this? Thanks.

2 answers:

Answer 0 (score: 3)

lines <- c('B.1 Blah blah blah', 'Random sentence.', 'B.2 Blah blah blah', 'Random sentence.')

The first (and most literal) approach is to look for a period that is followed by a digit, and keep only the digit:

gsub("\\.([0-9])", "\\1", lines)
# [1] "B1 Blah blah blah" "Random sentence."  "B2 Blah blah blah" "Random sentence." 

If what matters is that the period is not followed by whitespace (or the end of the line), then:

gsub("\\.(\\S)", "\\1", lines)
# [1] "B1 Blah blah blah" "Random sentence."  "B2 Blah blah blah" "Random sentence." 

where \\S is the negation of whitespace, i.e. it matches any non-whitespace character. (See ?regex for more information.)
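To make the difference between the two patterns concrete, here is a small sketch. The input vector `x` is made up for illustration and does not come from the question.

```r
# Hypothetical input: a dot before a letter, a dot before a digit,
# and a sentence-ending dot that is followed by a space.
x <- c("v.a Blah", "B.1 Blah", "End. Next")

# Digit-only pattern: removes a dot only when a digit follows it.
digit_only <- gsub("\\.([0-9])", "\\1", x)

# Non-whitespace pattern: removes a dot when any non-space character follows it.
non_space <- gsub("\\.(\\S)", "\\1", x)

print(digit_only)
print(non_space)
```

Only the second pattern touches "v.a", and neither touches the dot in "End. Next", because that dot is followed by a space.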

This will of course fail if the text contains legitimate decimal numbers (and your locale uses a period as the decimal point):

lines <- c('B.1 Blah blah blah', 'Random sentence.', 'B.2 Blah blah blah', 'Random sentence.', 'pi is 3.14')
gsub("\\.(\\S)", "\\1", lines)
# [1] "B1 Blah blah blah" "Random sentence."  "B2 Blah blah blah" "Random sentence."  "pi is 314"        

The fix is just a bit more regex, requiring that the character before the period is not a digit:

gsub("([^0-9])\\.(\\S)", "\\1\\2", lines)
# [1] "B1 Blah blah blah" "Random sentence."  "B2 Blah blah blah" "Random sentence."  "pi is 3.14"       
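One detail worth spelling out: `[^0-9]` consumes the character in front of the period, so that character has to be captured and written back as `\\1`. A minimal sketch of what happens when it is left out of the replacement (the variable names are only for illustration):

```r
x <- "B.1 Blah blah blah"

# Writing back both groups keeps the character in front of the dot.
kept <- gsub("([^0-9])\\.(\\S)", "\\1\\2", x)

# Writing back only group 2 also drops the "B" that group 1 matched.
dropped <- gsub("([^0-9])\\.(\\S)", "\\2", x)

print(kept)
print(dropped)
```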

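However, this no longer catches leading dots, because there is no character before the period for [^0-9] to match: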

lines <- c('B.1 Blah blah blah', 'Random sentence.', 'B.2 Blah blah blah', 'Random sentence.',
           'pi is 3.14', '.leading dots are bad.')
gsub("([^0-9])\\.(\\S)", "\\1\\2", lines)
# [1] "B1 Blah blah blah"      "Random sentence."       "B2 Blah blah blah"      "Random sentence."      
# [5] "pi is 3.14"             ".leading dots are bad."

So we have to make things a little more complicated and also allow the start of the line before the period:

gsub("(^|[^0-9])\\.(\\S)", "\\1\\2", lines)
# [1] "B1 Blah blah blah"     "Random sentence."      "B2 Blah blah blah"     "Random sentence."      "pi is 3.14"           
# [6] "leading dots are bad."
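As a hedged alternative, the same rule can probably be written with lookarounds, which check the neighbouring characters without consuming them, so nothing has to be captured and written back. This is a sketch rather than part of the original answer, and it is not guaranteed to behave identically on every input (runs of consecutive dots, for example, are treated differently).

```r
lines <- c("B.1 Blah blah blah", "Random sentence.", "pi is 3.14",
           ".leading dots are bad.")

# (?<![0-9]) : the dot is not preceded by a digit (also true at the line start)
# (?=\\S)    : the dot is followed by a non-whitespace character
# Lookarounds need the PCRE engine, hence perl = TRUE.
res <- gsub("(?<![0-9])\\.(?=\\S)", "", lines, perl = TRUE)

print(res)
```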

For fear of XKCD/1171 ("Perl Problems"), this is about as complex as I'd like to get here.
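Since the question is about a text file rather than a character vector, here is a minimal sketch of the full round trip using `readLines()` and `writeLines()`. The temporary files `infile` and `outfile` are assumptions that stand in for the real input and output paths.

```r
# Temporary files stand in for the real input and output paths (assumptions).
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c("B.1 Blah blah blah", "Random sentence.",
             "B.2 Blah blah blah", "Random sentence."), infile)

# Read the file line by line, apply the final pattern, and write the result.
txt   <- readLines(infile)
fixed <- gsub("(^|[^0-9])\\.(\\S)", "\\1\\2", txt)
writeLines(fixed, outfile)

print(readLines(outfile))
```

Writing to a separate output file keeps the original untouched, which makes it easy to compare the two before overwriting anything.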

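Answer 1 (score: 1)

Although @r2evans has covered almost every aspect, I still thought I'd add an option that checks whether the . comes after alphabetic characters and is followed by digits, and removes the . only in that case.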

# Data
lines <- c("B.1 Blah blah blah", "Random sentence.",
           "B.2 Blah blah blah", "Random sentence.")

gsub("(.*[[:alpha:]]+)[.]([[:digit:]]+.*)", "\\1\\2", lines)
# [1] "B1 Blah blah blah" "Random sentence."  "B2 Blah blah blah" "Random sentence."

Regex explanation

(.*[[:alpha:]]+)   : Group 1. Anything, ending in one or more alphabetic characters
[.]                : A literal period
([[:digit:]]+.*)   : Group 2. At least one digit, then anything that follows it
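One caveat that seems worth noting: because the pattern starts with a greedy `.*`, a single match spans the whole line, so only the last letter-dot-digit on a line is changed. The sketch below uses a made-up input with two such labels on one line, and compares the pattern with a variant (not from the answer) that drops the surrounding `.*`.

```r
# Hypothetical line with two label-style dots.
x <- "A.1 and B.2"

# Pattern from the answer: the greedy leading .* swallows the first label,
# so only the last dot on the line is removed.
greedy <- gsub("(.*[[:alpha:]]+)[.]([[:digit:]]+.*)", "\\1\\2", x)

# Variant without the surrounding .* : each letter-dot-digit is matched on its own.
local_match <- gsub("([[:alpha:]])[.]([[:digit:]])", "\\1\\2", x)

print(greedy)
print(local_match)
```

For the data in the question, where each line has at most one such label, both patterns give the same result.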