在正则表达式模式之前和之后的正则表达式搜索之后如何添加换行符?

时间:2017-11-10 20:06:33

标签: macos unix sed terminal

我有一个包含一些DNA序列的文本文件。它在一条线上,但我想将它分成多行。

>JH739887TTTACAATGTATAATAGAAACTAAAACTGAAATGTTAATCTTGAAATTTAAGAATCTTCAAAAATGTTTAAGTGGTGATAATCTCCCCAGTGTGAGAAACACACTTGGAAGGAAGTCACAAGTCAAATTTAGATTTGTTGCTTAATAATGGATTTGTAAGTATTATCAAATACTCAAGCACtaaggaaacaggaaaatctgaaatgttCACTTGCTTCTAAACATTTGCAGCCGAGTCCAACTTACACAGGGTAAGATGAGTTTTACAGACAGACACTATTTGTTATTAGGTCAGCTACAGTAAGTGAAAAAACTCACCTCTTTAAGTCTGATAAAGTAGCAGAAagtcatattttaaatatcagtaTAAACAAATGCTCTAAGTTTGGAAATGTTAATCTTGAAAGAACCTTCAAAAACATTTAAGTGCTGGTTATCTCCCCAGTGTGT>JH739882TTTACAATGTATAATAGAAACTAAAACTGAAATGTTAATCTTGAAATTTAAGAATCTTCAAAAATGTTTAAGTGGTGATAATCTCCCCAGTGTGAGAAACACACTTGGAAGGAAGTCACAAGTCAAATTTAGATTTGTTGCTTAATAATGGATTTGTAAGTATTATCAAATACTCAAGCACtaaggaaacaggaaaatctgaaatgttCACTTGCTTCTAAACATTTGCAGCCGAGTCCAACTTACACAGGGTAAGATGAGTTTTACAGACAGACACTATTTGTTATTAGGTCAGCTACAGTAAGTGAAAAAACTCACCTCTTTAAGTCTGATAAAGTAGCAGAAagtcatattttaaatatcagtaTAAACAAATGCTCTAAGTTTGGAAATGTTAATCTTGAAAGAACCTTCAAAAACATTTAAGTGCTGGTTATCTCCCCAGTGTGT

我能够查看我想要分隔文件的地方:

grep '>[A-Z]\{2\}[0-9]\{6\}' ~/Desktop/text2.txt

正则表达式"> [A-Z] {2} [0-9] {6}"寻找模式">JH######

但是每当我使用sed命令在正则表达式搜索之前和之后添加一行返回时,它都不起作用:

sed  '/>[A-Z]\{2\}[0-9]\{6\}/a/b\ 
\n' ~/Desktop/text2.txt

这是我的错误:

sed: 1: "/>[A-Z]\{2\}[0-9]\{6\}/ ...": command a expects \ followed by text

以下命令正在运行,但未给出预期结果:

sed  '/>[A-Z]\{2\}[0-9]\{6\}/a\ 
\n' ~/Desktop/text2.txt

这是我期待的结果(第一行不应该在它之前返回,但对于其余的匹配,它们应该在前后返回一行,行返回¬为清楚起见,此处包含在内:

>JH739887¬
TTTACAATGTATAATAGAAACTAAAACTGAAATGTTAATCTTGAAATTTAAGAATCTTCAAAAATGTTTAAGTGGTGATAATCTCCCCAGTGTGAGAAACACACTTGGAAGGAAGTCACAAGTCAAATTTAGATTTGTTGCTTAATAATGGATTTGTAAGTATTATCAAATACTCAAGCACtaaggaaacaggaaaatctgaaatgttCACTTGCTTCTAAACATTTGCAGCCGAGTCCAACTTACACAGGGTAAGATGAGTTTTACAGACAGACACTATTTGTTATTAGGTCAGCTACAGTAAGTGAAAAAACTCACCTCTTTAAGTCTGATAAAGTAGCAGAAagtcatattttaaatatcagtaTAAACAAATGCTCTAAGTTTGGAAATGTTAATCTTGAAAGAACCTTCAAAAACATTTAAGTGCTGGTTATCTCCCCAGTGTGT¬
>JH739882¬
TTTACAATGTATAATAGAAACTAAAACTGAAATGTTAATCTTGAAATTTAAGAATCTTCAAAAATGTTTAAGTGGTGATAATCTCCCCAGTGTGAGAAACACACTTGGAAGGAAGTCACAAGTCAAATTTAGATTTGTTGCTTAATAATGGATTTGTAAGTATTATCAAATACTCAAGCACtaaggaaacaggaaaatctgaaatgttCACTTGCTTCTAAACATTTGCAGCCGAGTCCAACTTACACAGGGTAAGATGAGTTTTACAGACAGACACTATTTGTTATTAGGTCAGCTACAGTAAGTGAAAAAACTCACCTCTTTAAGTCTGATAAAGTAGCAGAAagtcatattttaaatatcagtaTAAACAAATGCTCTAAGTTTGGAAATGTTAATCTTGAAAGAACCTTCAAAAACATTTAAGTGCTGGTTATCTCCCCAGTGTGT¬

3 个答案:

答案 0 :(得分:1)

试试这个:

sed  's/>[A-Z]\{2\}[0-9]\{6\}/\n&\n/g;s/^\n//' file
  • s/>[A-Z]\{2\}[0-9]\{6\}/\n&\n/g:在每个匹配字符串
  • 之前和之后添加换行符
  • s/^\n//:删除第一行添加的换行符

答案 1 :(得分:0)

我讨厌sed,但这是一个有趣的挑战:

root@dbe57bdfb014:/tensorflow# python
Python 2.7.12 (default, Nov 19 2016, 06:48:10) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf, sys 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import *
  File "tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "tensorflow/python/pywrap_tensorflow.py", line 25, in <module>
    from tensorflow.python.platform import self_check
ImportError: No module named platform
>>> image_path = sys.argv[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sys' is not defined

这样是一个有趣的挑战,了解这个程序是留给读者的练习。

答案 2 :(得分:0)

使用GNU grep,你可以写

grep -oP '>[A-Z]{2}\d{6}|(?<=>.{8})[^>]+' file

但是你的Mac上可能没有GNU grep。试试普通perl

perl -pe 'chomp; s/(>[A-Z]{2}\d{6})([^>]+)/$1\n$2\n/g' file