I have a text file containing some DNA sequences. Everything is on a single line, but I want to split it into multiple lines.
>JH739887TTTACAATGTATAATAGAAACTAAAACTGAAATGTTAATCTTGAAATTTAAGAATCTTCAAAAATGTTTAAGTGGTGATAATCTCCCCAGTGTGAGAAACACACTTGGAAGGAAGTCACAAGTCAAATTTAGATTTGTTGCTTAATAATGGATTTGTAAGTATTATCAAATACTCAAGCACtaaggaaacaggaaaatctgaaatgttCACTTGCTTCTAAACATTTGCAGCCGAGTCCAACTTACACAGGGTAAGATGAGTTTTACAGACAGACACTATTTGTTATTAGGTCAGCTACAGTAAGTGAAAAAACTCACCTCTTTAAGTCTGATAAAGTAGCAGAAagtcatattttaaatatcagtaTAAACAAATGCTCTAAGTTTGGAAATGTTAATCTTGAAAGAACCTTCAAAAACATTTAAGTGCTGGTTATCTCCCCAGTGTGT>JH739882TTTACAATGTATAATAGAAACTAAAACTGAAATGTTAATCTTGAAATTTAAGAATCTTCAAAAATGTTTAAGTGGTGATAATCTCCCCAGTGTGAGAAACACACTTGGAAGGAAGTCACAAGTCAAATTTAGATTTGTTGCTTAATAATGGATTTGTAAGTATTATCAAATACTCAAGCACtaaggaaacaggaaaatctgaaatgttCACTTGCTTCTAAACATTTGCAGCCGAGTCCAACTTACACAGGGTAAGATGAGTTTTACAGACAGACACTATTTGTTATTAGGTCAGCTACAGTAAGTGAAAAAACTCACCTCTTTAAGTCTGATAAAGTAGCAGAAagtcatattttaaatatcagtaTAAACAAATGCTCTAAGTTTGGAAATGTTAATCTTGAAAGAACCTTCAAAAACATTTAAGTGCTGGTTATCTCCCCAGTGTGT
I can find the places where I want to split the file:
grep '>[A-Z]\{2\}[0-9]\{6\}' ~/Desktop/text2.txt
The regular expression '>[A-Z]\{2\}[0-9]\{6\}' looks for the pattern ">JH######".
But whenever I use a sed command to add a line break before and after the regex match, it does not work:
sed '/>[A-Z]\{2\}[0-9]\{6\}/a/b\
\n' ~/Desktop/text2.txt
This is the error I get:
sed: 1: "/>[A-Z]\{2\}[0-9]\{6\}/ ...": command a expects \ followed by text
The following command runs, but does not give the expected result:
sed '/>[A-Z]\{2\}[0-9]\{6\}/a\
\n' ~/Desktop/text2.txt
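This is the result I expect (the first line should not have a line break before it, but every other match should have a line break before and after it; line breaks are shown as ¬ here for clarity):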
>JH739887¬
TTTACAATGTATAATAGAAACTAAAACTGAAATGTTAATCTTGAAATTTAAGAATCTTCAAAAATGTTTAAGTGGTGATAATCTCCCCAGTGTGAGAAACACACTTGGAAGGAAGTCACAAGTCAAATTTAGATTTGTTGCTTAATAATGGATTTGTAAGTATTATCAAATACTCAAGCACtaaggaaacaggaaaatctgaaatgttCACTTGCTTCTAAACATTTGCAGCCGAGTCCAACTTACACAGGGTAAGATGAGTTTTACAGACAGACACTATTTGTTATTAGGTCAGCTACAGTAAGTGAAAAAACTCACCTCTTTAAGTCTGATAAAGTAGCAGAAagtcatattttaaatatcagtaTAAACAAATGCTCTAAGTTTGGAAATGTTAATCTTGAAAGAACCTTCAAAAACATTTAAGTGCTGGTTATCTCCCCAGTGTGT¬
>JH739882¬
TTTACAATGTATAATAGAAACTAAAACTGAAATGTTAATCTTGAAATTTAAGAATCTTCAAAAATGTTTAAGTGGTGATAATCTCCCCAGTGTGAGAAACACACTTGGAAGGAAGTCACAAGTCAAATTTAGATTTGTTGCTTAATAATGGATTTGTAAGTATTATCAAATACTCAAGCACtaaggaaacaggaaaatctgaaatgttCACTTGCTTCTAAACATTTGCAGCCGAGTCCAACTTACACAGGGTAAGATGAGTTTTACAGACAGACACTATTTGTTATTAGGTCAGCTACAGTAAGTGAAAAAACTCACCTCTTTAAGTCTGATAAAGTAGCAGAAagtcatattttaaatatcagtaTAAACAAATGCTCTAAGTTTGGAAATGTTAATCTTGAAAGAACCTTCAAAAACATTTAAGTGCTGGTTATCTCCCCAGTGTGT¬
Answer 0: (score: 1)
Try this:
sed 's/>[A-Z]\{2\}[0-9]\{6\}/\n&\n/g;s/^\n//' file
s/>[A-Z]\{2\}[0-9]\{6\}/\n&\n/g: adds a newline before and after each matching string
s/^\n//: removes the newline added at the start of the first line
Answer 1: (score: 0)
I hate sed, but this is a fun challenge:
It was such a fun challenge that understanding this program is left as an exercise for the reader.
Answer 2: (score: 0)
With GNU grep, you can write
grep -oP '>[A-Z]{2}\d{6}|(?<=>.{8})[^>]+' file
But you probably don't have GNU grep on your Mac. Try plain perl
perl -pe 'chomp; s/(>[A-Z]{2}\d{6})([^>]+)/$1\n$2\n/g' file
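If neither GNU grep nor perl is an option, the same split can probably be done with plain sed. The sketch below assumes a POSIX-style sed (such as the BSD sed shipped with macOS), where a newline in the replacement has to be written as a backslash followed by a literal line break rather than as \n. The sample data and the function name split_records are made up for illustration.

```shell
# Hypothetical sample: two short records on one line, laid out like the question's file.
sample='>AB123456ACGTacgt>CD654321TTGAttga'

# split_records is an illustrative helper name, not part of any tool.
split_records() {
  # First expression: wrap every header in line breaks. The line break in the
  # replacement is a backslash followed by an actual newline, not \n.
  # Second expression: drop the line break that ended up before the first header.
  sed -e 's/>[A-Z]\{2\}[0-9]\{6\}/\
&\
/g' -e 's/^\n//'
}

result=$(printf '%s\n' "$sample" | split_records)
printf '%s\n' "$result"
```

To run it on the real file, replace the printf pipeline with split_records < ~/Desktop/text2.txt.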