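I have a dataset (for the compbio people out there, it's a FASTA file) that is littered with newlines which do not act as delimiters of the data.
Is there a way to make pandas ignore newlines on import, using any of the pandas read functions?
Sample data: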
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
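Each entry is delimited by ">". The data is broken up by newlines (nominally limited to 80 characters per line, though this limit is not actually respected everywhere).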
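Answer 0 (score: 0)
You need a separate marker that tells pandas when you actually want to start a new row (tuple).
For example, I created a file in which the end of a row is encoded by a pipe (|):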
csv = """
col1,col2, col3, col4|
first_col_first_line,2nd_col_first_line,
3rd_col_first_line
de,4rd_col_first_line|
"""
# Write the string to disk in one call
with open("test.csv", "w") as f:
    f.write(csv)
Then read it with the C engine, specifying the pipe explicitly as the line terminator:
import pandas as pd
pd.read_csv("test.csv",lineterminator="|", engine="c")
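Answer 1 (score: 0)
There is no good way to do this. BioPython alone seems to be sufficient, and is preferable to a hybrid solution that iterates over a BioPython object and inserts the records into a DataFrame.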
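This answer names BioPython but shows no code. As a rough illustration of the record-by-record idea, here is a library-free sketch. It is a hand-written stand-in, not BioPython's API (BioPython users would typically reach for `Bio.SeqIO.parse(handle, "fasta")` instead). The helper name `parse_fasta` is hypothetical, the sequences are shortened made-up samples, and treating lines that start with `;` as comments is an assumption.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and ';' comment lines (assumption)
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence line: line breaks inside a record are simply dropped
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Shortened, made-up example data
example = StringIO(""">ERR899297.10000174
TGTAATATTGCC
TATCAAGA
; a comment line
>ERR123456.12345678
GGGTTTAAACCC
CGCGATAT
""")

records = list(parse_fasta(example))
print(records)
```

If a DataFrame is still wanted at the end, a list of tuples like `records` can be passed to `pd.DataFrame(records, columns=["id", "sequence"])`.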
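Answer 2 (score: 0)
Using any of the pandas read functions, is there a way to make pandas ignore newlines on import?
Yes, just look at the documentation for pd.read_table().
You want to specify a custom line terminator (>) and then handle the newlines (\n) appropriately: use the first one as the column delimiter with str.split(maxsplit=1), and drop all subsequent newlines (up to the next terminator) with str.replace: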
#---- EXAMPLE DATA ---
from io import StringIO
example_file = StringIO(
"""
>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
>ERR123456.12345678
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
"""
)
#----------------------
#---- EXAMPLE CODE ---
import pandas as pd
df = pd.read_table(
    example_file,          # Your file goes here
    engine='c',            # C parser must be used to allow custom lineterminator, see doc
    lineterminator='>',    # New lines begin with ">"
    skiprows=1,            # File begins with line terminator ">", so output skips first line
    names=['raw'],         # A single column which we will split into two
    comment=';',           # comment character in FASTA format
)
# The first line break ('\n') separates Column 0 from Column 1
df[['col0','col1']] = pd.DataFrame.from_records(df.raw.apply(lambda s: s.split(maxsplit=1)))
# All subsequent line breaks (which got left in Column 1) should be ignored
df['col1'] = df['col1'].apply(lambda s: s.replace('\n',''))
print(df[['col0','col1']])
# Show that col1 no longer contains line breaks
print('\nExample sequence is:')
print(df['col1'][0])
Returns:
col0 col1
0 ERR899297.10000174 TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
1 ERR123456.12345678 TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
Example sequence is:
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGCTATCAAGATCAGCCGATTCT
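The two apply calls in the code above could arguably also be written with pandas' vectorized .str accessor. What follows is a minimal sketch, assuming a raw column shaped like the one read_table produces there. The DataFrame is built by hand with shortened, made-up sequences rather than parsed from a file, so it illustrates only the split-and-clean step.

```python
import pandas as pd

# Hypothetical stand-in for the 'raw' column: each value is
# "header\nsequence line\nsequence line\n".
df = pd.DataFrame({
    "raw": [
        "ERR899297.10000174\nTGTAATATTGCC\nTATCAAGA\n",
        "ERR123456.12345678\nGGGTTTAAACCC\nCGCGATAT\n",
    ]
})

# Split once on the first run of whitespace (here, the first newline):
# column 0 is the header, column 1 is the rest of the record.
parts = df["raw"].str.split(n=1, expand=True)
df["col0"] = parts[0]

# Remove the remaining newlines so each sequence is one continuous string.
df["col1"] = parts[1].str.replace("\n", "", regex=False)

print(df[["col0", "col1"]])
```

Whether this is faster than apply on real data is untested here. The main appeal is that it avoids the lambda functions.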
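Answer 3 (score: 0)
After pd.read_csv(), you can split a string column with the .str.split() accessor (a DataFrame itself has no split() method).
import pandas as pd
data = pd.read_csv("test.csv")
# Assumes the first column holds strings; .str works only on string data
first_col = data.columns[0]
data[first_col].str.split()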