pandas: ignore newlines when reading a CSV

Time: 2018-02-08 21:07:06

Tags: python pandas biopython

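I have a dataset (for the compbio people out there, it is a FASTA file) that is littered with newlines that do not act as delimiters of the data.

Is there a way for pandas to ignore newlines on import, using any of the pandas read functions?

Sample data: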

  

>ERR899297.10000174
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT

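Each entry is delimited by ">". The data is broken up by newlines (nominally limited to 80 characters per line, although that limit is not respected everywhere).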

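4 answers:

Answer 0 (score: 0)

You need another marker that tells pandas when you actually want to start a new row.

For example, I created a file in which the end of each row is encoded by a pipe (|):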

csv = """
col1,col2, col3, col4|
first_col_first_line,2nd_col_first_line,
3rd_col_first_line

de,4rd_col_first_line|
"""
with open("test.csv", "w") as f:
    f.write(csv)  # write the whole string in one call

Then read it with the C engine, specifying the pipe as the line terminator:

import pandas as pd
pd.read_csv("test.csv",lineterminator="|", engine="c")

Which gives me: [image: resulting DataFrame]
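As a rough, self-contained illustration of the same idea, the sketch below reads an in-memory string instead of a file. The column names (`id`, `seq`) and the values are made up for the example. It assumes that, once a custom `lineterminator` is set, the C parser keeps "\n" as an ordinary character inside a field, so the leftover newlines are removed in a second step.

```python
from io import StringIO

import pandas as pd

# Made-up data: "|" ends each record, while "\n" appears inside the second field
raw = "id,seq|r1,AAA\nCCC|r2,GGG\nTTT|"

# With a custom lineterminator, "\n" is no longer treated as the end of a row
df = pd.read_csv(StringIO(raw), lineterminator="|", engine="c")

# Remove the newlines that were kept inside the field values
df["seq"] = df["seq"].str.replace("\n", "", regex=False)
print(df)
```

A custom `lineterminator` must be a single character and is only supported by the C engine.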

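Answer 1 (score: 0)

There is no good way to do this. BioPython alone seems to be sufficient, rather than a hybrid solution that involves iterating over a BioPython object and inserting into a DataFrame.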
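For reference, the hybrid approach mentioned here might look roughly like the sketch below. It assumes the biopython package is installed; the in-memory FASTA text and the column names (`id`, `sequence`) are made up for illustration, and a real file path could be passed to `SeqIO.parse` instead.

```python
from io import StringIO

import pandas as pd
from Bio import SeqIO  # assumes the biopython package is installed

# Hypothetical in-memory FASTA with sequences wrapped over several lines
fasta = StringIO(
    ">ERR899297.10000174\n"
    "TGTAATATTGCC\n"
    "TATCAAGATCAG\n"
    ">ERR123456.12345678\n"
    "ACGTACGT\n"
)

# SeqIO.parse joins the wrapped sequence lines of each record,
# so no manual newline handling is needed
records = [(rec.id, str(rec.seq)) for rec in SeqIO.parse(fasta, "fasta")]

# Build the DataFrame in one step from the collected (id, sequence) pairs
df = pd.DataFrame(records, columns=["id", "sequence"])
print(df)
```

Collecting the records in a list first and building the DataFrame once avoids appending rows to a DataFrame one at a time.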

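Answer 2 (score: 0)

Using any of the pandas read functions, is there a way for pandas to ignore newlines on import?

Yes, just look at the documentation for pd.read_table().

You want to specify a custom line terminator (>) and then handle the newlines (\n) appropriately: use the first one as the column delimiter with str.split(maxsplit=1), and ignore the subsequent newlines (up to the next terminator) with str.replace: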

#---- EXAMPLE DATA ---
from io import StringIO
example_file = StringIO(
"""
>ERR899297.10000174 
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
>ERR123456.12345678
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGC
TATCAAGATCAGCCGATTCT
; this comment should not be read into a dataframe
"""
)
#----------------------


#---- EXAMPLE CODE ---
import pandas as pd
df = pd.read_table(
    example_file,           # Your file goes here
    engine = 'c',           # C parser must be used to allow custom lineterminator, see doc
    lineterminator = '>',   # New lines begin with ">"
    skiprows =1,            # File begins with line terminator ">", so output skips first line 
    names = ['raw'],        # A single column which we will split into two
    comment = ';'           # comment character in FASTA format
)

# The first line break ('\n') separates Column 0 from Column 1
df[['col0','col1']] = pd.DataFrame.from_records(df.raw.apply(lambda s: s.split(maxsplit=1)))

# All subsequent line breaks (which got left in Column 1) should be ignored
df['col1'] = df['col1'].apply(lambda s: s.replace('\n',''))

print(df[['col0','col1']])

# Show that col1 no longer contains line breaks
print('\nExample sequence is:')
print(df['col1'][0])

Returns:

                 col0                                               col1
0  ERR899297.10000174  TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...
1  ERR123456.12345678  TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTA...

Example sequence is:
TGTAATATTGCCTGTAGCGGGAGTTGTTGTCTCAGGATCAGCATTATATATCTCAATTGCATGAATCATCGTATTAATGCTATCAAGATCAGCCGATTCT

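Answer 3 (score: 0)

After pd.read_csv(), you can split the string values with .str.split() on a column (a DataFrame itself has no split() method):

 import pandas as pd


 data = pd.read_csv("test.csv")
 # Assumes the first column holds strings; .str.split() splits each value on whitespace
 data.iloc[:, 0].str.split()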