Question

I am trying to import a very large data file. It is a text file that looks like this:

***** Information about Data ***********
Information about data
Information about Data
Information about Data

Information about Data

    Col1     Col2
     1.0      1.0
     1.0      1.0
     1.0      1.0
     1.0      1.0
     ...(10k+ lines)
     1.0      1.0
     1.0      1.0
***** Information about Data ***********
Information about data
Information about Data
Information about Data

Information about Data

    Col1     Col2
     1.0      1.0
     1.0      1.0
     1.0      1.0
     1.0      1.0
     ...(10k+ lines)
     1.0      1.0
     1.0      1.0

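and this repeats an arbitrary number of times. The number of lines between headers varies, and the file is more than 1 million lines in total.

Is there a way to strip out these headers without scanning the file line by line? I have already written a line-by-line search, but it is far too slow.

The header is slightly different each time it appears.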

Answer 1

Assuming your file is named test.txt:

Read the entire file in as a string.

split it on '\n*', which is a new line followed by an asterisk:

     new line
             \ 
  1.0      1.0
***** Information about Data ***********
 \
  followed by an asterisk

rsplit each result on '\n\n', limited to one split, and keep the last piece:

       first new line
                     \
Information about Data

 \
  second new line
    Col1     Col2
     1.0      1.0
     1.0      1.0
     1.0      1.0

Call read_csv on each piece, then combine the results with pd.concat:

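from io import StringIO
import pandas as pd

def rtxt(txt):
    # Parse one whitespace-delimited block; sep=r'\s+' replaces
    # delim_whitespace=True, which is deprecated in recent pandas.
    return pd.read_csv(StringIO(txt), sep=r'\s+')

fname = 'test.txt'

# Read the whole file once, so it is closed before parsing starts.
with open(fname) as f:
    text = f.read()

# Split into per-header chunks, drop each header, parse, and stack.
pd.concat(
    [rtxt(st.rsplit('\n\n', 1)[-1])
     for st in text.split('\n*')],
    ignore_index=True
)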

    Col1  Col2
0    1.0   1.0
1    1.0   1.0
2    1.0   1.0
3    1.0   1.0
4    1.0   1.0
5    1.0   1.0
6    1.0   1.0
7    1.0   1.0
8    1.0   1.0
9    1.0   1.0
10   1.0   1.0
11   1.0   1.0
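
To see the two splitting steps in isolation, here is a minimal sketch that runs on a small in-memory string instead of test.txt. The sample text and the helper name split_blocks are made up for illustration; the sketch assumes, as in the question, that every header starts with an asterisk at the beginning of a line and that a blank line always sits directly above the column names.

```python
from io import StringIO
import pandas as pd

# Hypothetical sample that mimics the layout shown in the question.
sample = (
    "***** Information about Data ***********\n"
    "Information about data\n"
    "\n"
    "Information about Data\n"
    "\n"
    "    Col1     Col2\n"
    "     1.0      2.0\n"
    "     3.0      4.0\n"
    "***** Information about Data ***********\n"
    "Information about data\n"
    "\n"
    "    Col1     Col2\n"
    "     5.0      6.0\n"
)

def split_blocks(text):
    """Return the table part (column names plus rows) of each section."""
    # Step 1: cut at every newline that is immediately followed by '*'.
    chunks = text.split('\n*')
    # Step 2: in each chunk, keep only what follows the last blank line.
    return [chunk.rsplit('\n\n', 1)[-1] for chunk in chunks]

tables = split_blocks(sample)

# Parse each table as whitespace-delimited text and stack the results.
frames = [pd.read_csv(StringIO(t), sep=r'\s+') for t in tables]
result = pd.concat(frames, ignore_index=True)
print(result)
```

One thing to watch for: if the file ends with a blank line, the last chunk's rsplit returns an empty string and read_csv will raise an error, so stripping trailing whitespace from the text before splitting may be needed.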
