我有一个包含4个数据部分的数据文件。页眉数据,摘要数据,明细数据和页脚数据。每个部分都有固定的列数。每个部分都被两行所分隔,每行只有一个“#”作为行内容。但是不同的部分具有不同的列。有没有一种方法可以避免创建新文件,而仅使用spark tsv(制表符分隔的foramt)模块或任何其他模块将文件直接读取到4个数据集中。如果我直接读取文件,则在接下来的内容中失去了多余的列数据部分。它仅从文件中读取作为文件第一行的那些列。
#deptno dname location
10 Accounting New York
20 Research Dallas
30 Sales Chicago
40 Operations Boston
#
#
#grade losal hisal
1 700.00 1200.00
2 1201.00 1400.00
4 2001.00 3000.00
5 3001.00 99999.00
3 1401.00 2000.00
#
#
#ENAME DNAME JOB EMPNO HIREDATE LOC
ADAMS RESEARCH CLERK 7876 23-MAY-87 DALLAS
ALLEN SALES SALESMAN 7499 20-FEB-81 CHICAGO
BLAKE SALES MANAGER 7698 01-MAY-81 CHICAGO
CLARK ACCOUNTING MANAGER 7782 09-JUN-81 NEW YORK
FORD RESEARCH ANALYST 7902 03-DEC-81 DALLAS
JAMES SALES CLERK 7900 03-DEC-81 CHICAGO
JONES RESEARCH MANAGER 7566 02-APR-81 DALLAS
#
#
#Name Age Address
Paul 23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St
输出:
Dataset d1 :
#deptno dname location
10 Accounting New York
20 Research Dallas
30 Sales Chicago
40 Operations Boston
Dataset d2 :
#grade losal hisal
1 700.00 1200.00
2 1201.00 1400.00
4 2001.00 3000.00
5 3001.00 99999.00
3 1401.00 2000.00
Dataset d3 :
#ENAME DNAME JOB EMPNO HIREDATE LOC
ADAMS RESEARCH CLERK 7876 23-MAY-87 DALLAS
ALLEN SALES SALESMAN 7499 20-FEB-81 CHICAGO
BLAKE SALES MANAGER 7698 01-MAY-81 CHICAGO
CLARK ACCOUNTING MANAGER 7782 09-JUN-81 NEW YORK
FORD RESEARCH ANALYST 7902 03-DEC-81 DALLAS
JAMES SALES CLERK 7900 03-DEC-81 CHICAGO
JONES RESEARCH MANAGER 7566 02-APR-81 DALLAS
Dataset d4 :
#Name Age Address
Paul23 1115 W Franklin
Bessy the Cow 5 Big Farm Way
Zeke 45 W Main St