Extracting table data from a text file with uneven spacing

Time: 2018-03-14 06:58:58

Tags: python ubuntu awk sed text-processing

         CLASS RECORD OF THE STUDENT FROM THE PREVIOUS BATCH WHO TOPPED
Name (Roll no) #    Location   Section     Rank (MARKS)     Gender   
Anna (+)            USA        A1          First (100)      Female
(04)                California V
ADDITIONAL RECORDS OF THE STUDENTS FROM THE PREVIOUS BATCH NEXT IN LIST
Name (Roll no) #    Location   Section     Rank (MARKS)     Gender
Bob (-)             USA        A2          First (99)       Male
(07)                Florida    VI
Eva (+)             USA        A4          Second (96)      Female
(12)                Ohio       V           English (99)
                                           Maths(100)
Other records are not available currently.Some records may be present which can be given on request.

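I obtained the text file from a PDF using pdftotext, and the AWK command below gave me the data above. Remaining issues:
  • The table data is separated by an uneven number of spaces
  • Lines that are entirely in uppercase should be removed
  • All trailing lines after the table content should be removed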

pdftotext -layout INPUTFILE.pdf INPUTFILE.txt
awk '/RESULTS/{flag=1;next}/OTHER DATA/{flag=0}flag' INPUTFILE.txt | column -ts $'\t' -n

How can I get the table data in a tab-separated format (as shown below)?
The code should be generic so that it also works for other kinds of tables.

Name (Roll no) #    Location    Section     Rank (MARKS)    Gender  
Anna (+)            USA         A1          First (100)     Female
(04)                California  V
Bob (-)             USA         A2          First (99)      Male
(07)                Florida     VI
Eva (+)             USA         A4          Second (96)     Female
(12)                Ohio        V           English (99)
                                            Maths (100)

2 answers:

Answer 0: (score: 1)

After removing the unwanted lines, the extracted data appears to be in a fixed-width format. You could try:

txt = """CLASS RECORD OF THE STUDENT FROM THE PREVIOUS BATCH WHO TOPPED
Name (Roll no) #    Location   Section     Rank (MARKS)     Gender   
Anna (+)            USA        A1          First (100)      Female
(04)                California V
ADDITIONAL RECORDS OF THE STUDENTS FROM THE PREVIOUS BATCH NEXT IN LIST
Name (Roll no) #    Location   Section     Rank (MARKS)     Gender
Bob (-)             USA        A2          First (99)       Male
(07)                Florida    VI
Eva (+)             USA        A4          Second (96)      Female
(12)                Ohio       V           English (99)
                                           Maths(100)
Other records are not available currently.Some records may be present which can be given on request"""

# Slice each line at the fixed column offsets 0, 20, 31, 43 and 60
data = [[line[:20], line[20:31], line[31:43], line[43:60], line[60:]]
        for line in txt.split('\n')[1:-1] if line != line.upper()]    # add .strip() to each slice if you want to remove the leading and trailing whitespace
del data[3]   # Remove the repeated header for the additional records

>>> for line in data:
...     print(line)

# ['Name (Roll no) #    ', 'Location   ', 'Section     ', 'Rank (MARKS)     ', 'Gender   ']
# ['Anna (+)            ', 'USA        ', 'A1          ', 'First (100)      ', 'Female']
# ['(04)                ', 'California ', 'V', '', '']
# ['Bob (-)             ', 'USA        ', 'A2          ', 'First (99)       ', 'Male']
# ['(07)                ', 'Florida    ', 'VI', '', '']
# ['Eva (+)             ', 'USA        ', 'A4          ', 'Second (96)      ', 'Female']
# ['(12)                ', 'Ohio       ', 'V           ', 'English (99)', '']
# ['                    ', '           ', '            ', 'Maths(100)', '']
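Since the question asks for something generic, the hardcoded offsets above could instead be derived from the header line itself. The following is only a sketch under a few assumptions: column titles never contain two consecutive spaces, columns are separated by at least two spaces in the header, and no data row is entirely uppercase. The function names and the `header_prefix` and `stop_prefix` parameters are made up for this illustration.

```python
import re

def parse_fixed_width(text, header_prefix="Name", stop_prefix="Other records"):
    """Split pdftotext-style fixed-width lines into stripped fields.

    Column offsets are re-derived at every header line, so the widths
    may differ from one table to the next.
    """
    rows, starts, header_seen = [], None, False
    for line in text.splitlines():
        if line.startswith(stop_prefix):
            break                         # trailing notes: stop reading
        if not line.strip() or line == line.upper():
            continue                      # blank line or all-caps title
        if line.startswith(header_prefix):
            # A column starts right after each run of 2+ spaces.
            starts = [0] + [m.end() for m in re.finditer(r" {2,}(?=\S)", line)]
            if header_seen:
                continue                  # keep only the first header
            header_seen = True
        elif starts is None:
            continue                      # text before the first header
        ends = starts[1:] + [None]
        rows.append([line[a:b].strip() for a, b in zip(starts, ends)])
    return rows

def to_tsv(rows):
    """Join the parsed rows into tab-separated text."""
    return "\n".join("\t".join(row) for row in rows)

sample = """CLASS RECORD OF THE STUDENT FROM THE PREVIOUS BATCH WHO TOPPED
Name (Roll no) #    Location   Section     Rank (MARKS)     Gender
Anna (+)            USA        A1          First (100)      Female
(04)                California V
Other records are not available currently."""
print(to_tsv(parse_fixed_width(sample)))
```

Stripping every field before joining means the tab-separated output does not depend on how wide the original columns were.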

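Answer 1: (score: 1)

The approach I present here uses awk. I make the following assumptions:

  • The header line Name (Roll no) ... Gender may appear more than once
  • The listing under a header line has fixed field widths, but the widths are not known in advance. I infer this from the California line, since that word is followed by only a single space.
  • The field widths may change after each header line.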

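In awk, we can set fixed field widths with the built-in variable FIELDWIDTHS:

  FIELDWIDTHS # A space-separated list of columns that tells gawk how to split input with fixed columnar boundaries. Starting in version 4.2, each field width may optionally be preceded by a colon-separated value specifying the number of characters to skip before the field starts. Assigning a value to FIELDWIDTHS overrides the use of FS and FPAT for field splitting. See Constant Size for more information.

Note: this is a gawk extension.

To determine the value of FIELDWIDTHS, we use match and RSTART:

  RSTART  The start index in characters of the substring that is matched by the match() function (see String Functions). RSTART is set by invoking the match() function. Its value is the position of the string where the matched substring starts, or zero if no match was found.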
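To make the arithmetic concrete: if the titles in a header start at characters 1, 21, 32, 44 and 61 (1-based, as RSTART reports them), the widths are 21-1, 32-21, 44-32 and 61-44, plus a fixed width for the last column. Here is a rough Python sketch of that calculation. The function name and the `last_width` parameter are hypothetical, and the sample header is rebuilt from known widths purely so the result can be checked.

```python
def fieldwidths_from_header(header, labels, last_width):
    """Build a gawk-style FIELDWIDTHS string from a header line.

    labels are the titles of every column after the first one;
    last_width is the width assumed for the final column.
    """
    # str.find is 0-based while awk's RSTART is 1-based, hence the +1.
    starts = [1] + [header.find(label) + 1 for label in labels]
    if 0 in starts:
        raise ValueError("a label was not found in the header")
    widths = [b - a for a, b in zip(starts, starts[1:])] + [last_width]
    return " ".join(str(w) for w in widths)

# Sample header laid out with column widths 20, 11, 12, 17 and 6.
titles = ["Name (Roll no) #", "Location", "Section", "Rank (MARKS)", "Gender"]
header = "".join(t.ljust(w) for t, w in zip(titles, (20, 11, 12, 17, 6)))

print(fieldwidths_from_header(header, ["Location", "Section", "Rank", "Gender"], 6))
# prints: 20 11 12 17 6
```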

Putting this together gives the following (note that OFS is set to | to show that the fields are split correctly):

awk 'BEGIN{OFS="|"}
     /^[- A-Z]*$/{next}          # skips only caps lines
     /^Other records/{next}      # skips the last line
     /^Name.*$/{                 # find header line
       match($0,"Location");i2=RSTART;
       match($0,"Section"); i3=RSTART;
       match($0,"Rank");    i4=RSTART;
       match($0,"Gender");  i5=RSTART;
       FIELDWIDTHS= i2-1" "i3-i2" "i4-i3" "i5-i4" 6"
       $0=$0                     # reprocess header line
       # print header line only the first time
       if (v==0) {print $1,$2,$3,$4,$5}
       v++; next      
     }
     {print $1,$2,$3,$4,$5}'

This outputs:

Name (Roll no) #    |Location   |Section     |Rank (MARKS)     |Gender
Anna (+)            |USA        |A1          |First (100)      |Female
(04)                |California |V||
Bob (-)             |USA        |A2          |First (99)       |Male
(07)                |Florida    |VI||
Eva (+)             |USA        |A4          |Second (96)      |Female
(12)                |Ohio       |V           |English (99)|
                    |           |            |Maths(100)|

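Comment: at this point the result already looks "OK", but keep in mind that the columns need not have the same width after each header line (assumption 3).

You asked for tab-separated columns, but tabs are evil: everything depends on how your system interprets the width of a tab. Is it 4, 8, or 17? I therefore propose a space-separated layout instead. The best approach is to strip all trailing spaces from each field and then pipe the result through the column command. This leads to: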

awk 'BEGIN{OFS="|"}
     /^[- A-Z]*$/{next}          # skips only caps lines
     /^Other records/{next}      # skips the last line
     /^Name.*$/{                 # find header line
       match($0,"Location");i2=RSTART;
       match($0,"Section"); i3=RSTART;
       match($0,"Rank");    i4=RSTART;
       match($0,"Gender");  i5=RSTART;
       FIELDWIDTHS= i2-1" "i3-i2" "i4-i3" "i5-i4" 6"
       $0=$0                     # reprocess header line
       # strip trailing spaces from every field
       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);
       # print header line only the first time
       if (v==0) {print $1,$2,$3,$4,$5}
       v++; next      
     }
     {
       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);
       print $1,$2,$3,$4,$5
     }' <file> | column -t -s '|'

Output:

Name (Roll no) #  Location    Section  Rank (MARKS)  Gender  
Anna (+)          USA         A1       First (100)   Female  
(04)              California  V                              
Bob (-)           USA         A2       First (99)    Male    
(07)              Florida     VI                             
Eva (+)           USA         A4       Second (96)   Female  
(12)              Ohio        V        English (99)          
                                       Maths(100)          
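For readers without the column utility, the alignment step can be approximated in a few lines of Python. This is only a sketch in the spirit of column -t, not a reimplementation of it: the `align_columns` name and the `gap` parameter are invented here, and the input is assumed to be the |-separated, already trimmed output of the awk script.

```python
def align_columns(rows, gap=2):
    """Pad every column to its widest cell and join with `gap` spaces."""
    ncols = max(len(row) for row in rows)
    # Short rows are filled with empty cells so all rows have ncols cells.
    padded = [list(row) + [""] * (ncols - len(row)) for row in rows]
    widths = [max(len(row[i]) for row in padded) for i in range(ncols)]
    sep = " " * gap
    return [
        sep.join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in padded
    ]

piped = """Name (Roll no) #|Location|Section|Rank (MARKS)|Gender
Anna (+)|USA|A1|First (100)|Female
(04)|California|V||
|||Maths(100)|"""

rows = [line.split("|") for line in piped.splitlines()]
print("\n".join(align_columns(rows)))
```

Because the widths are recomputed from the data on every run, nothing has to be known about the original column layout.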

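Remark: column adjusts the column widths as needed, so they do not have to be the same every time. If you know the column widths, I suggest using awk's printf statement instead, which gives: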

awk 'BEGIN{format="%-18s%-12s%-9s%-14s%-6s\n"}
     /^[- A-Z]*$/{next}          # skips only caps lines
     /^Other records/{next}      # skips the last line
     /^Name.*$/{                 # find header line
       match($0,"Location");i2=RSTART;
       match($0,"Section"); i3=RSTART;
       match($0,"Rank");    i4=RSTART;
       match($0,"Gender");  i5=RSTART;
       FIELDWIDTHS= i2-1" "i3-i2" "i4-i3" "i5-i4" 6"
       $0=$0                     # reprocess header line
       # strip trailing spaces so the printf widths take effect
       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);
       # print header line only the first time
       if (v==0) {printf format,$1,$2,$3,$4,$5}
       v++; next      
     }
     {
       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);
       printf format,$1,$2,$3,$4,$5
     }' <file>
The output is:

Name (Roll no) #  Location    Section  Rank (MARKS)  Gender
Anna (+)          USA         A1       First (100)   Female
(04)              California  V                            
Bob (-)           USA         A2       First (99)    Male  
(07)              Florida     VI                           
Eva (+)           USA         A4       Second (96)   Female
(12)              Ohio        V        English (99)        
                                       Maths(100)