CLASS RECORD OF THE STUDENT FROM THE PREVIOUS BATCH WHO TOPPED
Name (Roll no) # Location Section Rank (MARKS) Gender
Anna (+) USA A1 First (100) Female
(04) California V
ADDITIONAL RECORDS OF THE STUDENTS FROM THE PREVIOUS BATCH NEXT IN LIST
Name (Roll no) # Location Section Rank (MARKS) Gender
Bob (-) USA A2 First (99) Male
(07) Florida VI
Eva (+) USA A4 Second (96) Female
(12) Ohio V English (99)
Maths(100)
Other records are not available currently.Some records may be present which can be given on request.
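I used pdftotext to extract a text file from a PDF. With the AWK command below, I got the data shown above.
The table data is separated by an uneven number of spaces.
Lines that are entirely in upper case should be removed.
All trailing lines after the table content should be removed.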
pdftotext -layout INPUTFILE.pdf INPUTFILE.txt
awk '/RESULTS/{flag=1;next}/OTHER DATA/{flag=0}flag' INPUTFILE.txt | column -ts $'\t' -n
How can I get the table data in tab-separated format (as shown below)?
Name (Roll no) # Location Section Rank (MARKS) Gender
Anna (+) USA A1 First (100) Female
(04) California V
Bob (-) USA A2 First (99) Male
(07) Florida VI
Eva (+) USA A4 Second (96) Female
(12) Ohio V English (99)
Maths (100)
Answer 0 (score: 1)
After removing the unwanted lines, the extracted data appears to be in fixed-width format. You could try:
txt = """CLASS RECORD OF THE STUDENT FROM THE PREVIOUS BATCH WHO TOPPED
Name (Roll no) # Location Section Rank (MARKS) Gender
Anna (+) USA A1 First (100) Female
(04) California V
ADDITIONAL RECORDS OF THE STUDENTS FROM THE PREVIOUS BATCH NEXT IN LIST
Name (Roll no) # Location Section Rank (MARKS) Gender
Bob (-) USA A2 First (99) Male
(07) Florida VI
Eva (+) USA A4 Second (96) Female
(12) Ohio V English (99)
Maths(100)
Other records are not available currently.Some records may be present which can be given on request"""
data = [[line[:20], line[20:31], line[31:43], line[43:60], line[60:]]
        for line in txt.split('\n')[1:-1] if line != line.upper()]  # add .strip() to each slice to remove leading and trailing white space
del data[3]  # remove the repeated header of the additional records
>>> for line in data:
... print(line)
# ['Name (Roll no) # ', 'Location ', 'Section ', 'Rank (MARKS) ', 'Gender ']
# ['Anna (+) ', 'USA ', 'A1 ', 'First (100) ', 'Female']
# ['(04) ', 'California ', 'V', '', '']
# ['Bob (-) ', 'USA ', 'A2 ', 'First (99) ', 'Male']
# ['(07) ', 'Florida ', 'VI', '', '']
# ['Eva (+) ', 'USA ', 'A4 ', 'Second (96) ', 'Female']
# ['(12) ', 'Ohio ', 'V ', 'English (99)', '']
# [' ', ' ', ' ', 'Maths(100)', '']
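Since the question asks for tab-separated output, the sliced fields can be stripped and joined with tabs. A minimal sketch, assuming the column offsets 20, 31, 43 and 60 from the slices above; the `to_tsv` helper and the `rows` sample are hypothetical and built here only for illustration:

```python
# Column boundaries assumed from the slices in the answer above.
BOUNDS = [0, 20, 31, 43, 60, None]

def to_tsv(lines):
    """Slice each fixed-width line at BOUNDS, strip the padding, join with tabs."""
    out = []
    for line in lines:
        fields = [line[a:b].strip() for a, b in zip(BOUNDS, BOUNDS[1:])]
        out.append("\t".join(fields))
    return out

# Hypothetical fixed-width sample, padded so that it matches BOUNDS.
FMT = "%-20s%-11s%-12s%-17s%s"
rows = [
    FMT % ("Anna (+)", "USA", "A1", "First (100)", "Female"),
    FMT % ("(04)", "California", "V", "", ""),
]

for row in to_tsv(rows):
    print(repr(row))
```

Empty cells stay in place as empty strings between tabs, so every output line keeps five fields.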
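Answer 1 (score: 1)
The approach I present here uses awk. I make the following assumptions: (1) the header line Name (Roll no) ... Gender may occur multiple times; (2) fields cannot be split on runs of spaces, as the California line shows, because that word is followed by only a single space; (3) the column widths need not be the same after each header line.
In awk, we can use the built-in variable FIELDWIDTHS to set fixed field widths: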
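FIELDWIDTHS #
A space-separated list of columns that tells gawk how to split input with fixed columnar boundaries. Starting in version 4.2, each field width may optionally be preceded by a colon-separated value specifying the number of characters to skip before the field starts. Assigning a value to FIELDWIDTHS overrides the use of FS and FPAT for field splitting. See Constant Size for more information.
Note: this is a gawk extension.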
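For readers who do not use gawk, the splitting that FIELDWIDTHS describes can be sketched in Python. This is only an illustration of the idea, not gawk's implementation; the `split_fixed` helper, the widths and the sample line are all hypothetical:

```python
def split_fixed(line, widths):
    """Split `line` into consecutive fields of the given character widths."""
    fields, pos = [], 0
    for w in widths:
        fields.append(line[pos:pos + w])
        pos += w
    return fields

# Hypothetical widths and line, chosen only for illustration.
widths = [10, 12, 6]
line = "%-10s%-12s%s" % ("(04)", "California", "V")

print(split_fixed(line, widths))
```

In this sketch a field that starts beyond the end of the line comes back as an empty string, and the padding is kept until it is stripped explicitly.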
To determine the value of FIELDWIDTHS, we use match and RSTART:
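RSTART
The start index in characters of the substring that is matched by the match() function (see String Functions). RSTART is set by invoking the match() function. Its value is the position of the string where the matched substring starts, or zero if no match was found.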
This gives us the following (note that OFS is set to | to demonstrate that the fields are split correctly):
awk 'BEGIN{OFS="|"}
/^[- A-Z]*$/{next}        # skip lines that are entirely upper case
/^Other records/{next}    # skip the trailing line after the table
/^Name.*$/{               # header line found
    match($0,"Location"); i2=RSTART
    match($0,"Section");  i3=RSTART
    match($0,"Rank");     i4=RSTART
    match($0,"Gender");   i5=RSTART
    FIELDWIDTHS = (i2-1) " " (i3-i2) " " (i4-i3) " " (i5-i4) " 6"
    $0=$0                 # re-split the header line with the new widths
    if (v==0) {print $1,$2,$3,$4,$5}   # print the header only the first time
    v++; next
}
{print $1,$2,$3,$4,$5}' <file>
This outputs:
Name (Roll no) # |Location |Section |Rank (MARKS) |Gender
Anna (+) |USA |A1 |First (100) |Female
(04) |California |V||
Bob (-) |USA |A2 |First (99) |Male
(07) |Florida |VI||
Eva (+) |USA |A4 |Second (96) |Female
(12) |Ohio |V |English (99)|
| | |Maths(100)|
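The same idea, taking the column boundaries from the offsets of the header words and recomputing them at every header line, can be sketched in Python. This is a rough analogue under stated assumptions, not a port of the awk script: `header_bounds`, `parse` and the `sample` data are hypothetical, and the last column runs to the end of the line instead of having a fixed width of 6.

```python
HEADERS = ["Name", "Location", "Section", "Rank", "Gender"]

def header_bounds(header):
    """Return (start, end) slice bounds per column from the header offsets."""
    starts = [header.index(name) for name in HEADERS]
    ends = starts[1:] + [None]          # last column runs to the end of the line
    return list(zip(starts, ends))

def parse(lines):
    bounds, rows = None, []
    for line in lines:
        if line.startswith("Name"):
            seen_header = bounds is not None
            bounds = header_bounds(line)    # recompute at every header line
            if seen_header:
                continue                    # keep the header only once
        elif bounds is None or line == line.upper():
            continue                        # skip all-caps and pre-header lines
        rows.append([line[a:b].rstrip() for a, b in bounds])
    return rows

# Hypothetical sample: two tables whose first column differs in width.
fmt1 = "%-18s%-12s%-9s%-14s%s"
fmt2 = "%-20s%-10s%-9s%-14s%s"
head = ("Name (Roll no) #", "Location", "Section", "Rank (MARKS)", "Gender")
sample = [
    "CLASS RECORD",
    fmt1 % head,
    fmt1 % ("Anna (+)", "USA", "A1", "First (100)", "Female"),
    "ADDITIONAL RECORDS",
    fmt2 % head,
    fmt2 % ("Bob (-)", "USA", "A2", "First (99)", "Male"),
]

for row in parse(sample):
    print("|".join(row))
```

Because the bounds are refreshed at each header, the second table is split correctly even though its columns start at different offsets.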
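Comment: at this point the output already looks "OK", but keep in mind that the columns need not have the same width after each header line (assumption 3).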
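You want a tab-separated column layout, but tabs are evil: everything depends on how your system interprets the width of a tab. Is it 4, 8 or 17? I propose a space-separated layout here instead. The best way is to strip all trailing spaces from each field and then use the column command. This leads to: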
awk 'BEGIN{OFS="|"}
/^[- A-Z]*$/{next}        # skip lines that are entirely upper case
/^Other records/{next}    # skip the trailing line after the table
/^Name.*$/{               # header line found
    match($0,"Location"); i2=RSTART
    match($0,"Section");  i3=RSTART
    match($0,"Rank");     i4=RSTART
    match($0,"Gender");   i5=RSTART
    FIELDWIDTHS = (i2-1) " " (i3-i2) " " (i4-i3) " " (i5-i4) " 6"
    $0=$0                 # re-split the header line with the new widths
    for(i=1;i<=NF;i++) sub(/ *$/,"",$i)   # strip trailing spaces from each field
    if (v==0) {print $1,$2,$3,$4,$5}      # print the header only the first time
    v++; next
}
{
    for(i=1;i<=NF;i++) sub(/ *$/,"",$i)   # strip trailing spaces from each field
    print $1,$2,$3,$4,$5
}' <file> | column -t -s '|'
Output:
Name (Roll no) # Location Section Rank (MARKS) Gender
Anna (+) USA A1 First (100) Female
(04) California V
Bob (-) USA A2 First (99) Male
(07) Florida VI
Eva (+) USA A4 Second (96) Female
(12) Ohio V English (99)
Maths(100)
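What the alignment step does, padding every column to its widest cell, can be sketched in Python. This only approximates the behaviour of `column -t`; the `align` helper and the `rows` sample are hypothetical, and every row is assumed to have the same number of cells:

```python
def align(rows, gap=2):
    """Pad every column to its widest cell and separate columns by `gap` spaces."""
    widths = [max(len(cell) for cell in col) for col in zip(*rows)]
    return [
        (" " * gap).join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    ]

# Hypothetical rows, already split into fields with trailing spaces removed.
rows = [
    ["Name (Roll no) #", "Location", "Section"],
    ["Anna (+)", "USA", "A1"],
    ["(04)", "California", "V"],
]

for line in align(rows):
    print(line)
```

The widths are computed from the data itself, so nothing has to be known about the layout in advance.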
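Note: column adjusts the columns as needed, so they do not have to have the same width each time. If you know the column widths, I suggest using the printf statement in awk instead, which would be: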
awk 'BEGIN{format="%-18s%-12s%-9s%-14s%-6s\n"}
/^[- A-Z]*$/{next}        # skip lines that are entirely upper case
/^Other records/{next}    # skip the trailing line after the table
/^Name.*$/{               # header line found
    match($0,"Location"); i2=RSTART
    match($0,"Section");  i3=RSTART
    match($0,"Rank");     i4=RSTART
    match($0,"Gender");   i5=RSTART
    FIELDWIDTHS = (i2-1) " " (i3-i2) " " (i4-i3) " " (i5-i4) " 6"
    $0=$0                 # re-split the header line with the new widths
    if (v==0) {printf format,$1,$2,$3,$4,$5}   # print the header only the first time
    v++; next
}
{ printf format,$1,$2,$3,$4,$5 }' <file>
The output is:
Name (Roll no) # Location Section Rank (MARKS) Gender
Anna (+) USA A1 First (100) Female
(04) California V
Bob (-) USA A2 First (99) Male
(07) Florida VI
Eva (+) USA A4 Second (96) Female
(12) Ohio V English (99)
Maths(100)
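Python's `%` operator accepts the same left-aligned conversions, so the fixed-format idea can be tried out without awk. A small sketch, assuming the widths 18, 12, 9, 14 and 6 from the format string above; the `format_row` helper and the sample row are hypothetical:

```python
# Same conversions as the awk format string, without the trailing newline.
FORMAT = "%-18s%-12s%-9s%-14s%-6s"

def format_row(fields):
    """Left-align five fields in columns of fixed width."""
    return FORMAT % tuple(fields)

# Hypothetical row with two empty cells.
row = ["(04)", "California", "V", "", ""]
line = format_row(row)
print(repr(line))
```

As in awk, `%-18s` pads but never truncates: a value longer than its column width pushes the following columns to the right, so the widths should be chosen to fit the longest expected value.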