I'm using Tesseract to extract text from a scanned PDF. The output string looks like this:
Haemoglobin 13.5 14-16 g/dl
Random Blood Sugar 186 60 - 160 mg/dl
Random Urine Sugar Nil
¢ Blood Urea 43 14-40 mg/dl
4 — Serum Creatinine 2.13 0.4-1.5 mg/dl
Serum Uric Acid 4.9 3.4-7.0 mg/dl
Serum Sodium 142 135 - 150 meq/L
/ Serum Potassium 2.6 3.5-5.0 meq/L
Total Cholesterol] 146 110 - 160 mg/dl
Triglycerides 162 60 - 180 mg/d]
Now I need to load this into a DataFrame or a CSV, with all the text in one column and the values in the other columns.
**Haemoglobin** 13.5 14-16 g/dl
**Random Blood Sugar** 186 60 - 160 mg/dl
The best I've managed so far is this:
text = text.split('\n')
text = [x.split(' ') for x in text]
df = pd.DataFrame(text, columns=['Header','Detail','a','e','b','c','d','f'])
df
Header Detail a e b c d f
0 Haemoglobin 13.5 14-16 g/dl None None None None
1 Random Blood Sugar 186 60 - 160 mg/dl
2 Random Urine Sugar Nil None None None None
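Please help!!
Answer 0 (score: 1)
I should point out that this takes a fair amount of work and, honestly, you haven't tried much yet. But to get you started, here is some code that cleans up a few of the obvious problems in the input: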
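import re

def isnum(x):
    """Return True if x parses as a float."""
    try:
        float(x)
        return True
    except ValueError:
        return False

def clean_line(lnin):
    # Strip the leading garbage (everything before the first letter)
    ln = re.sub('^[^A-Za-z]+', '', lnin).split()
    # Index of the first numeric token; with no number (e.g. a "Nil" result)
    # the whole line is kept as the header instead of raising an error
    ind = next((i for i, tok in enumerate(ln) if isnum(tok)), len(ln))
    header = ' '.join(ln[:ind])
    ln = [header] + ln[ind:]
    # Merge a spaced range such as "60 - 160" into one "60-160" token
    if '-' in ln:
        ind = ln.index('-')
        ln[ind - 1] = ln[ind - 1] + '-' + ln[ind + 1]
        del ln[ind:ind + 2]
    return ln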
Use the clean_line function to clean each line. You can then feed the result into a DataFrame.
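As a rough sketch of that last step, the snippet below turns a few of the OCR lines into fixed-width rows. `split_fields` is a hypothetical, compact stand-in for the cleaning function so that the block runs on its own, and the four column names are my assumption, not something from the question.

```python
import re

def split_fields(line):
    """Hypothetical helper: drop leading OCR junk, then return
    [header, result, normal range, unit], padding missing fields with ''."""
    tokens = re.sub(r'^[^A-Za-z]+', '', line).split()
    # The first token that starts with a digit, or is "Nil", begins the values
    start = next((i for i, tok in enumerate(tokens)
                  if tok[0].isdigit() or tok == 'Nil'), len(tokens))
    header = ' '.join(tokens[:start])
    values = tokens[start:]
    result = values[0] if values else ''
    unit = values[-1] if len(values) > 1 else ''
    normal = ''.join(values[1:-1])  # "60", "-", "160" becomes "60-160"
    return [header, result, normal, unit]

ocr_text = """Haemoglobin 13.5 14-16 g/dl
Random Blood Sugar 186 60 - 160 mg/dl
Random Urine Sugar Nil
¢ Blood Urea 43 14-40 mg/dl"""

columns = ['Header', 'Result', 'Normal', 'Unit']  # assumed column names
rows = [split_fields(line) for line in ocr_text.splitlines()]
for row in rows:
    print(row)
```

Every row has exactly four entries, so if pandas is installed the list can be passed straight to `pd.DataFrame(rows, columns=columns)`.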
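Answer 1 (score: -1)
Start from the end and work backwards, because the rest of the record appears to have a fixed format. That is, working backwards:
a string for the unit (no spaces) : number : dash : number : number : the text you want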
Haemoglobin 13.5 14-16 g/dl
Field 5 (all characters backwards from end until space reached) = g/dl
Field 4 (jump over space, all characters backwards until space or dash reached) = 16
Field 3 (jump over space if present, pick up dash) = -
Field 2 (jump over space if present, all characters backwards until space reached) = 14
Field 1 (jump over space, all characters backwards until space reached) = 13.5
Field 0 (jump over space and take the rest) = Haemoglobin
Total Cholesterol] 146 110 - 160 mg/dl
Field 5 (all characters backwards from end until space reached) = mg/dl
Field 4 (jump over space, all characters backwards until space or dash reached) = 160
Field 3 (jump over space if present, pick up dash) = -
Field 2 (jump over space if present, all characters backwards until space reached) = 110
Field 1 (jump over space, all characters backwards until space reached) = 146
Field 0 (jump over space and take the rest) = Total Cholesterol]
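The field-by-field walk above could be sketched roughly like this. The function name `parse_from_end` and the choice to return None for lines that don't fit the layout (such as the "Nil" row) are my assumptions; leading OCR junk is not removed here and would stay in the name.

```python
import re

NUMBER = r'\d+(?:\.\d+)?'
# Upper bound, dash and lower bound at the end of the string, spaces optional
RANGE_AT_END = re.compile('(' + NUMBER + r')\s*-\s*(' + NUMBER + ')$')

def parse_from_end(line):
    """Peel fields off the right-hand end of one line. Assumes the layout
    'name result low - high unit'; returns None if the line does not fit."""
    parts = line.rstrip().rsplit(' ', 1)      # Field 5: unit
    if len(parts) != 2:
        return None
    rest, unit = parts
    rng = RANGE_AT_END.search(rest)           # Fields 4, 3, 2: the range
    if rng is None:
        return None
    low, high = rng.group(1), rng.group(2)
    rest = rest[:rng.start()].rstrip()
    parts = rest.rsplit(' ', 1)               # Field 1: result
    if len(parts) != 2:
        return None
    name, result = parts                      # Field 0: whatever is left
    return name, result, low, high, unit

for sample in ['Haemoglobin 13.5 14-16 g/dl',
               'Total Cholesterol] 146 110 - 160 mg/dl',
               'Random Urine Sugar Nil']:
    print(parse_from_end(sample))
```

Splitting from the right means the number of words in the test name doesn't matter, which is the point of working backwards.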
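Answer 2 (score: -1)
Using regular expressions, the sample code below parses the text into CSV strings with the fields: description, result, normal_value, unit.
Note that the list test_results would normally be read from a file with:
with open('name_test_file') as test_file:
    test_results = test_file.read().splitlines()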
import re
tests = 'Haemoglobin 13.5 14-16 g/dl\nRandom Blood Sugar 186 60 - 160 mg/dl\n'\
'Random Urine Sugar Nil\n¢ Blood Urea 43 14-40 mg/dl\n'\
'4 — Serum Creatinine 2.13 0.4-1.5 mg/dl\n'\
'Serum Uric Acid 4.9 3.4-7.0 mg/dl\nSerum Sodium 142 135 - 150 meq/L\n'\
'/ Serum Potassium 2.6 3.5-5.0 meq/L\n'\
'Total Cholesterol] 146 110 - 160 mg/dl\n'\
'Triglycerides 162 60 - 180 mg/d]\n'
test_results = tests.splitlines()
for test_result in test_results:
    print('input :', test_result)
    m = re.search(r'.*?(?=[a-zA-Z][a-zA-Z])(?P<description>.*?)(?=[ ][0-9])'
                  r'[ ](?P<result>[0-9.]*?)(?=[ ][0-9])'
                  r'[ ](?P<normal_value>[ 0-9.\-]*?)(?=[ ][a-zA-Z])'
                  r'[ ](?P<unit>.[ a-zA-Z/]*)',
                  test_result)
    if m is not None:
        normal_value = m.group('normal_value')
        unit = m.group('unit')
    else:
        m = re.search(r'.*?(?=[a-zA-Z][a-zA-Z])(?P<description>.*?)(?=[ ]Nil)'
                      r'[ ](?P<result>Nil).*',
                      test_result)
        normal_value = ''
        unit = ''
    if m is not None:
        description = m.group('description')
        result = m.group('result')
    else:
        description = test_result
        result = ''
    write_string = description + ',' + result + ',' + normal_value + ',' + unit
    print(write_string)
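One caveat with joining the fields by hand is that a comma inside any field would shift the columns. A minimal sketch of an alternative, assuming the parsed fields are already collected as tuples, is to let the standard csv module do the quoting. The `records` list and the `to_csv` helper are hypothetical, and the last row is invented purely to show the quoting.

```python
import csv
import io

# Hypothetical parsed records: (description, result, normal_value, unit)
records = [
    ('Haemoglobin', '13.5', '14-16', 'g/dl'),
    ('Random Urine Sugar', 'Nil', '', ''),
    ('Glucose, fasting', '90', '70 - 100', 'mg/dl'),  # invented row with a comma
]

def to_csv(rows):
    """Return the rows as CSV text, with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['description', 'result', 'normal_value', 'unit'])
    writer.writerows(rows)
    return buffer.getvalue()

print(to_csv(records))
```

To write a file instead, the same writer can be given a file opened with `open(path, 'w', newline='')`.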