将扫描的PDF提取文本导入CSV

时间:2018-06-01 16:22:28

标签: python regex csv dataframe python-tesseract

我使用Tesseract从扫描的PDF中提取文本。我有输出字符串就像这样..

Haemoglobin 13.5 14-16 g/dl
Random Blood Sugar 186 60 - 160 mg/dl
Random Urine Sugar Nil
¢ Blood Urea 43 14-40 mg/dl
4 — Serum Creatinine 2.13 0.4-1.5 mg/dl
Serum Uric Acid 4.9 3.4-7.0 mg/dl
Serum Sodium 142 135 - 150 meq/L
/ Serum Potassium 2.6 3.5-5.0 meq/L
Total Cholesterol] 146 110 - 160 mg/dl
Triglycerides 162 60 - 180 mg/d]

现在我必须将它提供给数据帧或csv,其中所有文本都在一列中,而值则在其他列中。

**Haemoglobin**            13.5   14-16     g/dl
**Random Blood Sugar**     186    60 - 160  mg/dl

到目前为止,我能做到的最好的就是这样......

  text = text.split('\n')
  text = [x.split(' ') for x in text]
df = pd.DataFrame(text, columns['Header','Detail','a','e,','b','c','d','f'])
df

    Header    Detail   a      e     b      c      d  f
0 Haemoglobin 13.5    14-16   g/dl  None   None  None  None
1 Random      Blood   Sugar   186   60      -     160  mg/dl
2 Random      Urine   Sugar   Nil   None   None  None  None

请帮助!!

3 个答案:

答案 0 :(得分:1)

我应该指出,这需要大量的工作,老实说,你还没有尝试过任何东西。但是为了帮助您在这里找到一个良好的代码,可以清除输入中的一些明显问题:

import re
def isnum(x):
    try:
        float(x)
        return True
    except:
        return False

def clean_line(lnin):
    # clean the leading garbage
    ln=re.sub('^[^A-Za-z]+','',lnin).split()
    for i in range(len(ln)):
        if isnum(ln[i]):
            ind=i
            break
    Header=' '.join(ln[:ind])
    ln=[Header]+ln[ind:]
    if '-' in ln:
        ind=ln.index('-')
        ln[ind-1]=ln[ind-1]+'-'+ln[ind+1]
        del ln[ind:ind+2]
    return ln

使用clean_line功能清除每一行。然后,您可以将其提供给数据帧。

答案 1 :(得分:-1)

从最后开始向后工作,因为记录的其余部分似乎是固定格式,即向后工作,

字符串表示单位(没有空格): 号码: 破折号: 号码: 号码: 你想要的文字

Haemoglobin 13.5 14-16 g/dl
Field 5 (all characters backwards from end until space reached) = g/gl
Field 4 (jump over space, all characters backwards until space or dash reached) = 16
Field 3 (jump over space if present, pick up dash) = -
Field 2 (jump over space if present, all characters backwards until space reached) = 14
Field 1 (jump over space, all characters backwards until space reached) = 13.5
Field 0 (jump over space and take the rest) = Haemoglobin

Total Cholesterol] 146 110 - 160 mg/dl
Field 5 (all characters backwards from end until space reached) = mg/dl
Field 4 (jump over space, all characters backwards until space or dash reached) = 160
Field 3 (jump over space if present, pick up dash) = -
Field 2 (jump over space if present, all characters backwards until space reached) = 110
Field 1 (jump over space, all characters backwards until space reached) = 146
Field 0 (jump over space and take the rest) = Total Cholesterol]

答案 2 :(得分:-1)

使用正则表达式,下面的示例代码将使用标记将文本解析为CSV字符串:description,result,normal_value,unit。

请注意,列表test_results通常使用以下命令从文件中读取:

以open(' name_test_file')作为test_file:     test_results = test_file.read()。splitlines()

import re
tests = 'Haemoglobin 13.5 14-16 g/dl\nRandom Blood Sugar 186 60 - 160 mg/dl\n'\
    'Random Urine Sugar Nil\n¢ Blood Urea 43 14-40 mg/dl\n'\
    '4 — Serum Creatinine 2.13 0.4-1.5 mg/dl\n'\
    'Serum Uric Acid 4.9 3.4-7.0 mg/dl\nSerum Sodium 142 135 - 150 meq/L\n'\
    '/ Serum Potassium 2.6 3.5-5.0 meq/L\n'\
    'Total Cholesterol] 146 110 - 160 mg/dl\n'\
    'Triglycerides 162 60 - 180 mg/d]\n'

test_results = tests.splitlines()

for test_result in test_results:
    print('input :', test_result)

    m = re.search(r'.*?(?=[a-zA-Z][a-zA-Z])(?P<description>.*?)(?=[ ][0-9])'
              r'[ ](?P<result>[0-9.]*?)(?=[ ][0-9])'
              r'[ ](?P<normal_value>[ 0-9.\-]*?)(?=[ ][a-zA-Z])'
              r'[ ](?P<unit>.[ a-zA-Z/]*)',
              test_result)

    if m is not None:
        normal_value = m.group('normal_value')
        unit = m.group('unit')

    else:
        m = re.search(r'.*?(?=[a-zA-Z][a-zA-Z])(?P<description>.*?)(?=[ ]Nil)'
                  r'[ ](?P<result>Nil).*',
                  test_result)
        normal_value = ''
        unit = ''

    if m is not None:
        description = m.group('description')
        result = m.group('result')

    else:
        description = test_result
        result = ''

    write_string = description + ',' + result + ',' + normal_value + ',' + unit
    print(write_string)