Question

I'm using Tesseract to extract text from scanned PDFs. The output string looks like this:

Haemoglobin 13.5 14-16 g/dl
Random Blood Sugar 186 60 - 160 mg/dl
Random Urine Sugar Nil
¢ Blood Urea 43 14-40 mg/dl
4 — Serum Creatinine 2.13 0.4-1.5 mg/dl
Serum Uric Acid 4.9 3.4-7.0 mg/dl
Serum Sodium 142 135 - 150 meq/L
/ Serum Potassium 2.6 3.5-5.0 meq/L
Total Cholesterol] 146 110 - 160 mg/dl
Triglycerides 162 60 - 180 mg/d]

Now I need to load this into a DataFrame or CSV, with the whole text description in one column and the values in the other columns.

**Haemoglobin**            13.5   14-16     g/dl
**Random Blood Sugar**     186    60 - 160  mg/dl

So far, the best I've managed is this:

import pandas as pd

text = text.split('\n')
text = [x.split(' ') for x in text]
df = pd.DataFrame(text, columns=['Header', 'Detail', 'a', 'e', 'b', 'c', 'd', 'f'])
df

    Header    Detail   a      e     b      c      d  f
0 Haemoglobin 13.5    14-16   g/dl  None   None  None  None
1 Random      Blood   Sugar   186   60      -     160  mg/dl
2 Random      Urine   Sugar   Nil   None   None  None  None

Please help!

Answer 1

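I should point out that this takes a fair amount of work and, honestly, you haven't really tried much yet. But to get you started, here is some code that cleans up a few of the obvious problems in the input: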

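import re

def isnum(x):
    """Return True if x can be parsed as a number."""
    try:
        float(x)
        return True
    except ValueError:
        return False

def clean_line(lnin):
    # Strip the leading OCR garbage (everything before the first letter)
    ln = re.sub('^[^A-Za-z]+', '', lnin).split()
    # The header ends at the first numeric token. If there is none
    # (e.g. "Random Urine Sugar Nil"), treat the last token as the result.
    ind = len(ln) - 1
    for i in range(len(ln)):
        if isnum(ln[i]):
            ind = i
            break
    header = ' '.join(ln[:ind])
    ln = [header] + ln[ind:]
    # Merge a spaced range such as "60 - 160" into a single "60-160" token
    if '-' in ln:
        ind = ln.index('-')
        ln[ind-1] = ln[ind-1] + '-' + ln[ind+1]
        del ln[ind:ind+2]
    return ln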

Clean each line with the clean_line function. You can then feed the result into a DataFrame.
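
As a rough sketch of that last step, the snippet below runs every line through a condensed copy of clean_line (repeated here only so the snippet runs on its own) and pads each row to a fixed width. The column names in COLUMNS and the helper to_rows are assumptions made for illustration, not part of the answer above.

```python
import re

# Hypothetical column names, chosen for illustration only
COLUMNS = ['Header', 'Result', 'Range', 'Unit']

def isnum(x):
    try:
        float(x)
        return True
    except ValueError:
        return False

def clean_line(lnin):
    # Condensed copy of clean_line; when no numeric token is found,
    # the last token (e.g. "Nil") is taken as the result (an assumption).
    ln = re.sub('^[^A-Za-z]+', '', lnin).split()
    ind = next((i for i, tok in enumerate(ln) if isnum(tok)), len(ln) - 1)
    ln = [' '.join(ln[:ind])] + ln[ind:]
    if '-' in ln:
        ind = ln.index('-')
        # Merge "60", "-", "160" into one "60-160" token
        ln[ind-1:ind+2] = [ln[ind-1] + '-' + ln[ind+1]]
    return ln

def to_rows(text):
    """Clean every non-empty line and pad it to len(COLUMNS) fields."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = clean_line(line)[:len(COLUMNS)]
        rows.append(fields + [''] * (len(COLUMNS) - len(fields)))
    return rows

sample = ('Haemoglobin 13.5 14-16 g/dl\n'
          'Random Blood Sugar 186 60 - 160 mg/dl\n'
          'Random Urine Sugar Nil\n'
          '¢ Blood Urea 43 14-40 mg/dl')
rows = to_rows(sample)
for row in rows:
    print(row)
```

With pandas installed, pd.DataFrame(rows, columns=COLUMNS) should then give one row per test, with the full test name in the first column.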

Answer 2

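Start at the end and work backwards, because the rest of the record appears to follow a fixed format. That is, working backwards:

a string for the unit (no spaces) : number : dash : number : number : the text you want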

Haemoglobin 13.5 14-16 g/dl
Field 5 (all characters backwards from end until space reached) = g/dl
Field 4 (jump over space, all characters backwards until space or dash reached) = 16
Field 3 (jump over space if present, pick up dash) = -
Field 2 (jump over space if present, all characters backwards until space reached) = 14
Field 1 (jump over space, all characters backwards until space reached) = 13.5
Field 0 (jump over space and take the rest) = Haemoglobin

Total Cholesterol] 146 110 - 160 mg/dl
Field 5 (all characters backwards from end until space reached) = mg/dl
Field 4 (jump over space, all characters backwards until space or dash reached) = 160
Field 3 (jump over space if present, pick up dash) = -
Field 2 (jump over space if present, all characters backwards until space reached) = 110
Field 1 (jump over space, all characters backwards until space reached) = 146
Field 0 (jump over space and take the rest) = Total Cholesterol]
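
The right-to-left walk above could be sketched in Python roughly as follows. This is only one possible reading of the steps, assuming each line ends with a unit token and contains a hyphen-separated range. The helper name parse_backwards is hypothetical, and lines without a range (such as "Random Urine Sugar Nil") simply return None here.

```python
def parse_backwards(line):
    """Split a result line from the right: unit, high, low, result, text."""
    try:
        # Field 5: everything after the last space is the unit
        rest, unit = line.rstrip().rsplit(None, 1)
        # Fields 4 and 3: the upper bound and the dash before it
        low_part, dash, high = rest.rpartition('-')
        if not dash:
            return None  # no range on this line
        # Field 2: the lower bound, just before the dash
        rest, low = low_part.rsplit(None, 1)
        # Field 1: the measured result; Field 0: whatever text remains
        text, result = rest.rsplit(None, 1)
    except ValueError:
        return None  # too few fields to unpack
    return [text, result, low.strip(), high.strip(), unit]

print(parse_backwards('Haemoglobin 13.5 14-16 g/dl'))
print(parse_backwards('Total Cholesterol] 146 110 - 160 mg/dl'))
```

Leading or trailing OCR debris in the text field (for example the "]" after "Total Cholesterol") is left untouched by this sketch and would need a separate clean-up pass.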

Answer 3

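Using regular expressions, the sample code below parses the text into CSV strings with the fields: description, result, normal_value, unit.

Note that the list test_results would normally be read from a file with: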

with open('name_test_file') as test_file:
    test_results = test_file.read().splitlines()

import re
tests = 'Haemoglobin 13.5 14-16 g/dl\nRandom Blood Sugar 186 60 - 160 mg/dl\n'\
    'Random Urine Sugar Nil\n¢ Blood Urea 43 14-40 mg/dl\n'\
    '4 — Serum Creatinine 2.13 0.4-1.5 mg/dl\n'\
    'Serum Uric Acid 4.9 3.4-7.0 mg/dl\nSerum Sodium 142 135 - 150 meq/L\n'\
    '/ Serum Potassium 2.6 3.5-5.0 meq/L\n'\
    'Total Cholesterol] 146 110 - 160 mg/dl\n'\
    'Triglycerides 162 60 - 180 mg/d]\n'

test_results = tests.splitlines()

for test_result in test_results:
    print('input :', test_result)

    m = re.search(r'.*?(?=[a-zA-Z][a-zA-Z])(?P<description>.*?)(?=[ ][0-9])'
              r'[ ](?P<result>[0-9.]*?)(?=[ ][0-9])'
              r'[ ](?P<normal_value>[ 0-9.\-]*?)(?=[ ][a-zA-Z])'
              r'[ ](?P<unit>.[ a-zA-Z/]*)',
              test_result)

    if m is not None:
        normal_value = m.group('normal_value')
        unit = m.group('unit')

    else:
        m = re.search(r'.*?(?=[a-zA-Z][a-zA-Z])(?P<description>.*?)(?=[ ]Nil)'
                  r'[ ](?P<result>Nil).*',
                  test_result)
        normal_value = ''
        unit = ''

    if m is not None:
        description = m.group('description')
        result = m.group('result')

    else:
        description = test_result
        result = ''

    write_string = description + ',' + result + ',' + normal_value + ',' + unit
    print(write_string)
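
Joining fields with ',' by hand works for this data, but it would break if a description ever contained a comma. As a hedged alternative, the sketch below hands already-parsed rows to the standard csv module, which quotes such fields when needed. The helper write_csv and the sample rows are illustrative assumptions. The rows mirror what the loop above extracts for three of the input lines.

```python
import csv
import io

FIELDS = ['description', 'result', 'normal_value', 'unit']

def write_csv(rows, out):
    """Write a header plus one CSV record per parsed test result."""
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    writer.writerows(rows)

# Rows as (description, result, normal_value, unit) tuples
rows = [
    ('Haemoglobin', '13.5', '14-16', 'g/dl'),
    ('Random Blood Sugar', '186', '60 - 160', 'mg/dl'),
    ('Random Urine Sugar', 'Nil', '', ''),
]

buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, 'w', newline='') in place of the StringIO buffer, where path is whatever output filename you choose.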
