Create pandas DataFrame columns by searching rows for substring values

Posted: 2020-08-26 02:58:11

标签: python-3.x pandas dataframe

I'm struggling to reshape this data into a DataFrame in Python. Can anyone help me with code that might get me there? The simplest solution seems to be creating columns based on text substrings found in the rows, but I can't find documentation on how to get from the rows to the shape I want.

Original Dataframe - no column headers, data all in rows

Desired Dataframe - bounding box rows to columns with uniform header, confidence to column

The structure of my response is as follows:

    {
      "status": "succeeded",
      "createdDateTime": "2020-08-28T19:21:29Z",
      "lastUpdatedDateTime": "2020-08-28T19:21:31Z",
      "analyzeResult": {
        "version": "3.0.0",
        "readResults": [{
          "page": 1,
          "angle": 0.1296,
          "width": 1700,
          "height": 2200,
          "unit": "pixel",
          "lines": [
            {
              "boundingBox": [182, 119, 383, 119, 383, 161, 182, 160],
              "text": "FORM 101",
              "words": [
                {"boundingBox": [183, 120, 305, 120, 305, 161, 182, 161], "text": "FORM", "confidence": 0.987},
                {"boundingBox": [318, 120, 381, 120, 382, 162, 318, 161], "text": "101", "confidence": 0.987}
              ]
            },
            {
              "boundingBox": [578, 129, 1121, 129, 1121, 163, 578, 162],
              "text": "The Commonwealth of Massachusetts",
              "words": [
                {"boundingBox": [579, 129, 634, 129, 634, 162, 579, 161], "text": "The", "confidence": 0.988},
                {"boundingBox": [641, 129, 868, 129, 866, 164, 640, 162], "text": "Commonwealth", "confidence": 0.979},
                {"boundingBox": [874, 129, 902, 129, 900, 164, 872, 164], "text": "of", "confidence": 0.988},
                {"boundingBox": [908, 129, 1120, 130, 1117, 163, 906, 164], "text": "Massachusetts", "confidence": 0.977}
              ]
            },
            {
              "boundingBox": [1341, 137, 1540, 138, 1540, 164, 1341, 163],
              "text": "DIA USE ONLY",
              "words": [
                {"boundingBox": [1342, 138, 1392, 138, 1392, 164, 1341, 163], "text": "DIA", "confidence": 0.983},
                {"boundingBox": [1397, 138, 1452, 139, 1452, 164, 1397, 164], "text": "USE", "confidence": 0.983},
                {"boundingBox": [1457, 139, 1539, 138, 1540, 164, 1457, 164], "text": "ONLY", "confidence": 0.986}
              ]
            },
            {
              "boundingBox": [459, 169, 1235, 168, 1235, 202, 459, 203],
              "text": "Department of Industrial Accidents - Department 101",
              "words": [
                {"boundingBox": [460, 170, 634, 170, 634, 203, 460, 204], "text": "Department", "confidence": 0.981},
                {"boundingBox": [640, 170, 669, 170, 669, 203, 640, 203], "text": "of", "confidence": 0.983},
                {"boundingBox": [676, 170, 821, 169, 821, 203, 676, 203], "text": "Industrial", "confidence": 0.981},
                {"boundingBox": [828, 169, 967, 169, 966, 203, 828, 203], "text": "Accidents", "confidence": 0.952},
                {"boundingBox": [973, 169, 993, 169, 993, 203, 973, 203], "text": "-", "confidence": 0.983},
                {"boundingBox": [1000, 169, 1176, 169, 1176, 203, 999, 203], "text": "Department", "confidence": 0.982},
                {"boundingBox": [1183, 169, 1236, 169, 1235, 203, 1182, 203], "text": "101", "confidence": 0.987}
              ]
            },
            {
              "boundingBox": [511, 205, 1189, 205, 1189, 233, 511, 234],
              "text": "1 Congress Street Suite 100 Boston Massachusetts 02114-2017",
              "words": [
                {"boundingBox": [513, 206, 520, 206, 519, 233, 512, 233], "text": "1", "confidence": 0.974},
                {"boundingBox": [525, 206, 625, 206, 624, 234, 524, 233], "text": "Congress", "confidence": 0.981},
                {"boundingBox": [630, 206, 702, 206, 701, 234, 629, 234], "text": "Street", "confidence": 0.977},
                {"boundingBox": [707, 206, 763, 206, 762, 234, 706, 234], "text": "Suite", "confidence": 0.983},
                {"boundingBox": [769, 206, 812, 206, 811, 234, 767, 234], "text": "100", "confidence": 0.983},
                {"boundingBox": [818, 206, 898, 206, 897, 234, 816, 234], "text": "Boston", "confidence": 0.983},
                {"boundingBox": [903, 206, 1059, 205, 1058, 234, 902, 234], "text": "Massachusetts", "confidence": 0.975},
                {"boundingBox": [1064, 205, 1189, 205, 1187, 233, 1063, 234], "text": "02114-2017", "confidence": 0.978}
              ]
            },
            {
              "boundingBox": [422, 236, 1279, 237, 1279, 263, 422, 263],
              "text": "Info Line 800-323-3249 ext. 470 in Mass. Outside Mass. - 617-727-4900 ext. 470",
              "words": [
                {"boundingBox": [423, 237, 472, 237, 472, 263, 422, 263], "text": "Info", "confidence": 0.983},
                {"boundingBox": [477, 237, 526, 237, 526, 264, 477, 264], "text": "Line", "confidence": 0.986},
                {"boundingBox": [531, 237, 674, 237, 674, 264, 531, 264], "text": "800-323-3249", "confidence": 0.977},
                {"boundingBox": [679, 237, 718, 237, 718, 264, 679, 264], "text": "ext.", "confidence": 0.982},
                {"boundingBox": [724, 237, 763, 237, 763, 264, 723, 264], "text": "470", "confidence": 0.986},
                {"boundingBox": [768, 237, 790, 237, 790, 264, 768, 264], "text": "in", "confidence": 0.987},
                {"boundingBox": [795, 237, 865, 237, 865, 264, 795, 264], "text": "Mass.", "confidence": 0.983},
                {"boundingBox": [870, 237, 953, 237, 953, 264, 870, 264], "text": "Outside", "confidence": 0.981},
                {"boundingBox": [958, 237, 1019, 237, 1020, 264, 958, 264], "text": "Mass.", "confidence": 0.984},
                {"boundingBox": [1025, 237, 1036, 237, 1037, 264, 1025, 264], "text": "-", "confidence": 0.983},
                {"boundingBox": [1042, 237, 1184, 237, 1185, 264, 1042, 264], "text": "617-727-4900", "confidence": 0.975},
                {"boundingBox": [1190, 237, 1229, 238, 1229, 264, 1190, 264], "text": "ext.", "confidence": 0.985},
                {"boundingBox": [1234, 238, 1278, 238, 1278, 264, 1234, 264], "text": "470", "confidence": 0.983}
              ]
            },
            {
              "boundingBox": [716, 264, 984, 266, 984, 293, 715, 292],
              "text": "http://www.mass.gov/dia",
              "words": [
                {"boundingBox": [717, 265, 985, 267, 984, 294, 716, 293], "text": "http://www.mass.gov/dia", "confidence": 0.952}
              ]
            },
            {
              "boundingBox": [398, 299, 1289, 299, 1289, 342, 398, 342],
              "text": "EMPLOYER'S FIRST REPORT OF INJURY",
              "words": [
                {"boundingBox": [399, 300, 693, 300, 693, 341, 399, 343], "text": "EMPLOYER'S", "confidence": 0.98},
                {"boundingBox": [702, 300, 836, 300, 836, 341, 702, 341], "text": "FIRST", "confidence": 0.982},
                {"boundingBox": [845, 300, 1036, 300, 1036, 341, 844, 341], "text": "REPORT", "confidence": 0.985},
                {"boundingBox": [1045, 300, 1105, 300, 1104, 342, 1044, 341], "text": "OF", "confidence": 0.988},
                {"boundingBox": [1113, 300, 1288, 299, 1287, 343, 1113, 342], "text": "INJURY", "confidence": 0.986}
              ]
            },
            {
              "boundingBox": [691, 354, 1005, 355, 1005, 395, 691, 393],
              "text": "OR FATALITY",
              "words": [
                {"boundingBox": [691, 354, 760, 355, 760, 395, 692, 394], "text": "OR", "confidence": 0.988},
                {"boundingBox": [768, 355, 1005, 356, 1003, 395, 768, 395], "text": "FATALITY", "confidence": 0.981}
              ]
            }
          ]
        }]
      }
    }
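As a side note, if the raw JSON response above (rather than the already-flattened key/value rows) is available, the desired one-row-per-word shape with uniform boundingBox and confidence columns can be built directly by walking the nested structure. A minimal sketch under that assumption, using a trimmed copy of the response (the column names `line_text` and `boundingBox_0`..`boundingBox_7` are my own choices, not part of the API):

```python
import pandas as pd

# Trimmed, hypothetical copy of the response above: one line, two words.
response = {
    "analyzeResult": {
        "readResults": [{
            "page": 1,
            "lines": [{
                "boundingBox": [182, 119, 383, 119, 383, 161, 182, 160],
                "text": "FORM 101",
                "words": [
                    {"boundingBox": [183, 120, 305, 120, 305, 161, 182, 161],
                     "text": "FORM", "confidence": 0.987},
                    {"boundingBox": [318, 120, 381, 120, 382, 162, 318, 161],
                     "text": "101", "confidence": 0.987},
                ],
            }],
        }]
    }
}

# One output row per word; keep the parent page/line as context columns
# and spread the 8-element boundingBox list into uniform columns.
rows = []
for page in response["analyzeResult"]["readResults"]:
    for line in page["lines"]:
        for word in line["words"]:
            rows.append({
                "page": page["page"],
                "line_text": line["text"],
                "text": word["text"],
                "confidence": word["confidence"],
                **{f"boundingBox_{i}": v
                   for i, v in enumerate(word["boundingBox"])},
            })
words = pd.DataFrame(rows)
print(words)
```

This sidesteps the string-splitting problem entirely, at the cost of needing the original nested response.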

1 answer:

Answer 0: (score: 0)

Since the data wasn't provided in a directly usable form, this substantially does what you want:

  1. Comments explain the approach
  2. There is more work to do on linekey, but I can't see the relationship between the actual data and the result you posted as an image
import re
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {
        0: [
            "analyzeResult_readResults_0_lines_0_text",
            *[f"analyzeResult_readResults_0_lines_0_words_0_boundingBox_{i}" for i in range(8)],
            "analyzeResult_readResults_0_lines_0_words_0_text",
            "analyzeResult_readResults_0_lines_0_words_0_confidence",
            *[f"analyzeResult_readResults_0_lines_0_words_1_boundingBox_{i}" for i in range(8)],
            "analyzeResult_readResults_0_lines_0_words_1_text",
            "analyzeResult_readResults_0_lines_0_words_1_confidence",
            *[f"analyzeResult_readResults_0_lines_1_boundingBox_{i}" for i in range(8)],
        ],
        1: ["FORM 101", 183, 120, 305, 120, 305, 161, 182, 161, "FORM", 0.987,
            318, 120, 381, 120, 382, 162, 318, 161, 101, 0.987,
            578, 129, 1121, 129, 1121, 163, 578, 162],
    },
    index=range(17, 46),
)

df = (
df
    .rename(columns={0:"key",1:"val"})
    .assign(
        b=lambda x: x["key"].str.extract("(.*)_bounding"),
        c=lambda x: x["key"].str.extract("(.*)_confidence"),
        # linekey is everything before "_bounding" or "_confidence". pull the two together
        linekey=lambda x: np.where(x["b"].isna(), 
                             np.where(x["c"].isna(), x["key"], x["c"]), 
                             x["b"]),
        # column key is every thing after line key minus leading "_"
        colkey=lambda x: x.apply(lambda r: r["key"].replace(r["linekey"], "").strip("_"), axis=1)
    )
    .assign(
        # cleanup special case line keys...
        colkey=lambda x: np.where(x["colkey"]=="", "Value", x["colkey"].replace("confidence","Confidence"))
    )
    # remove working columns
    .drop(columns=["b","c","key"])
    # mixed values and strings so use "first" and unstack to get to desired layout
    .groupby(["linekey","colkey"]).agg({"val":"first"}).unstack()
)

print(df.to_string())

Output

                                                        val                                                                                                                          
colkey                                           Confidence     Value boundingBox_0 boundingBox_1 boundingBox_2 boundingBox_3 boundingBox_4 boundingBox_5 boundingBox_6 boundingBox_7
linekey                                                                                                                                                                              
analyzeResult_readResults_0_lines_0_text                NaN  FORM 101           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN
analyzeResult_readResults_0_lines_0_words_0           0.987       NaN           183           120           305           120           305           161           182           161
analyzeResult_readResults_0_lines_0_words_0_text        NaN      FORM           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN
analyzeResult_readResults_0_lines_0_words_1           0.987       NaN           318           120           381           120           382           162           318           161
analyzeResult_readResults_0_lines_0_words_1_text        NaN       101           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN
analyzeResult_readResults_0_lines_1                     NaN       NaN           578           129          1121           129          1121           163           578           162
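The same linekey/colkey split can also be sketched more compactly with a single named-group regex and `pivot`, assuming the keys always end in an optional `_boundingBox_n` or `_confidence` tail (note the column names here stay `confidence`, without the `Confidence`/`Value` renaming applied above):

```python
import pandas as pd

# Hypothetical flattened rows in the same style as the answer above.
df = pd.DataFrame({
    "key": [
        "analyzeResult_readResults_0_lines_0_text",
        "analyzeResult_readResults_0_lines_0_words_0_boundingBox_0",
        "analyzeResult_readResults_0_lines_0_words_0_confidence",
    ],
    "val": ["FORM 101", 183, 0.987],
})

# One regex with named groups replaces the two str.extract calls plus
# np.where: everything before an optional "_boundingBox_n"/"_confidence"
# tail is the line key; the tail (or "Value" when absent) is the column key.
parts = df["key"].str.extract(
    r"^(?P<linekey>.*?)(?:_(?P<colkey>boundingBox_\d+|confidence))?$"
)
parts["colkey"] = parts["colkey"].fillna("Value")
out = (
    df.join(parts)
      .pivot(index="linekey", columns="colkey", values="val")
)
print(out)
```

`pivot` works here because each (linekey, colkey) pair occurs at most once; with duplicated pairs you would fall back to the `groupby(...).agg(...).unstack()` form used in the answer.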