Create pandas DataFrame columns by searching rows for substring values

Posted: 2020-08-26 02:58:11

标签: python-3.x pandas dataframe

I'm struggling to reshape this data into a DataFrame in Python. Can anyone help me with code that might get me there? The simplest solution seems to be creating columns based on text substrings found in the rows, but I can't find documentation on how to get from the rows to the shape I want.

Original Dataframe - no column headers, data all in rows

Desired Dataframe - bounding box rows to columns with uniform header, confidence to column

The structure of my response is as follows:

    {
      "status": "succeeded",
      "createdDateTime": "2020-08-28T19:21:29Z",
      "lastUpdatedDateTime": "2020-08-28T19:21:31Z",
      "analyzeResult": {
        "version": "3.0.0",
        "readResults": [{
          "page": 1,
          "angle": 0.1296,
          "width": 1700,
          "height": 2200,
          "unit": "pixel",
          "lines": [
            {
              "boundingBox": [182, 119, 383, 119, 383, 161, 182, 160],
              "text": "FORM 101",
              "words": [
                {"boundingBox": [183, 120, 305, 120, 305, 161, 182, 161], "text": "FORM", "confidence": 0.987},
                {"boundingBox": [318, 120, 381, 120, 382, 162, 318, 161], "text": "101", "confidence": 0.987}
              ]
            },
            {
              "boundingBox": [578, 129, 1121, 129, 1121, 163, 578, 162],
              "text": "The Commonwealth of Massachusetts",
              "words": [
                {"boundingBox": [579, 129, 634, 129, 634, 162, 579, 161], "text": "The", "confidence": 0.988},
                {"boundingBox": [641, 129, 868, 129, 866, 164, 640, 162], "text": "Commonwealth", "confidence": 0.979},
                {"boundingBox": [874, 129, 902, 129, 900, 164, 872, 164], "text": "of", "confidence": 0.988},
                {"boundingBox": [908, 129, 1120, 130, 1117, 163, 906, 164], "text": "Massachusetts", "confidence": 0.977}
              ]
            },
            {
              "boundingBox": [1341, 137, 1540, 138, 1540, 164, 1341, 163],
              "text": "DIA USE ONLY",
              "words": [
                {"boundingBox": [1342, 138, 1392, 138, 1392, 164, 1341, 163], "text": "DIA", "confidence": 0.983},
                {"boundingBox": [1397, 138, 1452, 139, 1452, 164, 1397, 164], "text": "USE", "confidence": 0.983},
                {"boundingBox": [1457, 139, 1539, 138, 1540, 164, 1457, 164], "text": "ONLY", "confidence": 0.986}
              ]
            },
            {
              "boundingBox": [459, 169, 1235, 168, 1235, 202, 459, 203],
              "text": "Department of Industrial Accidents - Department 101",
              "words": [
                {"boundingBox": [460, 170, 634, 170, 634, 203, 460, 204], "text": "Department", "confidence": 0.981},
                {"boundingBox": [640, 170, 669, 170, 669, 203, 640, 203], "text": "of", "confidence": 0.983},
                {"boundingBox": [676, 170, 821, 169, 821, 203, 676, 203], "text": "Industrial", "confidence": 0.981},
                {"boundingBox": [828, 169, 967, 169, 966, 203, 828, 203], "text": "Accidents", "confidence": 0.952},
                {"boundingBox": [973, 169, 993, 169, 993, 203, 973, 203], "text": "-", "confidence": 0.983},
                {"boundingBox": [1000, 169, 1176, 169, 1176, 203, 999, 203], "text": "Department", "confidence": 0.982},
                {"boundingBox": [1183, 169, 1236, 169, 1235, 203, 1182, 203], "text": "101", "confidence": 0.987}
              ]
            },
            {
              "boundingBox": [511, 205, 1189, 205, 1189, 233, 511, 234],
              "text": "1 Congress Street Suite 100 Boston Massachusetts 02114-2017",
              "words": [
                {"boundingBox": [513, 206, 520, 206, 519, 233, 512, 233], "text": "1", "confidence": 0.974},
                {"boundingBox": [525, 206, 625, 206, 624, 234, 524, 233], "text": "Congress", "confidence": 0.981},
                {"boundingBox": [630, 206, 702, 206, 701, 234, 629, 234], "text": "Street", "confidence": 0.977},
                {"boundingBox": [707, 206, 763, 206, 762, 234, 706, 234], "text": "Suite", "confidence": 0.983},
                {"boundingBox": [769, 206, 812, 206, 811, 234, 767, 234], "text": "100", "confidence": 0.983},
                {"boundingBox": [818, 206, 898, 206, 897, 234, 816, 234], "text": "Boston", "confidence": 0.983},
                {"boundingBox": [903, 206, 1059, 205, 1058, 234, 902, 234], "text": "Massachusetts", "confidence": 0.975},
                {"boundingBox": [1064, 205, 1189, 205, 1187, 233, 1063, 234], "text": "02114-2017", "confidence": 0.978}
              ]
            },
            {
              "boundingBox": [422, 236, 1279, 237, 1279, 263, 422, 263],
              "text": "Info Line 800-323-3249 ext. 470 in Mass. Outside Mass. - 617-727-4900 ext. 470",
              "words": [
                {"boundingBox": [423, 237, 472, 237, 472, 263, 422, 263], "text": "Info", "confidence": 0.983},
                {"boundingBox": [477, 237, 526, 237, 526, 264, 477, 264], "text": "Line", "confidence": 0.986},
                {"boundingBox": [531, 237, 674, 237, 674, 264, 531, 264], "text": "800-323-3249", "confidence": 0.977},
                {"boundingBox": [679, 237, 718, 237, 718, 264, 679, 264], "text": "ext.", "confidence": 0.982},
                {"boundingBox": [724, 237, 763, 237, 763, 264, 723, 264], "text": "470", "confidence": 0.986},
                {"boundingBox": [768, 237, 790, 237, 790, 264, 768, 264], "text": "in", "confidence": 0.987},
                {"boundingBox": [795, 237, 865, 237, 865, 264, 795, 264], "text": "Mass.", "confidence": 0.983},
                {"boundingBox": [870, 237, 953, 237, 953, 264, 870, 264], "text": "Outside", "confidence": 0.981},
                {"boundingBox": [958, 237, 1019, 237, 1020, 264, 958, 264], "text": "Mass.", "confidence": 0.984},
                {"boundingBox": [1025, 237, 1036, 237, 1037, 264, 1025, 264], "text": "-", "confidence": 0.983},
                {"boundingBox": [1042, 237, 1184, 237, 1185, 264, 1042, 264], "text": "617-727-4900", "confidence": 0.975},
                {"boundingBox": [1190, 237, 1229, 238, 1229, 264, 1190, 264], "text": "ext.", "confidence": 0.985},
                {"boundingBox": [1234, 238, 1278, 238, 1278, 264, 1234, 264], "text": "470", "confidence": 0.983}
              ]
            },
            {
              "boundingBox": [716, 264, 984, 266, 984, 293, 715, 292],
              "text": "http://www.mass.gov/dia",
              "words": [
                {"boundingBox": [717, 265, 985, 267, 984, 294, 716, 293], "text": "http://www.mass.gov/dia", "confidence": 0.952}
              ]
            },
            {
              "boundingBox": [398, 299, 1289, 299, 1289, 342, 398, 342],
              "text": "EMPLOYER'S FIRST REPORT OF INJURY",
              "words": [
                {"boundingBox": [399, 300, 693, 300, 693, 341, 399, 343], "text": "EMPLOYER'S", "confidence": 0.98},
                {"boundingBox": [702, 300, 836, 300, 836, 341, 702, 341], "text": "FIRST", "confidence": 0.982},
                {"boundingBox": [845, 300, 1036, 300, 1036, 341, 844, 341], "text": "REPORT", "confidence": 0.985},
                {"boundingBox": [1045, 300, 1105, 300, 1104, 342, 1044, 341], "text": "OF", "confidence": 0.988},
                {"boundingBox": [1113, 300, 1288, 299, 1287, 343, 1113, 342], "text": "INJURY", "confidence": 0.986}
              ]
            },
            {
              "boundingBox": [691, 354, 1005, 355, 1005, 395, 691, 393],
              "text": "OR FATALITY",
              "words": [
                {"boundingBox": [691, 354, 760, 355, 760, 395, 692, 394], "text": "OR", "confidence": 0.988},
                {"boundingBox": [768, 355, 1005, 356, 1003, 395, 768, 395], "text": "FATALITY", "confidence": 0.981}
              ]
            }
          ]
        }]
      }
    }
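As a side note, if the raw JSON response above (rather than the already-flattened key/value rows) is available, the desired one-row-per-word shape with uniform boundingBox and confidence columns can be built directly by walking the nested structure. A minimal sketch under that assumption, using a trimmed copy of the response (the column names `line_text` and `boundingBox_0`..`boundingBox_7` are my own choices, not part of the API):

```python
import pandas as pd

# Trimmed, hypothetical copy of the response above: one line, two words.
response = {
    "analyzeResult": {
        "readResults": [{
            "page": 1,
            "lines": [{
                "boundingBox": [182, 119, 383, 119, 383, 161, 182, 160],
                "text": "FORM 101",
                "words": [
                    {"boundingBox": [183, 120, 305, 120, 305, 161, 182, 161],
                     "text": "FORM", "confidence": 0.987},
                    {"boundingBox": [318, 120, 381, 120, 382, 162, 318, 161],
                     "text": "101", "confidence": 0.987},
                ],
            }],
        }]
    }
}

# One output row per word; keep the parent page/line as context columns
# and spread the 8-element boundingBox list into uniform columns.
rows = []
for page in response["analyzeResult"]["readResults"]:
    for line in page["lines"]:
        for word in line["words"]:
            rows.append({
                "page": page["page"],
                "line_text": line["text"],
                "text": word["text"],
                "confidence": word["confidence"],
                **{f"boundingBox_{i}": v
                   for i, v in enumerate(word["boundingBox"])},
            })
words = pd.DataFrame(rows)
print(words)
```

This sidesteps the string-splitting problem entirely, at the cost of needing the original nested response.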

1 answer:

Answer 0: (score: 0)

Since the data wasn't provided in a directly usable form, this substantially does what you want:

  1. Comments explain the approach
  2. There is more work to do on linekey, but I can't see the relationship between the actual data and the result you posted as an image
import re
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {
        0: [
            "analyzeResult_readResults_0_lines_0_text",
            *[f"analyzeResult_readResults_0_lines_0_words_0_boundingBox_{i}" for i in range(8)],
            "analyzeResult_readResults_0_lines_0_words_0_text",
            "analyzeResult_readResults_0_lines_0_words_0_confidence",
            *[f"analyzeResult_readResults_0_lines_0_words_1_boundingBox_{i}" for i in range(8)],
            "analyzeResult_readResults_0_lines_0_words_1_text",
            "analyzeResult_readResults_0_lines_0_words_1_confidence",
            *[f"analyzeResult_readResults_0_lines_1_boundingBox_{i}" for i in range(8)],
        ],
        1: ["FORM 101", 183, 120, 305, 120, 305, 161, 182, 161, "FORM", 0.987,
            318, 120, 381, 120, 382, 162, 318, 161, 101, 0.987,
            578, 129, 1121, 129, 1121, 163, 578, 162],
    },
    index=range(17, 46),
)

df = (
df
    .rename(columns={0:"key",1:"val"})
    .assign(
        b=lambda x: x["key"].str.extract("(.*)_bounding"),
        c=lambda x: x["key"].str.extract("(.*)_confidence"),
        # linekey is everything before "_bounding" or "_confidence". pull the two together
        linekey=lambda x: np.where(x["b"].isna(), 
                             np.where(x["c"].isna(), x["key"], x["c"]), 
                             x["b"]),
        # column key is every thing after line key minus leading "_"
        colkey=lambda x: x.apply(lambda r: r["key"].replace(r["linekey"], "").strip("_"), axis=1)
    )
    .assign(
        # cleanup special case line keys...
        colkey=lambda x: np.where(x["colkey"]=="", "Value", x["colkey"].replace("confidence","Confidence"))
    )
    # remove working columns
    .drop(columns=["b","c","key"])
    # mixed values and strings so use "first" and unstack to get to desired layout
    .groupby(["linekey","colkey"]).agg({"val":"first"}).unstack()
)

print(df.to_string())

Output

                                                        val                                                                                                                          
colkey                                           Confidence     Value boundingBox_0 boundingBox_1 boundingBox_2 boundingBox_3 boundingBox_4 boundingBox_5 boundingBox_6 boundingBox_7
linekey                                                                                                                                                                              
analyzeResult_readResults_0_lines_0_text                NaN  FORM 101           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN
analyzeResult_readResults_0_lines_0_words_0           0.987       NaN           183           120           305           120           305           161           182           161
analyzeResult_readResults_0_lines_0_words_0_text        NaN      FORM           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN
analyzeResult_readResults_0_lines_0_words_1           0.987       NaN           318           120           381           120           382           162           318           161
analyzeResult_readResults_0_lines_0_words_1_text        NaN       101           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN
analyzeResult_readResults_0_lines_1                     NaN       NaN           578           129          1121           129          1121           163           578           162
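The same linekey/colkey split can also be sketched more compactly with a single named-group regex and `pivot`, assuming the keys always end in an optional `_boundingBox_n` or `_confidence` tail (note the column names here stay `confidence`, without the `Confidence`/`Value` renaming applied above):

```python
import pandas as pd

# Hypothetical flattened rows in the same style as the answer above.
df = pd.DataFrame({
    "key": [
        "analyzeResult_readResults_0_lines_0_text",
        "analyzeResult_readResults_0_lines_0_words_0_boundingBox_0",
        "analyzeResult_readResults_0_lines_0_words_0_confidence",
    ],
    "val": ["FORM 101", 183, 0.987],
})

# One regex with named groups replaces the two str.extract calls plus
# np.where: everything before an optional "_boundingBox_n"/"_confidence"
# tail is the line key; the tail (or "Value" when absent) is the column key.
parts = df["key"].str.extract(
    r"^(?P<linekey>.*?)(?:_(?P<colkey>boundingBox_\d+|confidence))?$"
)
parts["colkey"] = parts["colkey"].fillna("Value")
out = (
    df.join(parts)
      .pivot(index="linekey", columns="colkey", values="val")
)
print(out)
```

`pivot` works here because each (linekey, colkey) pair occurs at most once; with duplicated pairs you would fall back to the `groupby(...).agg(...).unstack()` form used in the answer.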