从结构不良的XML文件中提取数据

时间:2017-02-09 21:29:20

标签: xml sed grep cygwin data-science

我正在使用生理数据,并且具有非常大的,格式不正确的XML文件,其中包含分析所需的所有原始数据。我们之前在autohotkey中使用了一个正则表达式样式方法来提取数据,但它表现不佳。我一直在寻找解析或提取数据的方法 有效地从这些文件。

我遇到的问题是XML文件结构无效或者无法构建一个好的sed命令来按照我们需要的方式提取和拼凑数据。也就是说整体数据结构的可变性很小,所以我认为正则表达式可以用来有效地从文件中提取数据。到目前为止,我一直在尝试使用grep / sed和各种XML解析工具。

Input: XML (poorly structured, 500-1500 mb)

Desired Output: CSV

VitalSigns:

1 VitalSigns, Date, Time, Par, Value
2 VitalSigns, Date, Time, Par, Value
...
end

Waveforms:

1 Waveforms, Date, Time, Waveform Channel, value 1, value 2, ... , value 480
2 Waveforms, Date, Time, Waveform Channel, value 1, value 2, ... , value 480
...
end

以下是XML文件的示例: Dropbox link for sample file

      <VitalSigns Time="01/27/2017 10:33:04 PM">
            <VS>
                  <Par>PVC</Par>
                  <Value UOM="Bpm">0</Value>
                  <AlarmLimitHigh Label="PVC HI">6</AlarmLimitHigh>
            </VS>
            <VS>
                  <Par>RESP</Par>
                  <Value UOM="BrMin">68</Value>
                  <AlarmLimitLow Label="RESP LO">12</AlarmLimitLow>
                  <AlarmLimitHigh Label="RESP HI">60</AlarmLimitHigh>
            </VS>
            <VS>
                  <Par>SPO2-%</Par>
                  <Value UOM="%" Q="2">100</Value>
                  <AlarmLimitLow Label="SPO2-% LO">93</AlarmLimitLow>
                  <AlarmLimitHigh Label="SPO2-% HI">101</AlarmLimitHigh>
            </VS>
      </VitalSigns>
      <Waveforms Time="01/27/2017 10:33:04 PM">
            <WaveformData Channel="I" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
            <WaveformData Channel="II" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
            <WaveformData Channel="SPO2" UOM="Uncalib">370,741,741,1112,1117,1122,1122,1128,1128,1128,1128,1128,1138,1149,1149,1160,1160,1160,1160,1160,1176,1192,1192,1208,1208,1208,1208,1208,1216,1224,1224,1232,1232,1232,1232,1232,1242,1253,1253,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1277,1290,1290,1304,1304,1304,1304,1304,1314,1325,1325,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1338,1341,1341,1344,1344,1344,1344,1344,1338,1333,1333,1328,1328,1328,1328,1328,1325,1322,1322,1320,1320,1320,1320,1320,1330,1341,1341,1352,1352,1352,1352,1352,1373,1394,1394,1416,1416,1416,1416,1416,1434,1453,1453,1472,1472,1472,1472,1472,1485,1498,1498,1512,1512,1512,1512,1512,1520,1528,1528,1536,1536,1536,1536,1536,1533,1530,1530,1528,1528,1528,1528,1528,1514,1501,1501,1488,1488,1488,1488,1488,1464,1440,1440,1416,1416,1416,1416,1416,1397,1378,1378,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1306,1309,1309,1312,1312,1312,1312,1312,1336,1360,1360,1384,1384,1384,1384,1384,1413,1442,1442,1472,1472,1472,1472,1472,1477,1482,1482,1488,1488,1488,1488,1488,1480,1472,1472,1464,1464,1464,1464,1464,1448,1432,1432,1416,1416,1416,1416,1416,1389,1362,1362,1336,1336,1336,1336,1336,1322,1309,1309,1296,1296,1296,1296,1296,1290,1285,1285,1280,1280,1280,1280,1280,1285,1290,1290,1296,1296,1296,1296,1296,1314,1333,1333,1352,1352,1352,1352,1352,1370,1389,1389,1408,1408,1408,1408,1408,1429,1450,1450,1472,1474,1477,1477,1480,1477,1474,1474,1472,1472,1472,1472,1472,1458,1445,1445,1432,1432,1432,1432,1432,1408,1384,1384,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1282,1261,1261,1240,1240,1240,1240,1240,1218,1197,1197,1176,1176,1176,1176,1176,1168,1160,1160,1152,1152,1152,1152,1152,1154,1157,1157,1160,1160,1160,1160,1160,1184,1208,1208,1232,1232,1232,1232,1232,1253,1274,1274,1296,1296,1296,1296,1296,1298,1301,1301,1304,1304,1304,1304,1304,1293,1282,1282,1272,1272,1272,1272,1272,1256,1240,1240,1224,1224,1224,1224,1224,1194,1165,1165,1136,1136,1136,1136,1136,1117,1098,1098,1080,1080,1080,1080,1080,1056,1032,1032,1008,1008,1008,1008,1008,994,981,981,968,968,968,968,968,960,952,952,944,944,944,944,944,960,976,976,992,992,992,992,992,1024,1056,1056,1088,1088,1088,1088,1088,1106,1125,1125,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1128,1112,1112,1096,1096,1096,1096,1096,1053,1010,1010,968,968,968,968,968,928,888,888,848</WaveformData>
      </Waveforms>
  <VitalSigns Time="01/27/2017 10:33:06 PM">
        <VS>
              <Par>PVC</Par>
              <Value UOM="Bpm">0</Value>
              <AlarmLimitHigh Label="PVC HI">6</AlarmLimitHigh>
        </VS>
        <VS>
              <Par>RESP</Par>
              <Value UOM="BrMin">55</Value>
              <AlarmLimitLow Label="RESP LO">12</AlarmLimitLow>
              <AlarmLimitHigh Label="RESP HI">60</AlarmLimitHigh>
        </VS>
        <VS>
              <Par>SPO2-%</Par>
              <Value UOM="%" Q="2">99</Value>
              <AlarmLimitLow Label="SPO2-% LO">93</AlarmLimitLow>
              <AlarmLimitHigh Label="SPO2-% HI">101</AlarmLimitHigh>
        </VS>
  </VitalSigns>
<Waveforms Time="01/27/2017 10:33:06 PM">
            <WaveformData Channel="I" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
            <WaveformData Channel="II" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
            <WaveformData Channel="SPO2" UOM="Uncalib">370,741,741,1112,1117,1122,1122,1128,1128,1128,1128,1128,1138,1149,1149,1160,1160,1160,1160,1160,1176,1192,1192,1208,1208,1208,1208,1208,1216,1224,1224,1232,1232,1232,1232,1232,1242,1253,1253,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1277,1290,1290,1304,1304,1304,1304,1304,1314,1325,1325,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1338,1341,1341,1344,1344,1344,1344,1344,1338,1333,1333,1328,1328,1328,1328,1328,1325,1322,1322,1320,1320,1320,1320,1320,1330,1341,1341,1352,1352,1352,1352,1352,1373,1394,1394,1416,1416,1416,1416,1416,1434,1453,1453,1472,1472,1472,1472,1472,1485,1498,1498,1512,1512,1512,1512,1512,1520,1528,1528,1536,1536,1536,1536,1536,1533,1530,1530,1528,1528,1528,1528,1528,1514,1501,1501,1488,1488,1488,1488,1488,1464,1440,1440,1416,1416,1416,1416,1416,1397,1378,1378,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1306,1309,1309,1312,1312,1312,1312,1312,1336,1360,1360,1384,1384,1384,1384,1384,1413,1442,1442,1472,1472,1472,1472,1472,1477,1482,1482,1488,1488,1488,1488,1488,1480,1472,1472,1464,1464,1464,1464,1464,1448,1432,1432,1416,1416,1416,1416,1416,1389,1362,1362,1336,1336,1336,1336,1336,1322,1309,1309,1296,1296,1296,1296,1296,1290,1285,1285,1280,1280,1280,1280,1280,1285,1290,1290,1296,1296,1296,1296,1296,1314,1333,1333,1352,1352,1352,1352,1352,1370,1389,1389,1408,1408,1408,1408,1408,1429,1450,1450,1472,1474,1477,1477,1480,1477,1474,1474,1472,1472,1472,1472,1472,1458,1445,1445,1432,1432,1432,1432,1432,1408,1384,1384,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1282,1261,1261,1240,1240,1240,1240,1240,1218,1197,1197,1176,1176,1176,1176,1176,1168,1160,1160,1152,1152,1152,1152,1152,1154,1157,1157,1160,1160,1160,1160,1160,1184,1208,1208,1232,1232,1232,1232,1232,1253,1274,1274,1296,1296,1296,1296,1296,1298,1301,1301,1304,1304,1304,1304,1304,1293,1282,1282,1272,1272,1272,1272,1272,1256,1240,1240,1224,1224,1224,1224,1224,1194,1165,1165,1136,1136,1136,1136,1136,1117,1098,1098,1080,1080,1080,1080,1080,1056,1032,1032,1008,1008,1008,1008,1008,994,981,981,968,968,968,968,968,960,952,952,944,944,944,944,944,960,976,976,992,992,992,992,992,1024,1056,1056,1088,1088,1088,1088,1088,1106,1125,1125,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1128,1112,1112,1096,1096,1096,1096,1096,1053,1010,1010,968,968,968,968,968,928,888,888,848</WaveformData>
      </Waveforms>

这是理想的输出:

VitalSigns, 01/27/2017, 10:33:04 PM, PVC, 0    
VitalSigns, 01/27/2017, 10:33:04 PM, RESP, 68    
VitalSigns, 01/27/2017, 10:33:04 PM, SPO2, 100    
Waveforms, 01/27/2017, 10:33:04 PM, I, 347, 342, ..., end    
Waveforms, 01/27/2017, 10:33:04 PM, II, 347, 342, ..., end    
Waveforms, 01/27/2017, 10:33:04 PM, SPO2, 370, 741, ..., end    
VitalSigns, 01/27/2017, 10:33:06 PM, PVC, 0    
VitalSigns, 01/27/2017, 10:33:06 PM, RESP, 55
VitalSigns, 01/27/2017, 10:33:06 PM, SPO2, 99
...
end of file

1 个答案:

答案 0 :(得分:0)

正如其他评论者指出的那样,XML工具更适合这项任务。但它也可以在sed中完成,这里是GNU sed脚本做你想要的:

<强> script.sed

# concatenate VitalSigns or Waveforms, the date, the time into the holdspace
# on finding a Par, Value or WaveformData, further append the info to the holdspace
/<VitalSigns|Waveforms /  { s/^[^<]*<([^ ]+) Time="([^ ]+) ([^"]+)".*$/\1, \2, \3, /; h; } 
/<Par>.*<\/Par>/          { s/^[^>]+>([^<]*).*$/\1, /; H; }
/<Value.*<\/Value>/       { s/^[^>]+>([^<]*).*$/\1/; 
                            H; 
                            # now print Holdspace after removing newlines
                            x;
                            s/, \n/, /g;
                            p;
                            # now put only the first three columns back into Hold space
                            s/^(([^,]+, ){3}).*$/\1/;
                            x;
                          }

/<WaveformData /          { s/^.*WaveformData Channel="([^"]+)[^>]*>(.*)<\/WaveformData>.*$/\1, \2/; 
                            H;
                            # now print Holdspace after removing newlines
                            x;
                            s/, \n/, /g;
                            p;
                            # now put only the first three columns back into Hold space
                            s/^(([^,]+, ){3}).*$/\1/;
                            x;
                          }

像这样使用:sed -rn -f script.sed yourfile

备注

  • 我使用了问题中发布的示例。它可能对xml文件中的更改不太稳定。
  • 你看到它是一堆乱七八糟的语法,在hold和pattern空间之间来回切换,因此它不是很容易扩展。
  • 更好地使用XML工具,正如其他人已在评论中提出的那样。