Parsing XML in Hadoop or Pig
I am trying to parse XML of the kind shown below in Pig or Hive:
<PowerEvent sequence="00829" elapsedrealtime="0000047391" uptime="0000047391" timestamp="2016-01-17 00:31:36.750+0100" health="Good" level="69" plugged="NotPlugged" present="Present" status="NotCharging" temperature="23.0" voltage="3731" chargercurrent="25" batterycurrent="2209" coulombcounter="4294967292" screen="Off" />
<ConnectivityEvent sequence="00830" elapsedrealtime="0000047471" uptime="0000047471" timestamp="2016-01-17 00:31:36.831+0100" connected ="true" available="true" activenetwork="WIFI" mobiledata="Off" cellular="Unknown" operatorid="22210" operatorname="vodafone IT" />
I tried the following script:
register '/home/rajpsu03/pig/piggybank.jar'
xmldata = LOAD '/user/rajpsu03/pig/test.xml' USING org.apache.pig.piggybank.storage.XMLLoader('Events') as(doc:chararray);
data = foreach xmldata
GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,'[\\s*\\S*]*<PowerEvent[\\s*\\S*]*sequence="(.*?)"[\\s*\\S*]*/>[\\s*\\S*]*<PowerEvent[\\s*\\S*]*elapsedrealtime="(.*?)"[\\s*\\S*]*/>[\\s*\\S*]*<PowerEvent[\\s*\\S*]*uptime="(.*?)"[\\s*\\S*]*/>')) AS (sequence:chararray,elapsedrealtime:chararray,uptime:chararray);
dump data;
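
One thing that stands out in the script above: the pattern asks for three separate `<PowerEvent ... />` elements in a row (one per attribute), whereas in the sample data `sequence`, `elapsedrealtime` and `uptime` all sit inside a single element. The following is a minimal, untested sketch of a pattern that reads the three attributes from one element. It assumes the attributes always appear in this order, separated by whitespace, and that the file wraps the events in an `<Events>` element, as the `XMLLoader('Events')` argument implies. The relation names `xmldata` and `power` are only illustrative.

```pig
REGISTER '/home/rajpsu03/pig/piggybank.jar';

-- Each record is one <Events>...</Events> block, returned as a single string.
xmldata = LOAD '/user/rajpsu03/pig/test.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('Events') AS (doc:chararray);

-- REGEX_EXTRACT_ALL must match the whole string, so the pattern is padded
-- with .* on both sides; (?s) lets '.' also match newlines.
-- Assumption: sequence, elapsedrealtime and uptime are adjacent and in this order.
power = FOREACH xmldata GENERATE FLATTEN(REGEX_EXTRACT_ALL(doc,
    '(?s).*?<PowerEvent\\s+sequence="([^"]*)"\\s+elapsedrealtime="([^"]*)"\\s+uptime="([^"]*)".*'))
    AS (sequence:chararray, elapsedrealtime:chararray, uptime:chararray);

DUMP power;
```

As written, this yields at most one row per `<Events>` block (the first `PowerEvent` it finds). If a block holds many events, each element would have to be split into its own record before the regex is applied.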