How do I validate that input data is correct according to the DML?

Input data:
Jorge Posada|Yankees| {(Catcher,2000),(Designated_hitter,2001)} | [games#1594,hit_by_pitch#65,grand_slams#7]
Landon Powell |Oakland| {(Catcher,2000),(First_baseman,2001)} | [on_base_percentage#0.297,games#26,home_runs#7]
Martin Prado|Atlanta| {(Second_baseman,2002),(Infielder,2003),(Left_fielder)} | [games#258,hit_by_pitch#3]

Note the last tuple of the third record, (Left_fielder): I left out the year field.

bfile = LOAD 'basketping1.txt' USING PigStorage('|') AS (name:chararray, team:chararray, pos:bag{t:tuple(point:chararray,year:int)}, bat:map[]);
dump bfile;

(Jorge Posada,Yankees,{(Catcher,2000),(Designated_hitter,2001)},[games#1594,hit_by_pitch#65,grand_slams#7])
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001)},[on_base_percentage#0.297,games#26,home_runs#7])
(Martin Prado,Atlanta,[games#258,hit_by_pitch#3])

Regards,
Sanjeeb
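Answer 0 (score: 1)
Below is a regular expression for your schema; it validates essentially every field. Run it against your own input, and let me know if you need any additional validation.
Regular expression: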
'^
([A-Za-z]+\\s+[A-Za-z]+)\\s*\\|\\s*
([A-Za-z]+)\\s*\\|\\s*
(\\{(?:\\([A-Za-z_]+,[0-9]+\\))(?:,\\([A-Za-z_]+,[0-9]+\\))*\\})\\s*\\|\\s*
(\\[(?:[A-Za-z_]+#[0-9\\.]+)(?:,[A-Za-z_]+#[0-9\\.]+)*\\])
$'
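input.txt
I have marked each input line below as valid or invalid. The annotation after the arrow is only a comment here and is not part of the file.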
Jorge Posada |Yankees| {(Catcher,2000),(Designated_hitter,2001)}|[games#1594,hit_by_pitch#65,grand_slams#7] -->Valid
Landon Powell |Oakland|{(Catcher,2000),(First_baseman,2001)}|[on_base_percentage#0.297,games#26,home_runs#7] -->Valid
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003),(Left_fielder)}|[games#258,hit_by_pitch#3] -->Invalid, year missing
Martin Prado |Atlanta| {(Second_baseman,2002)(Infielder,2003)}|[games#258,hit_by_pitch#3] -->Invalid, no comma between two tuples
Martin Prado |Atlanta| {,(Second_baseman,2002),(Infielder,2003)}|[games#258,hit_by_pitch#3] -->Invalid, comma before the first tuple
Martin Prado |Atlanta| {(Second_baseman,2002),(,2003)}|[games#258,hit_by_pitch#3] -->Invalid, position is missing
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003)}[games#258,hit_by_pitch#3] -->Invalid, delimiter | is missing
Martin Prado || {(Second_baseman,2002),(Infielder,2003)}|[games#258,hit_by_pitch#3] -->Invalid, team name is missing
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003)}|[games#,hit_by_pitch#3] -->Invalid, value is missing for key games
Landon Powell |Oakland|{(Catcher,2000)}|[on_base_percentage#0.297] --> Valid
Landon Powell |Oakland|{(Catcher,2000),(First_baseman,2001),(test,3000)}|[on_base_percentage#0.297,games#26,home_runs#7,test#1.2] -->Valid
Pig script:
A = LOAD 'input.txt' AS line;
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^([A-Za-z]+\\s+[A-Za-z]+)\\s*\\|\\s*([A-Za-z]+)\\s*\\|\\s*(\\{(?:\\([A-Za-z_]+,[0-9]+\\))(?:,\\([A-Za-z_]+,[0-9]+\\))*\\})\\s*\\|\\s*(\\[(?:[A-Za-z_]+#[0-9\\.]+)(?:,[A-Za-z_]+#[0-9\\.]+)*\\])$')) AS (name:chararray,team:chararray,pos:bag{t:(p:chararray)},bat:map[]);
DUMP B;
Output: if an input line does not match the schema, the result for that line is null, which DUMP prints as an empty tuple ().
(Jorge Posada,Yankees,{(Catcher,2000),(Designated_hitter,2001)},[games#1594,hit_by_pitch#65,grand_slams#7]) -->Valid
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001)},[on_base_percentage#0.297,games#26,home_runs#7]) -->Valid
() -->Invalid, year missing
() -->Invalid, no comma between two tuples
() -->Invalid, comma before the first tuple
() -->Invalid, position is missing
() -->Invalid, delimiter | is missing
() -->Invalid, team name is missing
() -->Invalid, value is missing for key games
(Landon Powell,Oakland,{(Catcher,2000)},[on_base_percentage#0.297]) -->Valid
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001),(test,3000)},[on_base_percentage#0.297,games#26,home_runs#7,test#1.2]) -->Valid
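
An empty tuple tells you that a line was rejected, but not which line it was. One possible variation, sketched below, keeps the raw line next to the regex result and then splits the relation, so the rejected lines can be inspected. It relies on REGEX_EXTRACT_ALL returning null when the pattern does not match. The relation names (raw, matched, valid, invalid, bad_lines, good_rows) are hypothetical, and loading with TextLoader() is my own choice rather than part of the script above; treat this as an untested sketch, not a definitive implementation.

```pig
-- Load every line as a single chararray field.
raw = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- Keep the original line beside the match result; m is null when the line does not match.
matched = FOREACH raw GENERATE
    line,
    REGEX_EXTRACT_ALL(line,'^([A-Za-z]+\\s+[A-Za-z]+)\\s*\\|\\s*([A-Za-z]+)\\s*\\|\\s*(\\{(?:\\([A-Za-z_]+,[0-9]+\\))(?:,\\([A-Za-z_]+,[0-9]+\\))*\\})\\s*\\|\\s*(\\[(?:[A-Za-z_]+#[0-9\\.]+)(?:,[A-Za-z_]+#[0-9\\.]+)*\\])$') AS m;

-- Route matching and non-matching lines into separate relations.
SPLIT matched INTO valid IF m IS NOT NULL, invalid IF m IS NULL;

-- Rejected input, shown exactly as it appeared in the file.
bad_lines = FOREACH invalid GENERATE line;
DUMP bad_lines;

-- Accepted input, expanded into the four captured groups.
good_rows = FOREACH valid GENERATE FLATTEN(m) AS (name, team, pos, bat);
DUMP good_rows;
```

With the sample input.txt, bad_lines should hold the seven lines marked Invalid and good_rows the four marked Valid, assuming the arrow annotations are not present in the actual file.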