Text parsing: converting blocks of equations into Python statements

Time: 2014-04-01 19:07:33

Tags: python text-parsing

I have a large set of SAS equation statements and would like to use Python to convert them into Python statements. They look like this:

From SAS:

select;
    when(X_1 <= 6.7278       )    V_1    =-0.0594    ;
    when(X_1 <= 19.5338      )    V_1    =0.0604     ;
    when(X_1 <= 45.1458      )    V_1    =0.1755     ;
    when(X_1 <= 83.5638      )    V_1    =0.2867     ;
    when(X_1 <= 203.0878     )    V_1    =0.395      ;
    when(X_1  > 203.0878     )    V_1    =0.5011     ;
end;
label V_1 ="X_1 ";
select;
    when(X_2 <= 0.0836              )    V_2    =0.0562     ;
    when(X_2 <= 0.1826              )    V_2    =0.07       ;
    when(X_2 <= 0.2486              )    V_2    =0.0836     ;
    when(X_2 <= 0.3146              )    V_2    =0.0969     ;
    when(X_2 <= 0.3806              )    V_2    =0.1095     ;
    when(X_2 <= 0.4466              )    V_2    =0.1212     ;
    when(X_2 <= 0.5126              )    V_2    =0.132      ;
    when(X_2 <= 0.5786              )    V_2    =0.1419     ;
    when(X_2 <= 0.6446              )    V_2    =0.1511     ;
    when(X_2 <= 0.7106              )    V_2    =0.1596     ;
    when(X_2 <= 0.8526              )    V_2    =0.1679     ;
    when(X_2  > 0.8526              )    V_2    =0.176      ;
end;
label V_2 ="X_2 ";
...
...
...

To Python:

if X_1 <= 6.7278:
    V_1    =-0.0594
elif X_1 <= 19.5338:
    V_1    =0.0604
elif X_1 <= 45.1458:
    V_1    =0.1755
elif X_1 <= 83.5638:
    V_1    =0.2867
elif X_1 <= 203.0878:
    V_1    =0.395
else: 
    V_1    =0.5011

if X_2 <= 0.0836:
    ....

I have no idea where to start, such as which package to use or anything else. Any help would be greatly appreciated!

1 answer:

Answer 0: (score: 3)

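If the input is very consistent (as shown), you could probably get by with re.

For anything more complicated, you may want to look at a more powerful parser such as pyparsing.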


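Edit: here is a very simple finite-state-machine parser using regular expressions. It handles blank lines, un-nested select; .. end; statements, and initial/subsequent when clauses. I do not handle label because I am not sure what those lines do (rename the V variable to X?).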

import re

class SasTranslator:
    def __init__(self):
        # modes:
        #   0   not in START..END
        #   1   in START..END, no CASE seen yet
        #   2   in START..END, CASE already found
        self.mode   = 0
        self.offset = -1   # input line #

    def handle_blank(self, match):
        return ""

    def handle_start(self, match):
        if self.mode == 0:
            self.mode = 1
            return None
        else:
            raise ValueError("Found 'select;' in select block, line {}".format(self.offset))

    def handle_end(self, match):
        if self.mode == 0:
            raise ValueError("Found 'end;' with no opening 'select;', line {}".format(self.offset))
        elif self.mode == 1:
            raise ValueError("Found empty 'select;' .. 'end;', line {}".format(self.offset))
        elif self.mode == 2:
            self.mode = 0
            return None

    def handle_case(self, match):
        if self.mode == 0:
            raise ValueError("Found 'when' clause outside 'select;' .. 'end;', line {}".format(self.offset))
        elif self.mode == 1:
            test = "if"
            self.mode = 2
            # note: code continues after if..else block
        elif self.mode == 2:
            test = "elif"
            # note: code continues after if..else block

        test_var, op, test_val, assign_var, assign_val = match.groups()
        return (
            "{test} {test_var} {op} {test_val}:\n"
            "    {assign_var} = {assign_val}".format(
                test       = test,
                test_var   = test_var,
                op         = op,
                test_val   = test_val,
                assign_var = assign_var,
                assign_val = assign_val
            )
        )

    #
    # Build a dispatch table for the handlers
    #

    # Raw strings keep backslashes such as \s and \d intact for the regex engine
    BLANK    = re.compile(r"\s*$")
    START    = re.compile(r"select;\s*$")
    END      = re.compile(r"end;\s*$")
    CASE     = re.compile(r"\s*when\((\w+)\s*([<>=]+)\s*([\d.-]+)\s*\)\s*(\w+)\s*=\s*([\d.-]+)\s*;\s*$")

    dispatch_table = [
        (BLANK, handle_blank),
        (START, handle_start),
        (END,   handle_end),
        (CASE,  handle_case)
    ]

    def __call__(self, line):
        """
        Translate a single line of input
        """
        self.offset += 1

        for test,handler in SasTranslator.dispatch_table:
            match = test.match(line)
            if match is not None:
                return handler(self, match)

        # nothing matched!
        return None

def main():
    with open("my_file.sas") as inf:
        trans = SasTranslator()
        for line in inf:
            result = trans(line)
            if result is not None:
                print(result)
            else:
                print("***unknown*** {}".format(line.rstrip()))

if __name__=="__main__":
    main()

Run against your sample input, it produces:

if X_1 <= 6.7278:
    V_1 = -0.0594
elif X_1 <= 19.5338:
    V_1 = 0.0604
elif X_1 <= 45.1458:
    V_1 = 0.1755
elif X_1 <= 83.5638:
    V_1 = 0.2867
elif X_1 <= 203.0878:
    V_1 = 0.395
elif X_1 > 203.0878:
    V_1 = 0.5011
***unknown*** label V_1 ="X_1 ";
if X_2 <= 0.0836:
    V_2 = 0.0562
elif X_2 <= 0.1826:
    V_2 = 0.07
elif X_2 <= 0.2486:
    V_2 = 0.0836
elif X_2 <= 0.3146:
    V_2 = 0.0969
elif X_2 <= 0.3806:
    V_2 = 0.1095
elif X_2 <= 0.4466:
    V_2 = 0.1212
elif X_2 <= 0.5126:
    V_2 = 0.132
elif X_2 <= 0.5786:
    V_2 = 0.1419
elif X_2 <= 0.6446:
    V_2 = 0.1511
elif X_2 <= 0.7106:
    V_2 = 0.1596
elif X_2 <= 0.8526:
    V_2 = 0.1679
elif X_2 > 0.8526:
    V_2 = 0.176
***unknown*** label V_2 ="X_2 ";
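
The `***unknown***` lines in this output come from the SAS label statements. In SAS, label attaches a descriptive label to a variable and does not rename it, so one option is to carry these lines over as Python comments. The following is only a minimal sketch under that assumption; `LABEL` and `translate_label` are hypothetical names and not part of the translator shown earlier.

```python
import re

# Hypothetical pattern for lines such as: label V_1 ="X_1 ";
LABEL = re.compile(r'\s*label\s+(\w+)\s*=\s*"([^"]*)"\s*;\s*$')

def translate_label(line):
    """Return a Python comment for a SAS label line, or None if it does not match."""
    match = LABEL.match(line)
    if match is None:
        return None
    var, text = match.groups()
    # Trailing blanks inside the quoted label are dropped for readability
    return "# {}: label {!r}".format(var, text.strip())

print(translate_label('label V_1 ="X_1 ";'))
```

A handler like this could be added to the dispatch table as one more (pattern, handler) pair if you want the labels preserved in the generated code.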

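Depending on how often you use this, it might be worthwhile to write a binary-lookup function using bisect and convert the select; .. end; blocks to that form (though you would want to be very careful that the comparison operators are what you expect!), something like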

V_1 = index_into(
    X_1,
    [ 6.7278, 19.5338, 45.1458, 83.5638, 203.0878        ],
    [-0.0594,  0.0604,  0.1755,  0.2867,   0.395,  0.5011]
)

This could run noticeably faster (especially as the number of options grows) and be much easier to understand and maintain.
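
The answer does not define index_into, so here is one possible sketch built on the standard-library bisect module. It assumes the breakpoints are sorted in ascending order, that every when clause except the last uses `<=`, and that the last one is the catch-all `>` case, as in the sample input. Under those assumptions bisect_left gives the matching behaviour, because a value equal to a breakpoint falls into the bucket at or below it.

```python
from bisect import bisect_left

def index_into(x, breakpoints, values):
    """
    Return values[i] for the first i such that x <= breakpoints[i],
    or values[-1] if x is greater than every breakpoint.

    Assumes breakpoints is sorted ascending and that values has
    exactly one more entry than breakpoints.
    """
    if len(values) != len(breakpoints) + 1:
        raise ValueError("values must have one more entry than breakpoints")
    # bisect_left returns the number of breakpoints strictly less than x,
    # which is the index of the first breakpoint that is >= x
    return values[bisect_left(breakpoints, x)]

X_1 = 19.5338
V_1 = index_into(
    X_1,
    [ 6.7278, 19.5338, 45.1458, 83.5638, 203.0878        ],
    [-0.0594,  0.0604,  0.1755,  0.2867,   0.395,  0.5011]
)
print(V_1)
```

If a block used `<` instead of `<=`, bisect_right would be the matching choice, which is why the comparison operators need to be checked before converting.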