Apache Flink 1.11: unable to use a Python UDF through SQL Function DDL in a Java Flink streaming job

Time: 2020-08-03 21:49:42

Tags: python apache-flink flink-cep pyflink flink-table-api

FLIP-106 contains an example showing how to call a user-defined Python function through SQL Function DDL in a Java batch job application:

BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);
tEnv.getConfig().getConfiguration().setString("python.files", "/home/my/test1.py");
tEnv.getConfig().getConfiguration().setString("python.client.executable", "python3");

tEnv.sqlUpdate("create temporary system function func1 as 'test1.func1' language python");
Table table = tEnv.fromDataSet(env.fromElements("1", "2", "3")).as("str").select("func1(str)");
tEnv.toDataSet(table, String.class).collect();

I have been trying to reproduce the same example in a Java streaming job application. Here is my code:

final StreamTableEnvironment fsTableEnv = StreamTableEnvironment.create(EnvironmentConfiguration.getEnv(), fsSettings);
fsTableEnv.getConfig().getConfiguration().setString("python.files", "/Users/jf/Desktop/flink/fca/test.py");
fsTableEnv.getConfig().getConfiguration().setString("python.client.executable", "/Users/jf/opt/anaconda3/bin/python");

fsTableEnv.sqlUpdate("CREATE TEMPORARY SYSTEM FUNCTION func1 AS 'test.func1' LANGUAGE PYTHON");
Table table = fsTableEnv.fromValues("1", "2", "3").as("str").select("func1(str)");
/* Missing line */

For this line in the batch job:

tEnv.toDataSet(table, String.class).collect();

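I have not found an equivalent for a streaming job.

1. Could you help me map this FLIP-106 example from batch to streaming?

Ultimately, I want to call a Python function with Flink 1.11 from a Java Flink streaming job application like this: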

final StreamTableEnvironment fsTableEnv = StreamTableEnvironment.create(EnvironmentConfiguration.getEnv(), fsSettings);
fsTableEnv.getConfig().getConfiguration().setString("python.files", "/Users/jf/Desktop/flink/fca/test.py");
fsTableEnv.getConfig().getConfiguration().setString("python.client.executable", "/Users/jf/opt/anaconda3/bin/python");

fsTableEnv.sqlUpdate("CREATE TEMPORARY SYSTEM FUNCTION func1 AS 'test.func1' LANGUAGE PYTHON");
final Table table = fsTableEnv.fromDataStream(stream_filtered.map(x->x.idsUmid)).select("func1(f0)").as("umid");
System.out.println("Result --> " + table.select($("umid")) + " --> End of Result");

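and then use the result of that UDF for further processing (not necessarily printing it to the console).

I have edited the test.py file to check whether, regardless of the unnamed table, anything at all is being executed in Python: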

from pyflink.table.types import DataTypes
from pyflink.table.udf import udf
from os import getcwd

@udf(input_types=[DataTypes.STRING()], result_type=DataTypes.STRING())
def func1(line):
    print(line)
    print(getcwd())
    with open("test.txt", "a") as myfile:
        myfile.write(line)
    return line

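Nothing is printed, the test.txt file is not created, and the value is not returned to the streaming job. So, basically, this Python function is never called.

2. What am I missing here?

Thanks to David, Wei and Xingbo for the support so far; every detail they suggested has been useful to me.

Best regards,

Jonathan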

1 answer:

Answer 0: (score: 0)

You can try the following:

final StreamTableEnvironment fsTableEnv = StreamTableEnvironment.create(EnvironmentConfiguration.getEnv(), fsSettings);
fsTableEnv.getConfig().getConfiguration().setString("python.files", "/Users/jf/Desktop/flink/fca/test.py");
fsTableEnv.getConfig().getConfiguration().setString("python.client.executable", "/Users/jf/opt/anaconda3/bin/python");

// You need to specify the python interpreter used to run the python udf on cluster.
// I assume this is a local program so it is the same as the "python.client.executable".
fsTableEnv.getConfig().getConfiguration().setString("python.executable", "/Users/jf/opt/anaconda3/bin/python");

fsTableEnv.sqlUpdate("CREATE TEMPORARY SYSTEM FUNCTION func1 AS 'test.func1' LANGUAGE PYTHON");
final Table table = fsTableEnv.fromDataStream(stream_filtered.map(x->x.idsUmid)).select("func1(f0)").as("umid");

// 'table.select($("umid"))' will not trigger job execution. You need to call the "execute()" method explicitly.
table.execute().print();
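
Separately from the Flink setup, the debugging logic in test.py can be checked on its own. The following is only a minimal sketch in plain Python with no pyflink import: func1_body and LOG_PATH are hypothetical names that do not appear in the question, and the absolute log path is an assumption, chosen because a relative path such as "test.txt" resolves against whatever working directory the Python process happens to have.

```python
import os
import tempfile

# Hypothetical absolute log location (an assumption, not from the question);
# any writable absolute path would serve the same purpose.
LOG_PATH = os.path.join(tempfile.gettempdir(), "func1_debug.txt")


def func1_body(line, log_path=LOG_PATH):
    """Plain-Python stand-in for the body of func1 (no pyflink involved)."""
    # Append one record per call so invocations can be counted afterwards.
    with open(log_path, "a") as log_file:
        log_file.write(line + "\n")
    return line


if __name__ == "__main__":
    # Call the function directly, the way a unit test would.
    for value in ["1", "2", "3"]:
        assert func1_body(value) == value
    print("debug log written to", LOG_PATH)
```

In test.py the same body would sit under the @udf decorator shown in the question. Once the job is actually executed, the log file should then appear at the absolute path, which makes it easier to tell "the function was never called" apart from "the file was written somewhere unexpected".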