Cannot find col function in pyspark

Date: 2016-10-20 19:38:58

Tags: python apache-spark pyspark apache-spark-sql pyspark-sql

In pyspark 1.6.2, I can import the col function with

from pyspark.sql.functions import col

but when I look it up in the GitHub source code I find no col function in the functions.py file. How can Python import a function that doesn't exist?

6 Answers:

Answer 0 (score: 31):

It exists. It just isn't explicitly defined. Functions exported from pyspark.sql.functions are thin wrappers around JVM code and, with a few exceptions which require special treatment, are generated automatically using helper methods.

If you check the source carefully you'll find col listed among the other _functions. That dictionary is then iterated over, and _create_function is used to generate the wrappers. Each generated function is assigned directly to its corresponding name in globals().

Finally __all__, which defines a list of items exported from the module, just exports all globals excluding ones contained in the blacklist.
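
A rough sketch of that mechanism is shown below. The names _functions and _create_function do appear in the source, but the bodies here are condensed and differ between pyspark versions, so treat this as an illustration rather than a copy of functions.py:

from pyspark import SparkContext
from pyspark.sql.column import Column

# Mapping of function name -> docstring; col is just one entry among many.
_functions = {
    "col": "Returns a Column based on the given column name.",
    "lit": "Creates a Column of literal value.",
}

def _create_function(name, doc=""):
    # Build a thin wrapper that forwards to the JVM function of the same name.
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

# Assign each generated wrapper directly into the module globals, which is why
# `from pyspark.sql.functions import col` succeeds even though no `def col` exists.
for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)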

If this mechanism is still not clear, you can create a toy example:

  • Create a Python module called foo.py with the following content:

    # Creates a function assigned to the name foo
    globals()["foo"] = lambda x: "foo {0}".format(x)
    
    # Exports all entries from globals which start with foo
    __all__ = [x for x in globals() if x.startswith("foo")]
    
  • Place it somewhere on the Python path (for example in the working directory).

  • Import foo:

    from foo import foo
    
    foo(1)
    

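If everything is wired up correctly, foo(1) returns 'foo 1' even though no def foo statement appears anywhere in foo.py, which is exactly the situation with col in pyspark.sql.functions.
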
An undesired side effect of such a metaprogramming approach is that the functions it defines might not be recognized by tools that rely purely on static code analysis. This is not a critical issue and can be safely ignored during development.

Depending on the IDE, installing type annotations might resolve the problem.

Answer 1 (score: 11):

As explained above, pyspark generates some of its functions on the fly, which makes most IDEs unable to detect them properly. However, there is a Python package, pyspark-stubs, that includes a collection of stub files to improve type hints, static error detection, code completion, and so on. It can be installed with

pip install pyspark-stubs==x.x.x

(where x.x.x has to be replaced with your pyspark version, for example 2.3.0 in my case), and col and the other functions will then be detected without any code changes in most IDEs (PyCharm, Visual Studio Code, Atom, Jupyter Notebook, ...).
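
Under the hood the stubs are plain .pyi files that spell out explicit signatures for the dynamically generated wrappers so that static analyzers can see them. A rough sketch of what such a stub entry might look like (the actual pyspark-stubs declarations are more precise about the accepted argument types):

# functions.pyi (sketch, not the actual stub file)
from pyspark.sql.column import Column

def col(col: str) -> Column: ...
def lit(col) -> Column: ...
def upper(col) -> Column: ...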

Answer 2 (score: 5):

As of VS Code 1.26.1, this can be solved by modifying the python.linting.pylintArgs setting:

"python.linting.pylintArgs": [
        "--generated-members=pyspark.*",
        "--extension-pkg-whitelist=pyspark",
        "--ignored-modules=pyspark.sql.functions"
    ]

The issue is explained on GitHub: https://github.com/DonJayamanne/pythonVSCode/issues/1418#issuecomment-411506443

Answer 3 (score: 4):

In PyCharm, the col function and other functions are marked as "not found".

One workaround is to import functions and call the col function from there.

For example:

from pyspark.sql import functions as F
df.select(F.col("my_column"))

Answer 4 (score: 0):

I ran into a similar problem while trying to set up a PySpark development environment with Eclipse and PyDev. PySpark uses a dynamic namespace. To get it to work, I needed to add PySpark to "forced builtins" as shown below.

Forced builtins

Answer 5 (score: 0):

As pointed out by @zero323, several Spark functions have wrappers generated at runtime by adding them to the globals dict and then to __all__. As pointed out by @vincent-claes, referencing the functions through the functions path (as F, or something else; I prefer something more descriptive) keeps the import itself from showing an error in PyCharm. However, as @nexaspx mentioned in a comment on that answer, this just shifts the warning to the usage line(s). As mentioned by @thomas, pyspark-stubs can be installed to improve the situation.

However, if for some reason adding that package is not an option (maybe you are using a Docker image for your environment and can't add it to the image right now), or it isn't working, here is my workaround: first, add an import with an alias for only the generated wrapper, then disable the inspection for just that import. This way all the usages can still be checked for other functions in the same statement, the warning points are reduced to one, and that single warning can then be ignored.

from pyspark.sql import functions as pyspark_functions
# noinspection PyUnresolvedReferences
from pyspark.sql.functions import col as pyspark_col
# ...
pyspark_functions.round(...)
pyspark_col(...)

If you have several such imports, group them so that only one noinspection is needed:

# noinspection PyUnresolvedReferences
from pyspark.sql.functions import (
    col as pyspark_col, count as pyspark_count, expr as pyspark_expr,
    floor as pyspark_floor, log1p as pyspark_log1p, upper as pyspark_upper,
)

(This is how PyCharm formats it when I use the Reformat File command.)

While we are on the subject of how to import pyspark.sql.functions, I also recommend not importing the individual functions from pyspark.sql.functions, to avoid shadowing Python built-ins, which can lead to obscure errors, as @SARose points out.
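
A minimal illustration of the shadowing problem (what exactly breaks depends on how the shadowed name is used afterwards):

from pyspark.sql.functions import sum  # shadows the built-in sum()

# sum is now the Spark aggregate, which expects a column name or Column,
# so ordinary Python code fails in a non-obvious way:
total = sum([1, 2, 3])  # no longer returns 6; it tries to build a Column expression and errors out

# Importing the module under an alias sidesteps the issue:
from pyspark.sql import functions as F
# F.sum("my_column") and the built-in sum() can then coexist.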