Cannot find col function in pyspark

Date: 2016-10-20 19:38:58

Tags: python apache-spark pyspark apache-spark-sql pyspark-sql

In pyspark 1.6.2, I can import the col function with

from pyspark.sql.functions import col

but when I look it up in the GitHub source code I find no col function in the functions.py file. How can Python import a function that doesn't exist?

6 Answers:

Answer 0 (score: 31):

It exists. It just isn't explicitly defined. Functions exported from pyspark.sql.functions are thin wrappers around JVM code and, with a few exceptions which require special treatment, are generated automatically using helper methods.

If you check the source carefully you'll find col listed among the other _functions. That dictionary is then iterated over, and _create_function is used to generate the wrappers. Each generated function is assigned directly to its corresponding name in globals().

Finally __all__, which defines a list of items exported from the module, just exports all globals excluding ones contained in the blacklist.
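
A rough sketch of that mechanism is shown below. The names _functions and _create_function do appear in the source, but the bodies here are condensed and differ between pyspark versions, so treat this as an illustration rather than a copy of functions.py:

from pyspark import SparkContext
from pyspark.sql.column import Column

# Mapping of function name -> docstring; col is just one entry among many.
_functions = {
    "col": "Returns a Column based on the given column name.",
    "lit": "Creates a Column of literal value.",
}

def _create_function(name, doc=""):
    # Build a thin wrapper that forwards to the JVM function of the same name.
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

# Assign each generated wrapper directly into the module globals, which is why
# `from pyspark.sql.functions import col` succeeds even though no `def col` exists.
for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)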

If this mechanism is still not clear, you can create a toy example:

  • Create a Python module called foo.py with the following content:

    # Creates a function assigned to the name foo
    globals()["foo"] = lambda x: "foo {0}".format(x)
    
    # Exports all entries from globals which start with foo
    __all__ = [x for x in globals() if x.startswith("foo")]
    
  • Place it somewhere on the Python path (for example in the working directory).

  • Import foo:

    from foo import foo
    
    foo(1)
    

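If everything is wired up correctly, foo(1) returns 'foo 1' even though no def foo statement appears anywhere in foo.py, which is exactly the situation with col in pyspark.sql.functions.
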
An undesired side effect of such a metaprogramming approach is that the functions it defines might not be recognized by tools that rely purely on static code analysis. This is not a critical issue and can be safely ignored during development.

Depending on the IDE, installing type annotations might resolve the problem.

Answer 1 (score: 11):

As explained above, pyspark generates some of its functions on the fly, which makes most IDEs unable to detect them properly. However, there is a Python package, pyspark-stubs, that includes a collection of stub files to improve type hints, static error detection, code completion, and so on. It can be installed with

pip install pyspark-stubs==x.x.x

(where x.x.x has to be replaced with your pyspark version, for example 2.3.0 in my case), and col and the other functions will then be detected without any code changes in most IDEs (PyCharm, Visual Studio Code, Atom, Jupyter Notebook, ...).
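
Under the hood the stubs are plain .pyi files that spell out explicit signatures for the dynamically generated wrappers so that static analyzers can see them. A rough sketch of what such a stub entry might look like (the actual pyspark-stubs declarations are more precise about the accepted argument types):

# functions.pyi (sketch, not the actual stub file)
from pyspark.sql.column import Column

def col(col: str) -> Column: ...
def lit(col) -> Column: ...
def upper(col) -> Column: ...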

Answer 2 (score: 5):

As of VS Code 1.26.1, this can be solved by modifying the python.linting.pylintArgs setting:

"python.linting.pylintArgs": [
        "--generated-members=pyspark.*",
        "--extension-pkg-whitelist=pyspark",
        "--ignored-modules=pyspark.sql.functions"
    ]

The issue is explained on GitHub: https://github.com/DonJayamanne/pythonVSCode/issues/1418#issuecomment-411506443

Answer 3 (score: 4):

In PyCharm, the col function and other functions are marked as "not found".

One workaround is to import functions and call the col function from there.

For example:

from pyspark.sql import functions as F
df.select(F.col("my_column"))

Answer 4 (score: 0):

I ran into a similar problem while trying to set up a PySpark development environment with Eclipse and PyDev. PySpark uses a dynamic namespace. To get it to work, I needed to add PySpark to "forced builtins" as shown below.

Forced builtins

Answer 5 (score: 0):

As pointed out by @zero323, several Spark functions have wrappers generated at runtime by adding them to the globals dict and then to __all__. As pointed out by @vincent-claes, referencing the functions through the functions path (as F, or something else; I prefer something more descriptive) keeps the import itself from showing an error in PyCharm. However, as @nexaspx mentioned in a comment on that answer, this just shifts the warning to the usage line(s). As mentioned by @thomas, pyspark-stubs can be installed to improve the situation.

However, if for some reason adding that package is not an option (maybe you are using a Docker image for your environment and can't add it to the image right now), or it isn't working, here is my workaround: first, add an import with an alias for only the generated wrapper, then disable the inspection for just that import. This way all the usages can still be checked for other functions in the same statement, the warning points are reduced to one, and that single warning can then be ignored.

from pyspark.sql import functions as pyspark_functions
# noinspection PyUnresolvedReferences
from pyspark.sql.functions import col as pyspark_col
# ...
pyspark_functions.round(...)
pyspark_col(...)

If you have several such imports, group them so that only one noinspection is needed:

# noinspection PyUnresolvedReferences
from pyspark.sql.functions import (
    col as pyspark_col, count as pyspark_count, expr as pyspark_expr,
    floor as pyspark_floor, log1p as pyspark_log1p, upper as pyspark_upper,
)

(This is how PyCharm formats it when I use the Reformat File command.)

While we are on the subject of how to import pyspark.sql.functions, I also recommend not importing the individual functions from pyspark.sql.functions, to avoid shadowing Python built-ins, which can lead to obscure errors, as @SARose points out.
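
A minimal illustration of the shadowing problem (what exactly breaks depends on how the shadowed name is used afterwards):

from pyspark.sql.functions import sum  # shadows the built-in sum()

# sum is now the Spark aggregate, which expects a column name or Column,
# so ordinary Python code fails in a non-obvious way:
total = sum([1, 2, 3])  # no longer returns 6; it tries to build a Column expression and errors out

# Importing the module under an alias sidesteps the issue:
from pyspark.sql import functions as F
# F.sum("my_column") and the built-in sum() can then coexist.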