How to mock an internal call to a pyspark SQL function

Asked: 2019-11-01 22:02:13

Tags: python apache-spark pyspark mocking python-unittest

Given the following pyspark code:

import pyspark.sql.functions as F

null_or_unknown_count = df.sample(0.01).filter(
    F.col('env').isNull() | (F.col('env') == 'Unknown')
).count()

In the test code, the DataFrame is mocked, so I am trying to set the return_value of this call like so:

import pyspark
from unittest import mock
from unittest.mock import ANY

@mock.patch('pyspark.sql.DataFrame', spec=pyspark.sql.DataFrame)
def test_null_or_unknown_validation(self, mock_df):
    mock_df.sample(0.01).filter(ANY).count.return_value = 250

But this fails with the following:

File "/usr/local/lib/python3.7/site-packages/pyspark/sql/functions.py", line 44, in _
  jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'

I also tried mock_df.sample().filter().count.return_value = 250, which gives the same error.
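
Judging by the traceback, the failure comes from F.col itself rather than from the mocked DataFrame: patching pyspark.sql.DataFrame does not stop F.col('env') from executing, and with no SparkSession running, the sc inside pyspark's function wrapper is None. A minimal sketch of the same failure (assuming no SparkSession has been started in the test process):

import pyspark.sql.functions as F

# With no active SparkContext, building the column expression already
# fails here, before any DataFrame method is reached:
F.col('env')  # AttributeError: 'NoneType' object has no attribute '_jvm'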

How do I properly mock the filter, i.e. F.col('env').isNull() | (F.col('env') == 'Unknown')?

1 Answer:

Answer 0 (score: 2)

Thanks to a smart colleague of mine at work, here is the answer. We have to mock pyspark.sql.functions.col and then set a return_value:

@mock.patch('pyspark.sql.functions.col')
@mock.patch('pyspark.sql.DataFrame', spec=pyspark.sql.DataFrame)
def test_null_or_unknown_validation(self, mock_df, mock_functions):
    # Decorators apply bottom-up: the DataFrame patch supplies mock_df,
    # the col patch supplies mock_functions.
    mock_functions.isNull.return_value = True  # (or False also works)
    mock_df.sample(0.01).filter(ANY).count.return_value = 250

Using mock_df.sample().filter().count.return_value = 250 also works.
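
For completeness, here is a fuller, self-contained sketch of how the whole test might look. The wrapper function count_null_or_unknown and the test class name are hypothetical, and the chained return_value style is equivalent to calling the mocks as above:

import unittest
from unittest import mock

import pyspark
import pyspark.sql.functions as F


def count_null_or_unknown(df):
    # Hypothetical wrapper around the snippet from the question.
    return df.sample(0.01).filter(
        F.col('env').isNull() | (F.col('env') == 'Unknown')
    ).count()


class NullOrUnknownTest(unittest.TestCase):

    @mock.patch('pyspark.sql.functions.col')
    @mock.patch('pyspark.sql.DataFrame', spec=pyspark.sql.DataFrame)
    def test_null_or_unknown_validation(self, mock_df, mock_col):
        # Patching col keeps the filter expression from touching the JVM;
        # the mocked DataFrame then supplies the counted value.
        mock_df.sample.return_value.filter.return_value.count.return_value = 250
        self.assertEqual(count_null_or_unknown(mock_df), 250)


if __name__ == '__main__':
    unittest.main()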