PySpark filtering gives inconsistent behavior

Asked: 2019-02-17 21:28:50

Tags: pyspark

So I have a dataset that I apply a few transformations to, and the last step is to filter out the rows that have a 0 in a column named frequency. The code that does the filtering is very simple:

def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
    # Fetch the DataFrame stored under `name`, keep only the rows whose
    # frequency column meets the threshold, and store it back in place.
    df = getattr(self, name)
    df = df.where(df[frequency_col] >= threshold)
    setattr(self, name, df)
    return self
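
For context, this method lives on a small pipeline-style wrapper that stores each DataFrame as an attribute and chains steps by returning self. A minimal runnable sketch of how it gets invoked (the Pipeline class name, the events attribute, and the sample rows here are illustrative, not the real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

class Pipeline:
    # Illustrative wrapper; the real class has more transformation steps.
    # filter_rows is the method shown above.
    def __init__(self, df):
        self.events = df

    def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
        df = getattr(self, name)
        df = df.where(df[frequency_col] >= threshold)
        setattr(self, name, df)
        return self

df = spark.createDataFrame([('u1', 'd1', 0), ('u2', 'd1', 12)],
                           ['user', 'domain', 'frequency'])
Pipeline(df).filter_rows(name='events', threshold=1).events.show()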

The problem is some very strange behavior: if I set a high threshold like 10, it works correctly and filters out every row below 10. But if I set the threshold to 1, the 0s are not removed! Here is a sample of the former case (threshold=10):

{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}

And now here is some of the data with threshold=1:

{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}

I should note that I perform some other transformations before this step, and I have noticed strange behavior from Spark in the past, where doing something very simple after a join or union produces very odd results, and in the end the only solution was to write the data out, read it back in, and do the operation in a completely separate script. I am hoping there is a better solution than that!
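
For reference, the write-out-and-read-back workaround mentioned above looks roughly like this (the path and format are illustrative); materializing to disk and re-reading gives the filter a fresh lineage to work from:

# Illustrative checkpoint-to-disk workaround: df and spark are the
# pipeline's DataFrame and session from earlier; the path is made up.
df.write.mode('overwrite').parquet('/tmp/frequency_checkpoint')
df = spark.read.parquet('/tmp/frequency_checkpoint')
df = df.where(df['frequency'] >= 1)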

0 Answers:
