I have a pandas DataFrame with id_no and timestamp columns, where the timestamps are in epoch time (so both columns are integers). I want to identify which timestamps occur within a short window of any other timestamp, i.e. pick out the time-related id_nos. I need to use Spark because my dataset is very large; the input starts out as a pandas DataFrame.
The desired output would be ([list of time related id_nos], number of times the pattern has occurred).
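For concreteness, a toy example of what I mean (all values made up):

import pandas as pd

# hypothetical input: id_no/timestamp pairs, timestamps in epoch seconds
DataFrame = pd.DataFrame({'id_no': ['A1', 'B2', 'C3'],
                          'timestamp': [1435655706, 1435655707, 1435655713]})
# 'A1' and 'B2' fall within 1 second of each other, so the desired
# output here would be something like (['A1', 'B2'], 1)

Here is my attempt with Spark: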
spark_df = context.createDataFrame(DataFrame)
context.registerDataFrameAsTable(spark_df, "table1")
time_vec = DataFrame['timestamp'].unique()
timeRDD = sc.parallelize(time_vec)

def f(x):
    lo = int(x) - 1
    hi = int(x) + 1
    temp = context.sql('SELECT id_no FROM table1 WHERE timestamp BETWEEN ' + str(lo) + ' AND ' + str(hi)).collect()
    if len(temp) > 1:  # making sure it detected more than just itself
        pats = str(temp)
        Er_code = pats.replace("Row(id_no=u'", "").replace("')", "")  # cleaning up the output
        return Er_code

patterns = timeRDD.foreach(f).collect()
print patterns
The goal is to extract the ID numbers that occur at similar times. I've found a way to do this without Spark, but I need the extra parallelization now that I'm working with big data. The idea behind my code: find all the unique timestamps in the DataFrame, convert the DataFrame to a Spark DataFrame, turn the list of unique timestamps into a Spark RDD, then run the query once per distinct timestamp and return only the id_nos whose timestamps fall within the specified lo/hi range. I already have this working with a plain for loop, like this:
time_vec = DataFrame['event_times_epoch'].unique()
timeRDD = sc.parallelize(time_vec)
pat = []
for y in time_vec:
    lo = int(y) - 1
    hi = int(y) + 1
    temp = context.sql('SELECT sensor_id FROM table1 WHERE event_times_epoch BETWEEN ' + str(lo) + ' AND ' + str(hi)).collect()
    if len(temp) > 1:
        pats = str(temp)
        Er_code = pats.replace("Row(sensor_id=u'", "").replace("')", "")
        pat.append(Er_code)
#patterns = timeRDD.foreach(f).collect()
print pat
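As an aside, I suspect the string-replace cleanup could be done more directly by reading the field off each Row, e.g. (untested):

ids = [row.sensor_id for row in temp]  # Row objects support attribute access

but the cleanup isn't the real problem here.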
Iterating over the list like this is very slow, so I figured parallelizing it would speed things up, but unfortunately Spark gives me a long list of errors when I try to use foreach.
Here is the error:
patterns = timeRDD.foreach(f).collect()
File "/home/jan/Documents/spark-1.4.0/python/pyspark/rdd.py", line 721, in foreach
self.mapPartitions(processPartition).count() # Force evaluation
File "/home/jan/Documents/spark-1.4.0/python/pyspark/rdd.py", line 972, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/home/jan/Documents/spark-1.4.0/python/pyspark/rdd.py", line 963, in sum
return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/rdd.py", line 771, in reduce
vals = self.mapPartitions(func).collect()
File "/home/jan/Documents/spark-1.4.0/python/pyspark/rdd.py", line 745, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/home/jan/Documents/spark-1.4.0/python/pyspark/rdd.py", line 2351, in _jrdd
pickled_cmd, bvars, env, includes = _prepare_for_python_RDD(self.ctx, command, self)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/rdd.py", line 2271, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/serializers.py", line 427, in dumps
return cloudpickle.dumps(obj, 2)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 622, in dumps
cp.dump(obj)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 107, in dump
return Pickler.dump(self, obj)
File "/usr/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 562, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 199, in save_function
self.save_function_tuple(obj)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
save((code, closure, base_globals))
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
save(x)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 199, in save_function
self.save_function_tuple(obj)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
save((code, closure, base_globals))
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
save(x)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 199, in save_function
self.save_function_tuple(obj)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
save((code, closure, base_globals))
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
save(x)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 199, in save_function
self.save_function_tuple(obj)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
save((code, closure, base_globals))
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "/usr/lib/python2.7/pickle.py", line 636, in _batch_appends
save(tmp[0])
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 199, in save_function
self.save_function_tuple(obj)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
save((code, closure, base_globals))
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "/usr/lib/python2.7/pickle.py", line 636, in _batch_appends
save(tmp[0])
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 199, in save_function
self.save_function_tuple(obj)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 236, in save_function_tuple
save((code, closure, base_globals))
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "/usr/lib/python2.7/pickle.py", line 636, in _batch_appends
save(tmp[0])
File "/usr/lib/python2.7/pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "/home/jan/Documents/spark-1.4.0/python/pyspark/cloudpickle.py", line 518, in save_reduce
save(state)
File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/usr/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/usr/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 726, in __getattr__
raise Py4JError('Trying to call a package.')
py4j.protocol.Py4JError: Trying to call a package.
I thought the foreach command would do what I want, but apparently it doesn't. I've tried other permutations, such as .map(lambda x: f(x)), without success; every attempt raises the same 'Trying to call a package' error.
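I've also been wondering whether a self-join would sidestep the problem entirely, since then nothing running on the workers would ever touch the SQLContext. A rough, untested sketch of what I mean (same table1 as above; the +/-1 second window mirrors lo/hi in f):

related = context.sql(
    'SELECT a.id_no AS id_a, b.id_no AS id_b '
    'FROM table1 a JOIN table1 b '
    'ON (b.timestamp BETWEEN a.timestamp - 1 AND a.timestamp + 1) '
    'AND a.id_no <> b.id_no')  # pair up ids whose timestamps are within 1s
related.show()

I have no idea whether that would perform acceptably at scale, so any pointers on either approach would be appreciated.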