I am using Google Colab and trying to use the Stanford SUTime library inside a function called from PySpark.
The function takes a row of a given RDD and uses SUTime to return a (sentence, frequency) pair.
def convert1(row):
    s = str(row.dosefrequency).lower()
    try:
        # Parse the input string; each SUTime annotation carries a
        # 'timex-value' with the normalized frequency, e.g. P1D (per day).
        i = sutime.parse(s)
        if len(i) > 0 and 'timex-value' in i[0]:
            return [s, i[0]['timex-value']]
        return []
    except Exception:
        return []
My input RDD looks like this:
rdd.take(3)
'''
[Row(practiceid=701, dosequantity='200', dosefrequency='take 2 tablet by oral route every day', count_dosequantity=716, count_dosefrequency=1, count_patientuid=306, DM Current -hychqudose='200mg', DM Expected Value='400mg'),
Row(practiceid=595, dosequantity='200', dosefrequency='take 1 tablet by oral route 2 times every day', count_dosequantity=327, count_dosefrequency=1, count_patientuid=230, DM Current -hychqudose='200mg', DM Expected Value='400mg'),
Row(practiceid=623, dosequantity='200', dosefrequency='take 1 (200MG) by oral route 2 times every day', count_dosequantity=339, count_dosefrequency=1, count_patientuid=180, DM Current -hychqudose='200mg', DM Expected Value='400mg')]
'''
This is how I call the function with flatMap:
details = rdd.flatMap(lambda row: convert1(row)).collect()
But this gives me the following error:
Traceback (most recent call last):
File "/usr/lib/python3.6/pickle.py", line 916, in save_global
__import__(module_name, level=0)
ModuleNotFoundError: No module named 'edu'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 841, in save_global
return Pickler.save_global(self, obj, name=name)
File "/usr/lib/python3.6/pickle.py", line 922, in save_global
(obj, module_name, name))
_pickle.PicklingError: Can't pickle <java class 'edu.stanford.nlp.python.SUTimeWrapper'>: it's not found as edu.stanford.nlp.python.edu.stanford.nlp.python.SUTimeWrapper
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/pickle.py", line 916, in save_global
__import__(module_name, level=0)
ModuleNotFoundError: No module named 'java'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 841, in save_global
return Pickler.save_global(self, obj, name=name)
File "/usr/lib/python3.6/pickle.py", line 922, in save_global
(obj, module_name, name))
_pickle.PicklingError: Can't pickle <java class 'java.lang.Object'>: it's not found as java.lang.java.lang.Object
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/serializers.py", line 468, in dumps
return cloudpickle.dumps(obj, pickle_protocol)
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 1097, in dumps
cp.dump(obj)
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 357, in dump
return Pickler.dump(self, obj)
File "/usr/lib/python3.6/pickle.py", line 409, in dump
self.save(obj)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.6/pickle.py", line 751, in save_tuple
save(element)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 501, in save_function
self.save_function_tuple(obj)
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 730, in save_function_tuple
save(state)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
save(v)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.6/pickle.py", line 781, in save_list
self._batch_appends(obj)
File "/usr/lib/python3.6/pickle.py", line 808, in _batch_appends
save(tmp[0])
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 496, in save_function
self.save_function_tuple(obj)
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 730, in save_function_tuple
save(state)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
save(v)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/usr/lib/python3.6/pickle.py", line 852, in _batch_setitems
save(v)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 496, in save_function
self.save_function_tuple(obj)
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 730, in save_function_tuple
save(state)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
save(v)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/usr/lib/python3.6/pickle.py", line 852, in _batch_setitems
save(v)
File "/usr/lib/python3.6/pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File "/usr/lib/python3.6/pickle.py", line 634, in save_reduce
save(state)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
save(v)
File "/usr/lib/python3.6/pickle.py", line 521, in save
self.save_reduce(obj=obj, *rv)
File "/usr/lib/python3.6/pickle.py", line 605, in save_reduce
save(cls)
File "/usr/lib/python3.6/pickle.py", line 490, in save
self.save_global(obj)
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 850, in save_global
return self.save_dynamic_class(obj)
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 662, in save_dynamic_class
obj=obj)
File "/usr/lib/python3.6/pickle.py", line 610, in save_reduce
save(args)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.6/pickle.py", line 751, in save_tuple
save(element)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.6/pickle.py", line 736, in save_tuple
save(element)
File "/usr/lib/python3.6/pickle.py", line 490, in save
self.save_global(obj)
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 850, in save_global
return self.save_dynamic_class(obj)
File "/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/cloudpickle.py", line 666, in save_dynamic_class
save(clsdict)
File "/usr/lib/python3.6/pickle.py", line 476, in save
f(self, obj) # Call unbound method with explicit self
File "/usr/lib/python3.6/pickle.py", line 821, in save_dict
self._batch_setitems(obj.items())
File "/usr/lib/python3.6/pickle.py", line 847, in _batch_setitems
save(v)
File "/usr/lib/python3.6/pickle.py", line 496, in save
rv = reduce(self.proto)
TypeError: can't pickle _jpype._JMethod objects
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
/content/spark-3.0.1-bin-hadoop3.2/python/pyspark/serializers.py in dumps(self, obj)
    476             msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
    477             print_exec(sys.stderr)
--> 478             raise pickle.PicklingError(msg)
    479
    480
PicklingError: Could not serialize object: TypeError: can't pickle _jpype._JMethod objects
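For context, the traceback suggests that flatMap forces Spark to cloudpickle convert1 together with everything it references, including the module-level sutime object, whose JPype proxies (_jpype._JMethod, the edu.stanford.nlp.python.SUTimeWrapper class) wrap live JVM handles. A minimal sketch of the same kind of failure, using a threading.Lock as a hypothetical stand-in for such a handle:

```python
import pickle
import threading

# A live OS/JVM handle (here a lock, standing in for a JPype proxy such
# as _jpype._JMethod) has no serializable state, so pickling it fails.
handle = threading.Lock()

try:
    pickle.dumps(handle)
    outcome = "pickled"
except TypeError:
    outcome = "cannot pickle"

print(outcome)  # -> cannot pickle
```

Calling the function on the driver never triggers this serialization, which would explain why the loop version behaves differently.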
However, when I call the convert1 function explicitly:

rr = rdd.take(10)
for i in range(10):
    x = convert1(rr[i])
    print(x)
The code above works fine for me; it fails only with flatMap.
Please ask if any further details are needed.
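For reference, the pattern commonly suggested for unpicklable resources in Spark is lazy per-executor initialization: build the parser inside the worker function so only picklable code ships to executors. Below is a runnable sketch under that assumption; HeavyParser is a hypothetical stand-in for the real SUTime wrapper, which real code would construct in the lazy branch instead:

```python
from collections import namedtuple

# Hypothetical stand-in for the JPype-backed SUTime wrapper; real code
# would start the JVM and build the SUTime object here instead.
class HeavyParser:
    def parse(self, s):
        # crude rule standing in for real SUTime parsing
        return [{'timex-value': 'P1D'}] if 'every day' in s else []

_parser = None  # per-process cache, filled in lazily on each executor

def get_parser():
    global _parser
    if _parser is None:
        _parser = HeavyParser()  # constructed on first use, on the worker
    return _parser

def convert1(row):
    s = str(row.dosefrequency).lower()
    try:
        i = get_parser().parse(s)
        if i and 'timex-value' in i[0]:
            return [s, i[0]['timex-value']]
    except Exception:
        pass
    return []

# convert1 no longer closes over a live JVM handle, so cloudpickle can
# ship it to executors, e.g.: rdd.flatMap(convert1).collect()
Row = namedtuple('Row', ['dosefrequency'])
print(convert1(Row('take 2 tablet by oral route every day')))
```

The key design point is that the unpicklable object is created after deserialization on each worker, rather than captured in the driver's closure.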