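Edit: Solved. I believe the problem was the multidimensional array produced by ELMo inference. I now average all the vectors and use the final averaged vector over all the words in the sentence as the output, which can be converted to a DataFrame. Next I need to make it faster, so I will try threads again.
I am trying to generate ELMo embeddings for sentences in a PySpark DataFrame using the ElmoForManyLangs pretrained model from the GitHub repository below. However, I cannot convert the resulting object to a DataFrame.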
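For the "make it faster with threads" step, here is a minimal sketch of how the per-row work could be spread over a thread pool. `embed_row` is a hypothetical placeholder for the real ELMo call plus averaging (it only computes two dummy numbers here), and `embed_all` and `max_workers` are names I made up for illustration. Whether threads actually help will depend on where the time is spent in the model; I have not benchmarked it.

```python
from concurrent.futures import ThreadPoolExecutor


def embed_row(wordlist):
    """Hypothetical stand-in for the real per-row embedding step.

    Returns a fixed-length dummy vector: [mean word length, word count].
    """
    total_chars = sum(len(w) for w in wordlist)
    return [total_chars / len(wordlist), float(len(wordlist))]


def embed_all(wordlists, max_workers=4):
    """Run embed_row over all rows in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(embed_row, wordlists))


rows = [["a", "bb"], ["ccc"]]
vectors = embed_all(rows)
print(vectors)
```

Because `map` preserves order, the resulting list can be assigned back to the pandas frame the same way `new_list` is.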
https://github.com/HIT-SCIR/ELMoForManyLangs
import sys
from pyspark.sql.functions import split
import pandas as pd
import numpy as np
from pyspark.sql.functions import trim
sys.path.append('/tmp/python-elmo/elmoManyLangs/elmoManyLangsGit/ELMoForManyLangs-master')
from elmoformanylangs import Embedder
e = Embedder('/mnt/tmp/python-elmo/elmoManyLangs/english/')
new_list = []
input = spark.read.parquet("/path/to/input/file")
words = input.withColumn("wordlist", split(trim(input["description"]), " ")).dropna().select("product_name","wordlist").limit(1)
wordsPd=words.toPandas()
for t in wordsPd.itertuples():
    # t[2] is the wordlist column; average over axis 0 of each returned array, then across arrays
    new_list.append(np.average(np.array([np.average(x,axis=0) for x in e.sents2elmo(t[2])]), axis=0).tolist())
wordsPd = wordsPd.assign(embeddings=new_list)
myDf = spark.createDataFrame(wordsPd)
myDf.registerTempTable("myDf")
wordsPd
0 my_product_name ... 0 [[0.1606223,0.09298285,-0.3494971,0.2 ... [1 rows x 3 columns]
wordsPd.dtypes
product_name object
description object
embeddings object
dtype: object
This is the error raised when creating the DataFrame.
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1068, in _infer_type
return _infer_schema(obj)
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1094, in _infer_schema
raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'object'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1068, in _infer_type
return _infer_schema(obj)
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1096, in _infer_schema
fields = [StructField(k, _infer_type(v), True) for k, v in items]
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1096, in <listcomp>
fields = [StructField(k, _infer_type(v), True) for k, v in items]
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1070, in _infer_type
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <class 'object'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1068, in _infer_type
return _infer_schema(obj)
.........
.........
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <class 'pandas.core.indexes.range.RangeIndex'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7355529425587840217.py", line 360, in <module>
exec(code, _zcUserQueryNameSpace)
...........
...........
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1070, in _infer_type
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <class 'pandas.core.series.Series'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7355529425587840217.py", line 367, in <module>
raise Exception(traceback.format_exc())
.........
.........
raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'object'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1068, in _infer_type
return _infer_schema(obj)
.........
.........
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1070, in _infer_type
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <class 'object'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1068, in _infer_type
return _infer_schema(obj)
........
........
........
........
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1070, in _infer_type
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <class 'pandas.core.indexes.range.RangeIndex'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-7355529425587840217.py", line 360, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 17, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 691, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
........
........
........
........
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1070, in _infer_type
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <class 'pandas.core.series.Series'>
Answer 0 (score: 0)
I needed to aggregate the vectors as follows, which turns the multidimensional array into a flat list.
for t in wordsPd.itertuples():
    new_list.append(np.average(np.array([np.average(x,axis=0) for x in e.sents2elmo(t[2])]), axis=0).tolist())
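To see what that aggregation does to the shapes, here is a small NumPy-only sketch. `fake_sents2elmo` is a hypothetical stand-in I wrote so the example runs without the model. It assumes, as the ELMoForManyLangs README describes, that `sents2elmo` takes a list of tokenized sentences and returns one `(seq_len, embedding_size)` array per sentence. The dimension of 4 and the `pooled_embedding` helper name are illustrative only.

```python
import numpy as np


def fake_sents2elmo(sents, dim=4):
    """Hypothetical stand-in for Embedder.sents2elmo.

    Returns one (seq_len, dim) array per input sentence, filled with
    deterministic values instead of real embeddings.
    """
    return [
        np.arange(len(sent) * dim, dtype=float).reshape(len(sent), dim)
        for sent in sents
    ]


def pooled_embedding(arrays):
    """Average over tokens in each array, then across arrays.

    Returns a flat list of Python floats rather than a nested ndarray.
    """
    per_sentence = np.array([np.average(x, axis=0) for x in arrays])
    return np.average(per_sentence, axis=0).tolist()


sents = [["red", "cotton", "shirt"], ["with", "two", "pockets"]]
raw = fake_sents2elmo(sents)
vec = pooled_embedding(raw)

print([a.shape for a in raw])  # each raw result is a 2-D array
print(vec)                     # the pooled result is one flat list
```

The point is the final `.tolist()` on a 1-D array: each cell in the `embeddings` column then holds a plain list of floats instead of a nested array, which was the change that let the DataFrame conversion go through in my case.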