How does a Spark DataFrame distinguish between different VectorUDT objects?

Date: 2016-07-31 02:33:18

Tags: apache-spark dataframe pyspark apache-spark-mllib apache-spark-ml

I am trying to understand DataFrame column types. Of course, a DataFrame is not a materialized object; it is just a set of instructions for Spark, to be converted into code in the future. But I imagine this list of types represents the types of objects that might be materialized inside the JVM when an action is executed.

import pyspark
import pyspark.ml.linalg      # submodules are not loaded by `import pyspark` alone
import pyspark.mllib.linalg
import pyspark.sql.types as T
import pyspark.sql.functions as F
data = [0, 3, 0, 4]
d = {}
d['DenseVector'] = pyspark.ml.linalg.DenseVector(data)
d['old_DenseVector'] = pyspark.mllib.linalg.DenseVector(data)
d['SparseVector'] = pyspark.ml.linalg.SparseVector(4, dict(enumerate(data)))
d['old_SparseVector'] = pyspark.mllib.linalg.SparseVector(4, dict(enumerate(data)))
df = spark.createDataFrame([d])  # assumes an active SparkSession `spark`
df.printSchema()

The columns for the four vector values look identical in printSchema() (or schema):

root
 |-- DenseVector: vector (nullable = true)
 |-- SparseVector: vector (nullable = true)
 |-- old_DenseVector: vector (nullable = true)
 |-- old_SparseVector: vector (nullable = true)

But when I retrieve them row by row, they turn out to be different:

for x in df.first().asDict().items():
    print(x[0], type(x[1]))
old_SparseVector <class 'pyspark.mllib.linalg.SparseVector'>
SparseVector <class 'pyspark.ml.linalg.SparseVector'>
old_DenseVector <class 'pyspark.mllib.linalg.DenseVector'>
DenseVector <class 'pyspark.ml.linalg.DenseVector'>

I am confused about the vector types (equivalent to the meaning of VectorUDT) when it comes to writing a UDF. How does the DataFrame know which of the four vector types it has in each vector column? Is the data in these vector columns stored in the JVM or in the Python VM? And how does VectorUDT get stored in the DataFrame if it isn't one of the official types listed here?

(I know that two of the four vector types above will eventually be deprecated.)

1 Answer:

Answer 0 (score: 7)

  

How would VectorUDT be stored in the DataFrame?

UDT, a.k.a. user defined type, should be a hint here. Spark provides a (currently private) mechanism to store custom objects in a DataFrame. You can check my answer to How to define schema for custom type in Spark SQL? or the Spark source for details, but in short, it just deconstructs the objects and encodes them as Catalyst types.

  

I am confused about the meaning of the vector type

Most likely because you are looking at the wrong thing. The short description is useful, but it does not determine the type. Instead, you should check the schema. Let's create another data frame:

import pyspark.mllib.linalg as mllib
import pyspark.ml.linalg as ml

df = sc.parallelize([
    (mllib.DenseVector([1, ]), ml.DenseVector([1, ])),
    (mllib.SparseVector(1, [0, ], [1, ]), ml.SparseVector(1, [0, ], [1, ]))
]).toDF(["mllib_v", "ml_v"])

df.show()

## +-------------+-------------+
## |      mllib_v|         ml_v|
## +-------------+-------------+
## |        [1.0]|        [1.0]|
## |(1,[0],[1.0])|(1,[0],[1.0])|
## +-------------+-------------+

and check the data types:

{s.name: type(s.dataType) for s in df.schema}

## {'ml_v': pyspark.ml.linalg.VectorUDT,
##  'mllib_v': pyspark.mllib.linalg.VectorUDT}

As you can see, the UDT types are fully qualified, so there is no confusion here.

  

How does the DataFrame know which of the four vector types it has in each vector column?

As shown above, a DataFrame knows only its schema and can distinguish between ml / mllib types, but it doesn't care about the vector variant (sparse or dense).

The vector variant is determined by its type field (a byte field: 0 -> sparse, 1 -> dense), but the overall schema is the same. Moreover, there is no difference in internal representation between ml and mllib.

  

Is the data in these vector columns stored in the JVM or in Python?

A DataFrame is a pure JVM entity. Python interoperability is achieved through coupled UDT classes:

  • A Scala UDT may define a pyUDT attribute.
  • A Python UDT may define a scalaUDT attribute.