I'm trying to understand DataFrame column types. Of course, a DataFrame is not a materialized object; it's just a set of instructions for Spark, to be converted to code in the future. But I imagine this list of types represents the types of objects that might be materialized inside the JVM when an action is executed.
import pyspark
import pyspark.ml.linalg
import pyspark.mllib.linalg
import pyspark.sql.types as T
import pyspark.sql.functions as F
data = [0, 3, 0, 4]
d = {}
d['DenseVector'] = pyspark.ml.linalg.DenseVector(data)
d['old_DenseVector'] = pyspark.mllib.linalg.DenseVector(data)
d['SparseVector'] = pyspark.ml.linalg.SparseVector(4, dict(enumerate(data)))
d['old_SparseVector'] = pyspark.mllib.linalg.SparseVector(4, dict(enumerate(data)))
df = spark.createDataFrame([d])
df.printSchema()
The columns for all four vector values look the same in printSchema() (or schema):
root
|-- DenseVector: vector (nullable = true)
|-- SparseVector: vector (nullable = true)
|-- old_DenseVector: vector (nullable = true)
|-- old_SparseVector: vector (nullable = true)
But when I retrieve them row by row, they turn out to be different:
for x in df.first().asDict().items():
    print(x[0], type(x[1]))
old_SparseVector <class 'pyspark.mllib.linalg.SparseVector'>
SparseVector <class 'pyspark.ml.linalg.SparseVector'>
old_DenseVector <class 'pyspark.mllib.linalg.DenseVector'>
DenseVector <class 'pyspark.ml.linalg.DenseVector'>
For writing a UDF, I'm confused about the vector type (equivalently, what VectorUDT means). How does the DataFrame know which of the four vector types it has in each vector column? Is the data in those vector columns stored in the JVM or in the Python VM? And how is VectorUDT stored in the DataFrame, if it isn't one of the official types listed here?
(I know that two of the four vector types, the mllib ones, will eventually be deprecated.)
Answer 0 (score: 7)
How is VectorUDT stored in a DataFrame?
UDT, a.k.a. user-defined type, should be a hint here. Spark provides a (now private) mechanism to store custom objects in a DataFrame. You can check my answer to How to define schema for custom type in Spark SQL? or the Spark source for details, but long story short, it simply deconstructs the objects and encodes them as Catalyst types.
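To make "deconstructs the objects and encodes them as Catalyst types" concrete, here is a minimal plain-Python sketch (no Spark required) of how a vector can be flattened into a struct-like record. The field names (type, size, indices, values) mirror the struct that the Vector UDT's sqlType describes in the Spark source; treat the exact layout as illustrative, not normative.

```python
def serialize_dense(values):
    # A dense vector needs only its values array; the type byte 1 marks "dense".
    return {"type": 1, "size": None, "indices": None, "values": list(values)}

def serialize_sparse(size, indices, values):
    # A sparse vector stores its length plus parallel index/value arrays;
    # the type byte 0 marks "sparse".
    return {"type": 0, "size": size, "indices": list(indices), "values": list(values)}

print(serialize_dense([1.0]))
print(serialize_sparse(1, [0], [1.0]))
```

Both variants fit the same struct, which is why the schema alone cannot tell you whether a given row holds a sparse or a dense vector.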
I'm confused about the meaning of the vector type
Most likely because you're looking at the wrong thing. The short description is useful, but it doesn't determine the type. Instead, you should inspect the schema. Let's create another data frame:
import pyspark.mllib.linalg as mllib
import pyspark.ml.linalg as ml
df = sc.parallelize([
(mllib.DenseVector([1, ]), ml.DenseVector([1, ])),
(mllib.SparseVector(1, [0, ], [1, ]), ml.SparseVector(1, [0, ], [1, ]))
]).toDF(["mllib_v", "ml_v"])
df.show()
## +-------------+-------------+
## | mllib_v| ml_v|
## +-------------+-------------+
## | [1.0]| [1.0]|
## |(1,[0],[1.0])|(1,[0],[1.0])|
## +-------------+-------------+
and check the data types:
{s.name: type(s.dataType) for s in df.schema}
## {'ml_v': pyspark.ml.linalg.VectorUDT,
## 'mllib_v': pyspark.mllib.linalg.VectorUDT}
As you can see, the UDT types are fully qualified, so there is no confusion here.
How does the DataFrame know which of the four vector types it has in each vector column?
As shown above, a DataFrame knows only its schema and can distinguish between ml / mllib types, but it doesn't care about the vector variant (sparse or dense). The vector variant is determined by its type field (a byte field: 0 -> sparse, 1 -> dense), but the overall schema is the same. Moreover, there is no difference in internal representation between ml and mllib.
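A tiny plain-Python sketch of that dispatch (the dict layout is assumed to mirror the UDT struct discussed above; no Spark required):

```python
def vector_variant(rec):
    # The schema is identical for both variants; only the type byte differs.
    if rec["type"] == 0:
        return "sparse"
    if rec["type"] == 1:
        return "dense"
    raise ValueError("unknown vector type byte: %r" % rec["type"])

print(vector_variant({"type": 0, "size": 1, "indices": [0], "values": [1.0]}))
print(vector_variant({"type": 1, "size": None, "indices": None, "values": [1.0]}))
```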
Is the data in those vector columns stored in the JVM or in Python?
A DataFrame is a pure JVM entity. Python interoperability is achieved through coupled UDT classes:
- a Scala UDT may define a pyUDT attribute;
- a Python UDT may define a scalaUDT attribute.
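To illustrate the shape of that coupling, here is a hypothetical skeleton modeled on the UserDefinedType interface in pyspark.sql.types. The hook names (module, scalaUDT, serialize, deserialize) are the real ones; the PointUDT class, the example.points module path, and the org.example.PointUDT Scala class are all invented for illustration, and the real class would subclass pyspark.sql.types.UserDefinedType.

```python
class PointUDT:  # in real code: class PointUDT(UserDefinedType)
    @classmethod
    def module(cls):
        # Python module where this UDT lives (hypothetical path).
        return "example.points"

    @classmethod
    def scalaUDT(cls):
        # Fully qualified name of the paired Scala UDT class (hypothetical).
        return "org.example.PointUDT"

    def serialize(self, obj):
        # Deconstruct the Python object into plain SQL-friendly values.
        return (float(obj[0]), float(obj[1]))

    def deserialize(self, datum):
        # Rebuild the Python object from the stored values.
        return (datum[0], datum[1])
```

The Scala counterpart would point back with a pyUDT attribute naming the Python class, which is how a JVM-side row finds its Python representation when it crosses into the Python VM.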