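PySpark / DataFrame: add a new column and keep a nested list as a nested list

Time: 2017-06-27 10:47:56

Tags: python dataframe format pyspark

I have a basic question about DataFrames and adding a column that should contain a nested list. This is essentially the problem: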

from pyspark.sql import Row

b = [[['url.de'],['name']],[['url2.de'],['name2']]]

a = sc.parallelize(b)
a = a.map(lambda p: Row(URL=p[0],name=p[1]))
df = sqlContext.createDataFrame(a)

list1 = [[['a','s', 'o'],['hallo','ti']],[['a','s', 'o'],['hallo','ti']]]
c = [b[0] + [list1[0]],b[1] + [list1[1]]]

#Output looks like this:
[[['url.de'], ['name'], [['a', 's', 'o'], ['hallo', 'ti']]], 
 [['url2.de'], ['name2'], [['a', 's', 'o'], ['hallo', 'ti']]]]

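To create a new DataFrame from this output, I am trying to build a new schema:

from pyspark.sql.functions import array, lit

schema = df.withColumn('NewColumn',array(lit("10"))).schema

I then use it to create the new DataFrame: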

df = sqlContext.createDataFrame(c,schema)
df.map(lambda x: x).collect()

#Output
[Row(URL=[u'url.de'], name=[u'name'], NewColumn=[u'[a, s, o]', u'[hallo, ti]']),
 Row(URL=[u'url2.de'], name=[u'name2'], NewColumn=[u'[a, s, o]', u'[hallo, ti]'])]

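The problem is that the nested list is converted into a list with two unicode entries instead of keeping its original format.

I think this is caused by how I defined the new column: "... array(lit("10"))".

What do I have to use to keep the original format?

1 answer:

Answer 0: (score: 1)

You can inspect the DataFrame's schema directly by calling df.schema. For the schema in question, it shows the following: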

StructType(
  List(
    StructField(URL,ArrayType(StringType,true),true),
    StructField(name,ArrayType(StringType,true),true),
    StructField(NewColumn,ArrayType(StringType,false),false)
  )
)

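The NewColumn you added is an ArrayType column whose entries are all StringType. As a result, anything contained in the array is converted to a string, even if it is itself an array. If you want nested arrays (2 levels), you need to change the schema so that the NewColumn field has the type ArrayType(ArrayType(StringType,False),False). You can do this by defining the schema explicitly: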

from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField("URL", ArrayType(StringType(),True), True),
    StructField("name", ArrayType(StringType(),True), True),
    StructField("NewColumn", ArrayType(ArrayType(StringType(),False),False), False)])

Alternatively, change your code so that NewColumn is defined by nesting the array function, array(array()): df.withColumn('NewColumn',array(array(lit("10")))).schema
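To double-check how many ArrayType levels a column needs before writing the schema, the nesting depth of the data can be measured in plain Python. The following is only a rough sketch: nesting_depth and spark_array_type_string are hypothetical helper names, not part of PySpark, and the output uses the array<...> notation that Spark uses in its type strings.

```python
def nesting_depth(value):
    """Return how many list levels wrap the innermost scalars (0 for a scalar)."""
    if not isinstance(value, list):
        return 0
    # An empty list still counts as one level of nesting.
    return 1 + max((nesting_depth(item) for item in value), default=0)


def spark_array_type_string(depth, element_type="string"):
    """Build a type string such as 'array<array<string>>' for the given depth."""
    type_string = element_type
    for _ in range(depth):
        type_string = "array<" + type_string + ">"
    return type_string


# One row from the question: URL, name, and the nested new column.
row = [['url.de'], ['name'], [['a', 's', 'o'], ['hallo', 'ti']]]

column_types = [spark_array_type_string(nesting_depth(col)) for col in row]
print(column_types)
# ['array<string>', 'array<string>', 'array<array<string>>']
```

The third column comes out two levels deep, which matches the ArrayType(ArrayType(StringType())) field in the explicit schema above.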
