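I have a column family in Cassandra with the map format shown below, and I want to process it with a Spark Dataset. I want to split the model values into two categories, premium (City and Duster) and non-premium (Alto K10, Aspire, Nano and i10), and get a final count for each: 2 for premium (the City and Duster counts) and 12 for non-premium (the Alto K10, Aspire, Nano and i10 counts).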
Code:
case class UserProfile(userdata:Map[String,Map[String,Int]])
val userprofileDataSet = spark.read.format("org.apache.spark.sql.cassandra").options(Map("table"->"userprofilesagg","keyspace" -> "KEYSPACENAME")).load().as[UserProfile]
How should I process userprofileDataSet?
Data format:
{'bodystyle': {'Compact Sedan': 1, 'Hatchback': 8, 'SUV': 1, 'Sedan': 4},
'models': {'Alto K10': 3, 'Aspire': 4, 'City': 1, 'Duster': 1, 'Nano': 3, 'i10': 2}}
Edit:
Regarding the answer from Squid: I would like to aggregate the results per user, like this:
DOICvncGKUH9xBLnW3e9jXcd2 | non-premium | [Nano, Alto K10, Aspire, i10] | 12 | premium | [City, Duster] | 2
BkkpgeAdCkYJEXsdZjiVz3bSb | non-premium | [Nano, Alto K10, Aspire, i10] | 17 | premium | [City, Duster] | 5
The case class now looks like this.
Case class:
case class UserProfile(userid:String, userdata:Map[String,Map[String,Int]])
Data:
DOICvncGKUH9xBLnW3e9jXcd2 | {'bodystyle': {'Compact Sedan': 1, 'Hatchback': 8, 'SUV': 1, 'Sedan': 4},
'models': {'Alto K10': 3, 'Aspire': 4, 'City': 1, 'Duster': 1, 'Nano': 3, 'i10': 2}}
BkkpgeAdCkYJEXsdZjiVz3bSb | {'bodystyle': {'Compact Sedan': 7, 'Hatchback': 5, 'SUV': 3, 'Sedan': 7},
'models': {'Alto K10': 1, 'Aspire': 7, 'City': 4, 'Duster': 1, 'Nano': 8, 'i10': 1}}
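Also, you asked why I mentioned bodystyle. It is so that I can apply a similar aggregation to it, treating (SUV, Sedan) as premium and the remaining body styles as non-premium.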
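Answer 0 (score: 1)
I am not sure what role bodystyle plays here. If I understood the question correctly, you need the category and its count. You can try the query below; drop the types column if you do not need it: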
-- userprofile table
CREATE TABLE `userprofile`(
  `properties` map<string,map<string,int>>);

-- Aggregate by category
select category,
       collect_set(type) as types,
       sum(value) as count
from (
  -- label each model as premium or non-premium (anything else gets a null category)
  select case when lower(type) in ('city', 'duster') then 'premium'
              when lower(type) in ('alto k10', 'aspire', 'nano', 'i10') then 'non-premium'
         end as category,
         type,
         value
  from (select properties['models'] as models from userprofile) t
  -- explode the models map into one (type, value) row per entry
  lateral view explode(models) m as type, value
) l
group by category
Output:
category | types | count
non-premium | ["Aspire","i10","Nano","Alto K10"] | 12
premium | ["City","Duster"] | 2
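For the per-user result requested in the edit, the same split can be sketched directly on the case class from the question. This is a minimal sketch, not a tested Spark job: `CategoryCount` and `PremiumSplit.summarize` are hypothetical names introduced here, and unlike the query above it assumes that every entry not listed as premium counts as non-premium. The map key is a parameter so the same function can be reused for bodystyle.

```scala
// UserProfile is the case class from the question; CategoryCount and
// PremiumSplit are hypothetical names introduced for this sketch.
case class UserProfile(userid: String, userdata: Map[String, Map[String, Int]])
case class CategoryCount(category: String, types: Seq[String], count: Int)

object PremiumSplit {
  // Splits the inner map stored under `key` into premium and non-premium
  // buckets. Names are compared case-insensitively, as in the SQL query.
  // Assumption: anything not in `premium` is treated as non-premium.
  def summarize(profile: UserProfile, key: String, premium: Set[String]): Seq[CategoryCount] = {
    val premiumLower = premium.map(_.toLowerCase)
    val entries = profile.userdata.getOrElse(key, Map.empty[String, Int]).toSeq
    val (prem, nonPrem) = entries.partition { case (name, _) =>
      premiumLower.contains(name.toLowerCase)
    }
    Seq(
      CategoryCount("non-premium", nonPrem.map(_._1).sorted, nonPrem.map(_._2).sum),
      CategoryCount("premium", prem.map(_._1).sorted, prem.map(_._2).sum)
    )
  }

  def main(args: Array[String]): Unit = {
    // First sample row from the question.
    val user = UserProfile(
      "DOICvncGKUH9xBLnW3e9jXcd2",
      Map(
        "bodystyle" -> Map("Compact Sedan" -> 1, "Hatchback" -> 8, "SUV" -> 1, "Sedan" -> 4),
        "models" -> Map("Alto K10" -> 3, "Aspire" -> 4, "City" -> 1, "Duster" -> 1, "Nano" -> 3, "i10" -> 2)
      )
    )
    summarize(user, "models", Set("City", "Duster")).foreach(println)
    summarize(user, "bodystyle", Set("SUV", "Sedan")).foreach(println)
  }
}
```

On the Dataset itself this should be usable as `userprofileDataSet.map(p => (p.userid, PremiumSplit.summarize(p, "models", Set("City", "Duster"))))`, assuming `import spark.implicits._` is in scope and the case classes are defined at top level so that Spark can derive encoders for them.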