Merge PySpark DataFrame rows that share the same ID

Time: 2019-01-10 19:31:42

Tags: apache-spark group-by pyspark apache-spark-sql

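I want to flatten an HBase dataset. Each individual has two types of data stored in other tables: mail and phone. After joining them on rel_id = data_id, I end up with two separate records per individual.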

    |individual_id|rel_id|data_id|       mail|phone|     role|
    ----------------------------------------------------------
    |            1|   100|    100|a@gmail.com| null|secondary|
    |            1|   200|    200|       null|  123|  primary|
    |            2|   300|    300|       null|  345|secondary|
    |            2|   400|    400|b@gmail.com| null|  primary|
    ----------------------------------------------------------

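Is there a way to merge the mail and phone columns for rows with the same individual_id, and to create a new role column for each data type, so that I get data like this: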

    |individual_id|       mail|phone|mail_role|phone_role|
    ------------------------------------------------------
    |            1|a@gmail.com|  123|secondary|   primary|
    |            2|b@gmail.com|  345|  primary| secondary|
    ------------------------------------------------------

I would really appreciate any help.

1 Answer:

Answer 0 (score: 1)

You can group by individual_id and use pyspark.sql.functions.first as the aggregate function. You have to set the ignorenulls parameter of first to True.

For the updated question, you can use pyspark.sql.functions.when.