How to subset a SparkR data frame

Asked: 2015-07-25 11:26:23

Tags: r apache-spark sparkr

Suppose we have a dataset 'people' that contains Id and Age as a 2-by-3 matrix.


Id = 1 2 3, Age = 21 18 30. In SparkR I want to create a new dataset that contains all the Ids that are older than 18; in this case those are Ids 1 and 3. In SparkR I would do

people2

but it does not work. How would you create the new dataset?

2 Answers:

Answer 0 (score: 2)

You can use SparkR::filter with either a column condition:

> people <- createDataFrame(sqlContext, data.frame(Id=1:3, Age=c(21, 18, 30)))
> filter(people, people$Age > 18) %>% head()

  Id Age
1  1  21
2  3  30

or a SQL string:

> filter(people, "Age > 18") %>% head()

  Id Age
1  1  21
2  3  30
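For comparison only (this is not part of the original answer): the same two filtering styles, a column expression versus a query string, also exist in pandas. A minimal sketch, assuming pandas is installed, using the same Id/Age data as above:

```python
import pandas as pd

# The same toy data as the SparkR example: Id = 1 2 3, Age = 21 18 30.
people = pd.DataFrame({"Id": [1, 2, 3], "Age": [21, 18, 30]})

# Boolean-mask expression, analogous to filter(people, people$Age > 18).
adults = people[people["Age"] > 18]

# Query string, analogous to filter(people, "Age > 18").
adults_q = people.query("Age > 18")

print(adults)  # keeps the rows with Id 1 and 3
```

Both forms return the same rows, just as the two SparkR::filter variants do.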

You can also use the SparkR::sql function with a raw SQL query on a registered table:

> registerTempTable(people, "people")
> sql(sqlContext, "SELECT * FROM people WHERE Age > 18") %>% head()
  Id Age
1  1  21
2  3  30

Answer 1 (score: 1)

For those who appreciate the many options R offers for any given task, you can also use the SparkR::subset() function:

> people2 <- subset(people, people$Age > 18)

To answer the additional detail from the comments:

> people <- createDataFrame(sqlContext, data.frame(Id=1:3, Age=c(21, 18, 30)))
> people2 <- subset(people, people$Age > 18, select = c(1,2))
> head(people2)
  Id Age
1  1  21
2  3  30
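For completeness, a pandas analogue of the subset-with-column-selection call above can be sketched as follows (assuming pandas; the names `people` and `people2` simply mirror the SparkR example):

```python
import pandas as pd

people = pd.DataFrame({"Id": [1, 2, 3], "Age": [21, 18, 30]})

# Analogous to subset(people, people$Age > 18, select = c(1, 2)):
# filter rows by the condition, then keep the first two columns by position.
people2 = people.loc[people["Age"] > 18, people.columns[[0, 1]]]
print(people2)
```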