How to write a Spark SQL DataFrame row as two lines to S3

Time: 2017-04-08 20:27:03

Tags: amazon-s3 apache-spark-sql spark-dataframe

I have a DataFrame. How can I write it to a file on S3 so that each row is written as two lines?
public static List<Point> getOnlyOnSegmentPoints(Point segment_first_point, Point segment_last_point, List<Point> points_to_test, double maximal_allowed_distance) {
    System.out.println("=== Segment : " + segment_first_point + " - " + segment_last_point);
    double segment_first_point_x = segment_first_point.getNumber(0);
    double segment_first_point_y = segment_first_point.getNumber(1);
    double segment_last_point_x = segment_last_point.getNumber(0);
    double segment_last_point_y = segment_last_point.getNumber(1);
    double test_x, test_y;
    double k_numerator, k_denominator, k;

    List<Point> returned = new ArrayList<>(points_to_test);

    for(Point point_to_test : points_to_test) {

        if(point_to_test == segment_first_point || point_to_test == segment_last_point) {
            continue;
        }

        test_x = point_to_test.getNumber(0);
        test_y = point_to_test.getNumber(1);

        // k = ((x - a).(b - a)) / ((b - a).(b - a))
        k_numerator = (test_x - segment_first_point_x) * (segment_last_point_x - segment_first_point_x)
                + (test_y - segment_first_point_y) * (segment_last_point_y - segment_first_point_y);

        k_denominator = (segment_last_point_x - segment_first_point_x) * (segment_last_point_x - segment_first_point_x)
                + (segment_last_point_y - segment_first_point_y) * (segment_last_point_y - segment_first_point_y);

        // Small epsilon keeps the division finite when the segment is degenerate (a == b).
        k = k_numerator / (k_denominator + 0.00001);

        // p = a + k * (b - a) : orthogonal projection of the tested point onto the segment's line
        List<String> coords_p = new ArrayList<>();
        coords_p.add("" + (segment_first_point_x + k * (segment_last_point_x - segment_first_point_x)));
        coords_p.add("" + (segment_first_point_y + k * (segment_last_point_y - segment_first_point_y)));
        Point p = new Point(coords_p);

        if(k < 0 && EuclidianFilters.distanceBetweenTwoPoints(point_to_test, segment_first_point) > maximal_allowed_distance) {
            // Projection falls before the segment: measure the distance to the first endpoint.
            returned.remove(point_to_test);
            System.out.println("------> Point removed x-a : " + point_to_test);

        } else if(k >= 0 && k <= 1 && EuclidianFilters.distanceBetweenTwoPoints(point_to_test, p) > maximal_allowed_distance) {
            // Projection falls on the segment: measure the distance to the projected point p.
            returned.remove(point_to_test);
            System.out.println("------> Point removed x-p : " + point_to_test);

        } else if(k > 1 && EuclidianFilters.distanceBetweenTwoPoints(point_to_test, segment_last_point) > maximal_allowed_distance) {
            // Projection falls past the segment: measure the distance to the last endpoint.
            returned.remove(point_to_test);
            System.out.println("------> Point removed x-b : " + point_to_test);
        }
        }

    }

    return returned;
}

1 answer:

Answer 0 (score: 0)

To write each row of a Spark SQL DataFrame as two lines to S3, you have to map each row of the DF to a corresponding string that contains a newline \n:

val df = sc.parallelize(Seq(
  ("Ravi", "Computers", 20),
  ("Jon", "Electronics", 21),
  ("Sam", "arts", 20))).toDF

df.map(r => s"Index:${r.getString(0)}\n${r.getString(0)} ${r.getString(1)} ${r.getInt(2)}")
  .write.csv("s3n://........")

This will write the DF in the expected output format:

       Line1: Index:Ravi
       Line2: Ravi Computers   20
       Line3: Index:Jon
       Line4: Jon  Electronics 21
       Line5: Index:Sam
       Line6: Sam  arts        20
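
For a fully self-contained variant, here is a minimal sketch that makes a few assumptions not in the original answer: Spark 2.x, a SparkSession built locally, named columns, and a placeholder output path s3n://my-bucket/output. It uses write.text instead of write.csv, since the CSV writer may quote values containing the embedded \n, whereas the text writer emits each string verbatim so every row really ends up on two lines:

import org.apache.spark.sql.SparkSession

object WriteTwoLineRows {
  def main(args: Array[String]): Unit = {
    // Assumption: the SparkSession is created here for the sketch; in a real job it may already exist.
    val spark = SparkSession.builder().appName("two-line-rows").getOrCreate()
    import spark.implicits._  // provides toDF and the Encoder[String] needed by map

    val df = Seq(
      ("Ravi", "Computers", 20),
      ("Jon", "Electronics", 21),
      ("Sam", "arts", 20)
    ).toDF("name", "subject", "age")

    // Turn every row into a single string that already contains both output lines.
    val twoLineRows = df.map(r =>
      s"Index:${r.getString(0)}\n${r.getString(0)} ${r.getString(1)} ${r.getInt(2)}")

    // write.text writes each string verbatim, so each DataFrame row becomes two physical lines.
    // "s3n://my-bucket/output" is a placeholder path.
    twoLineRows.write.text("s3n://my-bucket/output")

    spark.stop()
  }
}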