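Hive external table definition for a text file with multi-line records

Time: 2015-05-22 03:00:06

Tags: regex hadoop hive

I need to parse this file into a Hive table. It is a movie review dataset from Amazon. I am having trouble constructing a regular expression that parses the .txt file and creating the table with the correct column types.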

.txt

product/productId: B0001G6PZC
review/userId: A3F3THLLZXURQN
review/profileName: A. Y
review/helpfulness: 3/3
review/score: 4.0
review/time: 1199664000
review/summary: Good story, Good action. Good Drama. Good Movie
review/text: When I first heard of this movie, I didn't think it would be that great, so I never bothered to go see it in theaters. Later on, I ended up downloading the movie, and didn't think much of it.<br /><br />But now after watching the movie on BD, I think that the movie is quite outstanding. Its got a good story behind it, with some level of historical basis behind it with Samurai becoming phased out into Japan's modernization.<br /><br />It does a good job in immersing you into the conflicts that warriors must endure... and yet, find peace with the way of the Samurai as they are a warrior race and not savages.<br /><br />4/5 stars.

product/productId: B0001G6PZC
review/userId: A3J78KAIPW6KAH
review/profileName: Joan Paolo De Bastos "conde_almasy"
review/helpfulness: 3/3
review/score: 4.0
review/time: 1198540800
review/summary: Good Movie. Wonderful Visuals. A Great Way to SHOW OFF you Hi-Def System
review/text: Last Samurai is no masterpiece<br /><br />but technically it is<br /><br />the visuals, the sound effects, the music.<br /><br />If you want to show off to your friends what a great hi-def system you got, purchase this movie.<br /><br />If you want a classic, but lord of the rings or gone with the wind instead.

product/productId: B0001G6PZC
review/userId: A3F3B6HY9RJI04
review/profileName: James Duckett
review/helpfulness: 3/3
review/score: 5.0
review/time: 1192060800
review/summary: Great Movie, Fantastic HD Quality
review/text: After picking up my HD DVD player I've had troubles watching regular DVD movies.  I had heard some good things about this movie but couldn't pass it up once it was in high definition.<br /><br />The story is pretty good.  This is the story of Captain Algren who has been sent to Japan in the late 1800's in order to help them modernize the Japanese army as they go from fighting with swords and arrows to machine guns and cannons.<br /><br />After the "modern" Japanese army prematurely attacks the Samurai and lose horribly, Captain Algren is taken captive by the Samurai and introduced to their way of life and refusal to lay down the sword in the name of compliance.  In time, Captain Algren finds himself wanting to become one of the Samurai and learning more of their way of life.<br /><br />The story is pretty good but what raises this up to the level of being outstanding is the high definition quality of the movie.  It was fantastic, especially seeing the colorful Japanese landscape in all of its magnificence.<br /><br />If you like Tom Cruise action movies, this is one to pick up especially in high definition (whether it be Blu-Ray or HD DVD).  The violence can be extremely graphic (hey, this is war) so if you are sensitive to that you may want to look for something else.  Otherwise, the pacing of the movie is pretty good.  It isn't an all out gore-fest... there is action and then it breaks and lets you relax and catch up a little bit and then goes back to action and so on and so forth.

Here is my SQL:

CREATE EXTERNAL TABLE movies(id string, uId string, profileName string, helpfulness string, score float, time int, summary string, text string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
WITH serdeproperties( "input.regex" = "[ ].*", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s"")
location '/user/hduser/moviesTest';

However, Hive does not parse it correctly. SELECT * FROM movies gives me this result:

NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL

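Can anyone tell me what I am doing wrong?

1 Answer:

Answer 0 (score: 1)

This can be done easily with a Hive UDF.

Assume your data is loaded into a table with a single column named line: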

create table temp(line String);
load data local inpath 'review.txt' into table temp;
select line from temp;

product/productId: B0001G6PZC
review/userId: A3F3THLLZXURQN
review/profileName: A. Y
review/helpfulness: 3/3
review/score: 4.0
review/time: 1199664000
review/summary: Good story, Good action. Good Drama. Good Movie
review/text: When I first heard of this movie, I didn't think it would be that great, so I never bothered to go see it in theaters. Later on, I ended up downloading the movie, and didn't think much of it.<br /><br />But now after watching the movie on BD, I think that the movie is quite outstanding. Its got a good story behind it, with some level of historical basis behind it with Samurai becoming phased out into Japan's modernization.<br /><br />It does a good job in immersing you into the conflicts that warriors must endure... and yet, find peace with the way of the Samurai as they are a warrior race and not savages.<br /><br />4/5 stars.

product/productId: B0001G6PZC
review/userId: A3J78KAIPW6KAH
review/profileName: Joan Paolo De Bastos "conde_almasy"
review/helpfulness: 3/3
review/score: 4.0
review/time: 1198540800
............

............

Create a Hive UDF in Java. Here is the source:

package HiveUDF;

import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Stateful UDF: buffers input lines until a "review/text:" line arrives, then
 * returns the whole record as one tab-separated string. Every other line
 * yields null. It relies on the lines of a record arriving in file order.
 */
public class ReviewDataUdf extends UDF {
    // Field labels in the order they appear within a record.
    private static final String[] LABELS = {
        "product/productId:", "review/userId:", "review/profileName:", "review/helpfulness:",
        "review/score:", "review/time:", "review/summary:", "review/text:"
    };

    // Lines of the record currently being assembled.
    private final StringBuilder buffer = new StringBuilder();

    public String evaluate(String line) {
        if (line == null) {
            return null;
        }
        buffer.append(' ').append(line);
        if (!line.contains("review/text:")) {
            return null; // the record is not complete yet
        }
        String record = buffer.toString();
        buffer.setLength(0); // start a fresh record on the next call

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < LABELS.length; i++) {
            if (i > 0) {
                out.append('\t');
            }
            String next = (i + 1 < LABELS.length) ? LABELS[i + 1] : null;
            out.append(extract(record, LABELS[i], next));
        }
        return out.toString();
    }

    // Returns the trimmed text between label and nextLabel (or the end of the
    // record when nextLabel is null or not found); "N/A" if label is missing.
    private static String extract(String record, String label, String nextLabel) {
        int start = record.indexOf(label);
        if (start < 0) {
            return "N/A";
        }
        start += label.length(); // skip the label itself, whatever its length
        int end = (nextLabel == null) ? -1 : record.indexOf(nextLabel, start);
        if (end < 0) {
            end = record.length();
        }
        return record.substring(start, end).trim();
    }
}
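
The UDF above works because it keeps state between calls: it returns null for every line until it sees the review/text: line that closes a record. The sketch below reproduces only that buffering step in plain Java, without any Hive dependency, so the behaviour can be tried locally. RecordAssemblerDemo, RecordAssembler, accept and assemble are hypothetical names used for illustration, and the review text is shortened sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecordAssemblerDemo {
    // Hypothetical stand-in for the UDF's state: collects lines until the
    // record's last field ("review/text:") has been seen.
    static class RecordAssembler {
        private final StringBuilder buffer = new StringBuilder();

        // Returns the joined record once it is complete, otherwise null.
        String accept(String line) {
            buffer.append(' ').append(line);
            if (!line.contains("review/text:")) {
                return null;
            }
            String record = buffer.toString().trim();
            buffer.setLength(0); // reset for the next record
            return record;
        }
    }

    // Feeds the lines to one assembler, in order, and keeps the non-null results.
    static List<String> assemble(List<String> lines) {
        RecordAssembler assembler = new RecordAssembler();
        List<String> records = new ArrayList<>();
        for (String line : lines) {
            String record = assembler.accept(line);
            if (record != null) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "product/productId: B0001G6PZC",
            "review/userId: A3F3THLLZXURQN",
            "review/text: shortened sample text",
            "",
            "product/productId: B0001G6PZC",
            "review/userId: A3J78KAIPW6KAH",
            "review/text: another shortened sample");
        for (String record : assemble(lines)) {
            System.out.println(record);
        }
    }
}
```

The approach depends on all lines of a record reaching the same instance in file order. That holds for a small file read by a single task; for a large file divided among several tasks, a record that straddles a split boundary would probably come out incomplete, so it is worth checking the row count against the number of records.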

Export ReviewDataUdf.jar, then add it to Hive and create the function.

hive> ADD JAR /home/Kishore/ReviewDataUdf.jar;

hive> create temporary FUNCTION structReview as 'HiveUDF.ReviewDataUdf';

Use the structReview function to get the structured data.

CREATE TABLE AmazonReview AS
SELECT split(review, "\t")[0] AS productId,
       split(review, "\t")[1] AS userId,
       split(review, "\t")[2] AS profileName,
       split(review, "\t")[3] AS helpfulness,
       split(review, "\t")[4] AS score,
       split(review, "\t")[5] AS time,
       split(review, "\t")[6] AS summary,
       split(review, "\t")[7] AS text
FROM (SELECT structReview(line) AS review FROM temp) b
WHERE review IS NOT NULL;

The data is now available in structured form in the AmazonReview table:

select productId, userId, profileName from AmazonReview;
OK
B0001G6PZC   A3F3THLLZXURQN      A. Y 
B0001G6PZC   A3J78KAIPW6KAH      Joan Paolo De Bastos "conde_almasy" 
B0001G6PZC   A3F3B6HY9RJI04      James Duckett
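
As for why the original table returned only NULLs: a regex SerDe applies the pattern to one line at a time, the pattern has to match the whole line, and each column is filled from one capture group. The pattern "[ ].*" only matches lines that start with a space and contains no groups, so nothing can be extracted. The sketch below imitates that behaviour with java.util.regex. It is a simplified stand-in rather than the SerDe's actual code, and RegexSerDeCheck and parseLine are hypothetical names.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexSerDeCheck {
    // Simplified imitation of how a regex SerDe treats one input line:
    // the pattern must match the whole line, and column i comes from group i + 1.
    static String[] parseLine(String regex, String line, int numColumns) {
        String[] row = new String[numColumns]; // every column starts as null
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return row; // no full-line match: all columns stay null
        }
        for (int i = 0; i < numColumns && i < m.groupCount(); i++) {
            row[i] = m.group(i + 1);
        }
        return row;
    }

    public static void main(String[] args) {
        String line = "review/score: 4.0";
        // The question's pattern: the line does not start with a space, so no match.
        System.out.println(Arrays.toString(parseLine("[ ].*", line, 8)));
        // A pattern with one capture group matches this line but fills one column only.
        System.out.println(Arrays.toString(parseLine("review/score: (.*)", line, 8)));
    }
}
```

The first call prints eight nulls, which mirrors the one all-NULL row per input line in the question. Since each record spans several lines, no single-line pattern can populate all eight columns, which is why the lines are merged into one record before splitting.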