I am using the IMDb data set for a proof of concept (POC).
Data is available here
A sample of the data (tab-separated) looks like this:
nm0000006 Ingrid Bergman 1915 1982 actress,soundtrack,producer tt0038109,tt0071877,tt0034583,tt0038787
nm0000007 Humphrey Bogart 1899 1957 actor,soundtrack,producer tt0033870,tt0038355,tt0034583,tt0040897
nm0000008 Marlon Brando 1924 2004 actor,soundtrack,director tt0068646,tt0047296,tt0078346,tt0078788
nm0000009 Richard Burton 1925 1984 actor,producer,soundtrack tt0057877,tt0061184,tt0065207,tt0087803
nm0000010 James Cagney 1899 1986 actor,soundtrack,director tt0042041,tt0029870,tt0055256,tt0035575
nm0000011 Gary Cooper 1901 1961 actor,soundtrack,producer tt0044706,tt0049233,tt0033891,tt0027996
I created the table as follows (with a collection delimiter so that the comma-separated columns are parsed as arrays):
CREATE EXTERNAL TABLE casts (
  id STRING,
  name STRING,
  birthYear INT,
  deathYear INT,
  profession ARRAY<STRING>,
  titles ARRAY<STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  LINES TERMINATED BY '\n'
TBLPROPERTIES ("skip.header.line.count"="1");
I want to run a query such as: who were the actors in a particular movie title (say tt0057877)?
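To make the goal concrete, here is a minimal sketch of one possible form of that query. It assumes the titles and profession columns are actually parsed as arrays (which requires a comma as the collection delimiter in the table definition), and it treats both 'actor' and 'actress' as acting professions.

```sql
-- Sketch: people credited on one title, restricted to acting professions.
-- Assumes `titles` and `profession` are ARRAY<STRING> columns split on ','.
SELECT id, name
FROM casts
WHERE array_contains(titles, 'tt0057877')
  AND (array_contains(profession, 'actor')
       OR array_contains(profession, 'actress'));
```

array_contains is a built-in Hive function, so no UDF is needed for this part.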
I also have a second data set with ratings, a sample of which looks like this:
tconst averageRating numVotes
tt0000001 5.8 1347
tt0000002 6.5 156
tt0000003 6.6 929
tt0000004 6.4 93
tt0000005 6.2 1613
I also want to run a query such as: show the top 10 actors who acted in the top-rated movies.
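A rough sketch of how this might look with only built-in functions is below. It assumes the ratings data is loaded into a table named ratings (a hypothetical name) with columns tconst, averageRating and numVotes, and it interprets "top rated" as highest averageRating with numVotes as a tie-breaker. Other interpretations, such as a minimum vote threshold, would need a different WHERE clause.

```sql
-- Sketch: flatten each person's titles array into one row per title,
-- then join to the ratings and keep the 10 best-rated actor/title pairs.
-- `ratings` is an assumed table name; it is not defined in the question.
SELECT p.id,
       p.name,
       r.tconst,
       r.averageRating AS rating,
       r.numVotes      AS votes
FROM (
  SELECT c.id, c.name, t.title_id
  FROM casts c
  LATERAL VIEW explode(c.titles) t AS title_id
  WHERE array_contains(c.profession, 'actor')
     OR array_contains(c.profession, 'actress')
) p
JOIN ratings r
  ON r.tconst = p.title_id
ORDER BY rating DESC, votes DESC
LIMIT 10;
```

Note that this returns actor/title pairs, so the same actor can appear more than once. Grouping by p.id and p.name and ordering by MAX(r.averageRating) would give one row per actor instead.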
Is there a way to do the above in Hive (preferably without a UDF)?
Thanks!