In Hive, when I load data from a CSV file, I only get some of the columns instead of the whole thing

Time: 2018-04-06 23:00:03

Tags: hive loading import-from-csv

Below are the columns in my data source

BibNum  
Title   
Author  
ISBN    
PublicationYear 
Publisher   
Subjects    
ItemType    
ItemCollection  
FloatingItem    
ItemLocation    
ReportDate  
ItemCount

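I only get values up to the Publisher column. I have uploaded a screenshot; if you know the cause and how to fix it, please let me know, I would really appreciate it: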

Screenshot of messy columns

Below are the real values of the first row (I use // to mark the boundary between columns)

3011076// 
A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield, Frederick Gardner, Megan Petasky, and Allen Tam.     // 
O'Ryan, Ellie   // 
1481425730, 1481425749, 9781481425735, 9781481425742    // 
2014    // 
Simon Spotlight,    Musicians Fiction, Bullfighters Fiction, Best friends Fiction, Friendship Fiction, Adventure and adventurers Fiction    // 
jcbk    // 
ncrdr   // 
Floating // 
qna  // 
09/01/2017 //   
1

These are the real values of the second row

2248846 //  
Naruto. Vol. 1, Uzumaki Naruto / story and art by Masashi Kishimoto ; [English adaptation by Jo Duffy]. // 
Kishimoto, Masashi, 1974- //    
1569319006  // 
2003, c1999.    // 
Viz,    Ninja Japan Comic books strips etc, Comic books strips etc Japan Translations into English, Graphic novels //   
acbk//  
nycomic//   
NA//    
lcy//   
09/01/2017//    
1



hive> select * from timesheet limit 3;
OK
NULL    Title   Author  ISBN    PublicationYear Publisher   Subjects    ItemType    ItemCollection  FloatingItem    ItemLocation    ReportDate  ItemCount
3011076 "A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield Frederick Gardner    Megan Petasky   and Allen Tam."    "O'Ryan  Ellie" "1481425730  1481425749  9781481425735   9781481425742" 2014.   "Simon Spotlight
2248846 "Naruto. Vol. 1  Uzumaki Naruto / story and art by Masashi Kishimoto ; [English adaptation by Jo Duffy]."   "Kishimoto   Masashi     1974-" 1569319006  "2003    c1999."    "Viz    "   "Ninja Japan Comic books strips etc  Comic books strips etc Japan Translations into English
Time taken: 0.149 seconds
hive> desc timesheet
    > ;
OK
bibnum  bigint  
title   string  
author  string  
isbn    string  
publication string  
publisher   string  
subjects    string  
itemtype    string  
itemcollection  string  
floatingitem    string  
itemlocation    string  
reportdate  string  
itemcount   string  
Time taken: 0.21 seconds

BibNum,Title,Author,ISBN,PublicationYear,Publisher,Subjects,ItemType,ItemCollection,FloatingItem,ItemLocation,ReportDate,ItemCount | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | |

3011076,"A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield, Frederick Gardner, Megan Petasky, and Allen Tam.","O'Ryan, Ellie","1481425730, 1481425749, 9781481425735, 9781481425742",2014.,"Simon Spotlight,","Musicians Fiction, Bullfighters Fiction, Best friends Fiction, Friendship Fiction, Adventure and adventurers Fiction",jcbk,ncrdr,Floating,qna,09/01/2017,1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |

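2 answers:

Answer 0: (score: 0)

Hive's default delimited text format cannot handle a CSV like this one, where quoted fields themselves contain commas, but a SerDe (Serializer/Deserializer) can help solve this.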

Hive v0.14+ ships with a built-in CSV SerDe (org.apache.hadoop.hive.serde2.OpenCSVSerde) whose default separator is a comma (,), so it should work for your CSV.
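A minimal sketch of what such a table definition could look like, using the column names from the question. The table name timesheet_csv is hypothetical, and the three SerDe properties are simply the SerDe's defaults written out explicitly. Note that OpenCSVSerde treats every column as a string, and the header-skipping property assumes the file has exactly one header line.

```sql
-- Sketch only: timesheet_csv is a hypothetical table name.
-- OpenCSVSerde reads every column as STRING, whatever type is declared.
CREATE TABLE timesheet_csv (
  bibnum          STRING,
  title           STRING,
  author          STRING,
  isbn            STRING,
  publicationyear STRING,
  publisher       STRING,
  subjects        STRING,
  itemtype        STRING,
  itemcollection  STRING,
  floatingitem    STRING,
  itemlocation    STRING,
  reportdate      STRING,
  itemcount       STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",   -- field separator (the default)
  "quoteChar"     = "\"",  -- commas inside double quotes stay in one field
  "escapeChar"    = "\\"   -- escape character (the default)
)
STORED AS TEXTFILE
-- Assumes one header line; without this the header is read as a data row.
TBLPROPERTIES ("skip.header.line.count" = "1");
```

With a definition along these lines, a quoted value such as "O'Ryan, Ellie" should arrive as a single author value instead of being split at the comma.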

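If any column contains unescaped quotes, you will have to go in manually and work out which values belong to which column...

Answer 1: (score: 0)

The CSV file is comma-separated, but Hive's default field delimiter is not a comma (it is the Ctrl-A character, \001). If you declare the columns without specifying a delimiter, the whole line is loaded into the first column. So when creating the table, specify that the fields in each row are separated by commas.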

create table table_name (
....
) row format delimited fields terminated by ',' lines terminated by '\n';

Then load the CSV file with

load data local inpath 'path_to_file' into table table_name;
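
The first row of the question's output (NULL Title Author ...) looks like the CSV header being read as data: the text BibNum cannot be parsed as a bigint, so it shows up as NULL. A small sketch of one way to skip it, assuming the file has exactly one header line and the table is the question's timesheet table:

```sql
-- Assumption: the CSV has exactly one header line.
-- Hive then ignores that first line when reading the table's files.
ALTER TABLE timesheet SET TBLPROPERTIES ("skip.header.line.count" = "1");

-- Re-check the first few rows; the header row should no longer appear.
SELECT * FROM timesheet LIMIT 3;
```

This only hides the header row; it does not change how commas inside quoted fields are split.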

Hope this helps :)