PySpark error: input doesn't have the expected number of values required by the schema, and extra trailing commas after columns

Date: 2018-04-16 04:05:41

Tags: python apache-spark indexing pyspark hdfs

First of all, I made two tables (RDDs).

The keys in the first RDD are BibNum, ItemCollection, and CheckoutDateTime. When I check the values of the first RDD with rdd1.take(2), it shows:

[((u'BibNum', u'ItemCollection', u'CheckoutDateTime'), 1), ((u'1842225', u'namys', u'05/23/2005 03:20:00 PM'), 1)]

Similarly, the keys in the second RDD are BibNum, ItemCollection, and ItemLocation. Its values are as follows:

[((u'BibNum', u'ItemCollection', u'ItemLocation'), 1), ((u'3011076', u'ncrdr', u'qna'), 1)]

After creating the two RDDs, I tried to join them with:

rdd3=rdd1.join(rdd2)

Then, when I checked the values of rdd3 with rdd3.take(2), the following error occurred:

IndexError: list index out of range

I have no idea why this happened. If you know the reason, please let me know. If you have any questions about my problem or my code, let me know and I will try to clarify. Thanks.

EDIT --- Here is my sample input data for each RDD:

BibNum,ItemBarcode,ItemType,ItemCollection,CallNumber,CheckoutDateTime,,,,,,,
1842225,10035249209,acbk,namys,MYSTERY ELKINS1999,05/23/2005 03:20:00 PM,,,,,,,
1928264,10037335444,jcbk,ncpic,E TABACK,12/14/2005 05:56:00 PM,,,,,,,
1982511,10039952527,jcvhs,ncvidnf,VHS J796.2 KNOW_YO 2000,08/11/2005 01:52:00 PM,,,,,,,
2026467,10040985615,accd,nacd,CD 782.421642 Y71T,10/19/2005 07:47:00 PM,,,,,,,
2174698,10047696215,jcbk,ncpic,E KROSOCZ,12/29/2005 03:42:00 PM,,,,,,,
1602768,10028318730,jcbk,ncpic,E BLACK,10/08/2005 02:15:00 PM,,,,,,,
2285195,10053424767,accd,cacd,CD 782.42166 F19R,09/30/2005 10:16:00 AM,,,,Input,BinNumber,Date,BinNumber+Month
2245955,10048392665,jcbk,ncnf,J949.73 Or77S 2004,12/05/2005 05:03:00 PM,,,,,,,
770918,10044828100,jcbk,ncpic,E HILL,07/22/2005 03:17:00 PM,,,,,,,

BibNum,Title,Author,ISBN,PublicationYear,Publisher,Subjects,ItemType,ItemCollection,FloatingItem,ItemLocation,ReportDate,ItemCount,,,,,,,,,,,,,
3011076,A tale of two friends / adapted by Ellie O'Ryan ; illustrated by Tom Caulfield| Frederick Gardner| Megan Petasky| and Allen Tam.,O'Ryan| Ellie,1481425730| 1481425749| 9781481425735| 9781481425742,2014,Simon Spotlight|,Musicians Fiction| Bullfighters Fiction| Best friends Fiction| Friendship Fiction| Adventure and adventurers Fiction,jcbk,ncrdr,Floating,qna,09/01/2017,1,,,,,,,,,,,,,
2248846,Naruto. Vol. 1| Uzumaki Naruto / story and art by Masashi Kishimoto ; [English adaptation by Jo Duffy].,Kishimoto| Masashi| 1974-,1569319006,2003| c1999.,Viz|,Ninja Japan Comic books strips etc| Comic books strips etc Japan Translations into English| Graphic novels,acbk,nycomic,NA,lcy,09/01/2017,1,,,,,,,,,,,,,
3209270,Peace| love & Wi-Fi : a ZITS treasury / by Jerry Scott and Jim Borgman.,Scott| Jerry| 1955-,144945867X| 9781449458676,2014,Andrews McMeel Publishing|,Duncan Jeremy Fictitious character Comic books strips etc| Teenagers United States Comic books strips etc| Parent and teenager Comic books strips etc| Families Comic books strips etc| Comic books strips etc| Comics Graphic works| Humorous comics,acbk,nycomic,NA,bea,09/01/2017,1,,,,,,,,,,,,,
1907265,The Paris pilgrims : a novel / Clancy Carlile.,Carlile| Clancy| 1930-,786706155,c1999.,Carroll & Graf|,Hemingway Ernest 1899 1961 Fiction| Biographical fiction| Historical fiction,acbk,cafic,NA,cen,09/01/2017,1,,,,,,,,,,,,,
1644616,Erotic by nature : a celebration of life| of love| and of our wonderful bodies / edited by David Steinberg.,,094020813X,1991| c1988.,Red Alder Books/Down There Press|,Erotic literature American| American literature 20th century,acbk,canf,NA,cen,09/01/2017,1,,,,,,,,,,,,,

EDIT --- The schema of date_count is shown like this:

date_count --> DataFrame[BibNum: string, ItemCollection: string, CheckoutDateTime: string, count: bigint]

But when I check its values with date_count.take(2), it shows an error like this: input doesn't have the expected number of values required by the schema; 6 fields are required while 7 values are provided.

The df_final schema is shown below (note the empty-named string columns at the end):

DataFrame[BibNum:string, ItemType:string, ItemCollection:string, ItemBarcode:string, CallNumber:string, CheckoutDateTime:string, Title:string, Author:string, ISBN:string, PublicationYear:string, Publisher:string, Subjects:string, FloatingItem:string, ItemLocation:string, ReportDate:string, ItemLocation:string, :string, :string, :string, ... , :string]

1 answer:

Answer 0 (score: 1)

So I will try to answer your question. The solution may be syntactically a bit messy, but I will do my best (I don't have an environment I can test in right now). Let me know if this is what you are looking for; otherwise I can help you fine-tune the solution.

Here is the documentation for Join in Pyspark.

So, when you read the files:

rdd1 = sc.textFile('checkouts').map(lambda line: line.split(','))
rdd2 = sc.textFile('inventory2').map(lambda line: line.split(','))

# Grab the header row of each file
rdd1_header = rdd1.first()
rdd2_header = rdd2.first()

# Drop the header row and build a DataFrame from each RDD,
# using the header fields as column names
rdd1_df = rdd1.filter(lambda line: line != rdd1_header).toDF(rdd1_header)
rdd2_df = rdd2.filter(lambda line: line != rdd2_header).toDF(rdd2_header)

# Join on every column name the two DataFrames share
common_cols = [x for x in rdd1_df.columns if x in rdd2_df.columns]
df_final = rdd1_df.join(rdd2_df, on=common_cols)

date_count = df_final.groupBy(["BibNum", "ItemCollection", "CheckoutDateTime"]).count()
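
If the files load cleanly, a quick sanity check would look like the following (hypothetical output, since I can't run this right now):

print(common_cols)      # e.g. ['BibNum', 'ItemCollection', ...], the shared column names
df_final.printSchema()  # each shared column should appear exactly once after the join
date_count.show(5)      # counts per (BibNum, ItemCollection, CheckoutDateTime)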

EDIT:

1) Your error, "pyspark.sql.utils.AnalysisException: u"Reference 'ItemCollection' is ambiguous, could be: ItemCollection#3, ItemCollection#21"", is caused by the join producing duplicate columns. What you need to do is include all of the common columns in your join condition, as the code above does with common_cols.
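
To illustrate, a minimal sketch (the #3/#21 tags in the error message are Spark's internal column IDs and vary between runs):

# Joining on BibNum alone leaves two ItemCollection columns in the result,
# so a later reference to "ItemCollection" raises the ambiguity error.
ambiguous = rdd1_df.join(rdd2_df, on=["BibNum"])

# Passing every shared column name keeps a single copy of each join column.
unambiguous = rdd1_df.join(rdd2_df, on=common_cols)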

2) Your other issue: some strange parts are appended to the end of each Row, for example:

[Row(BibNum=u'1842225', ItemBarcode=u'10035249209', ItemType=u'acbk', ItemCollection=u'namys', CallNumber=u'MYSTERY ELKINS1999', CheckoutDateTime=u'05/23/2005 03:20:00 PM', =u'', =u'', =u'', =u'', =u'', =u'', =u'')]

For this, you mentioned that your CSV file looks like this:

BibNum,ItemBarcode,ItemType,ItemCollection,CallNumber,CheckoutDateTime,,,,,,,
1842225,10035249209,acbk,namys,MYSTERY ELKINS1999,05/23/2005 03:20:00 PM,,,,,,,
1928264,10037335444,jcbk,ncpic,E TABACK,12/14/2005 05:56:00 PM,,,,,,,

Now, as you can see, there are a lot of trailing commas after the date column, i.e. ',,,,,,,'. After splitting on commas, these produce the extra empty columns, and you can simply drop them.
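
One way to drop them, as a sketch under the assumption that the real columns always come first (as in your sample); it reuses rdd1 and rdd1_header from the code above:

# Keep only the non-empty header names, then truncate each data row to that
# width so the values line up with the schema. Truncating also handles rows
# with stray extra values, like the 'Input,BinNumber,...' line in your sample.
named_cols = [c for c in rdd1_header if c.strip()]
rdd1_df = (rdd1
           .filter(lambda line: line != rdd1_header)
           .map(lambda fields: fields[:len(named_cols)])
           .toDF(named_cols))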