The only post I found that touches on the subject is here, but it does not solve my problem.
Here is our problem: we gathered our Parquet data into a local backup in the following way:
$ hadoop fs -getmerge /dir/on/hdfs /local/dir
Our mistake was to believe that the multi-file organization of Parquet was an artifact of how HDFS writes data; we did not realize that it is in fact the normal layout of a Parquet dataset. So (not very cleverly) we used HDFS getmerge to make the backup. The problem is that our original data has since been deleted, and we are now struggling to recover it from the merged file.
While analyzing the merged file (and reading the docs), we found that every Parquet file originally consisted of a block of data plus metadata enclosed between the magic bytes 'PAR1'. Add to that the two metadata files, _metadata and _common_metadata.
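A minimal sketch (not part of the original post) that makes this layout visible on a single .parquet file: the magic bytes at both ends, and the 4-byte little-endian footer length stored just before the trailing magic.
# Minimal layout check (illustrative sketch; path is taken from argv[1]):
# a Parquet file starts and ends with 'PAR1', and the 4 bytes before the
# trailing magic hold the footer (metadata) length, little-endian.
import struct
import sys

with open(sys.argv[1], 'rb') as f:
    head = f.read(4)                    # expected: 'PAR1'
    f.seek(-8, 2)                       # last 8 bytes: footer length + magic
    footer_len = struct.unpack('<I', f.read(4))[0]
    tail = f.read(4)                    # expected: 'PAR1'

print "head magic :", head
print "tail magic :", tail
print "footer     :", footer_len, "bytes of metadata before the tail magic"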
Since getmerge processes the files of the original Parquet directory on HDFS in order, I came up with a script that takes each run of bytes between two 'PAR1' markers and turns it back into a chunk file. The first two files rebuilt this way are _common_metadata and _metadata.
filePrefix='part-'
finalFilePrefix='part-r-'

# Split the merged file on the 'PAR1' magic: every second record is the
# byte run between an opening and a closing magic, written out as part-N.
awk 'NR%2==0{ print $0 > "part-"i++ }' RS='PAR1' $1

nbFiles=$(ls -lah | grep 'part-' | wc -l)

for num in $(seq 0 $nbFiles)
do
    fileName="$filePrefix$num"
    lastName=""
    if [ "$num" -eq "0" ]; then
        # First chunk is _common_metadata.
        lastName="_common_metadata"
        awk '{print "PAR1" $0 "PAR1"}' $fileName > $lastName
    elif [ "$num" -eq "1" ]; then
        # Second chunk is _metadata.
        lastName="_metadata"
        awk '{print "PAR1" $0 "PAR1"}' $fileName > $lastName
    elif [ -e $fileName ]; then
        # Remaining chunks become the numbered part files.
        count=$( printf "%05d" $(($num-2)) )
        lastName="$finalFilePrefix$count.gz.parquet"
        awk '{print "PAR1" $0 "PAR1"}' $fileName > $lastName
    fi
    echo $lastName
    # Drop the trailing newline that awk's print appends.
    truncate --size=-1 $lastName
    rm -f "$fileName"
done

# Rebuild the Parquet directory in place of the merged file.
mv $1 $1.backup
mkdir $1
mv _* $1
mv part* $1
Some observations on the result of the script:
Code:
val newDataDF = sqlContext.read.parquet("/tmp/userActionLog2-leclerc-culturel-2016.09.04.parquet")
newDataDF.take(1)
Error:
newDataDF: org.apache.spark.sql.DataFrame = [bson: binary]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, hdp-node4.affinytix.com): java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 13
at org.apache.parquet.format.Util.read(Util.java:216)
at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
at org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:668)
at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.checkEndOfRowGroup(UnsafeRowParquetRecordReader.java:604)
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.loadBatch(UnsafeRowParquetRecordReader.java:218)
at org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.nextKeyValue(UnsafeRowParquetRecordReader.java:196)
at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1881)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1881)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 13
at parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806)
at parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500)
at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158)
at org.apache.parquet.format.PageHeader.read(PageHeader.java:828)
at org.apache.parquet.format.Util.read(Util.java:213)
... 32 more
Given that our data is at stake here, if anyone has an idea that could help, I thank him (or her) warmly in advance.
Cheers
Answer 0 (score: 0):
I have the answer to my question.
My basic idea at the start was fine. The problem is simply that awk (in the script above) adds many extra characters: it works line by line, so the 'PAR1' wrapping and the newline it appends end up inside the binary chunks as well, and the Parquet blocks are unreadable afterwards.
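To make the failure mode concrete (this simulation is an illustration, not from the post): awk is line-oriented, so the wrapping step awk '{print "PAR1" $0 "PAR1"}' treats every 0x0A byte inside a binary chunk as a line boundary, injecting magics and newlines in the middle of the data:
# Pure-Python rendering of what awk's line-wise print does to a binary
# chunk that happens to contain a newline byte (0x0A).
chunk = '\x01\x02\n\x03\x04'                  # binary data with a newline inside
wrapped = ''.join('PAR1' + line + 'PAR1\n'    # each line wrapped, ORS appended
                  for line in chunk.split('\n'))
print repr(wrapped)  # 'PAR1\x01\x02PAR1\nPAR1\x03\x04PAR1\n': magics mid-data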
The solution is to manipulate the merged file programmatically (Python, Perl, ...). Here is the Python solution I came up with. It is equivalent to the previous script, but does not add spurious characters.
Code:
print "create parquet script."
import sys
filename = sys.argv[1]
import locale
currencode=locale.getpreferredencoding()
import io
print "====================================================================="
print "Create parquet from: ", filename
print "defautl buffer size: ", io.DEFAULT_BUFFER_SIZE
print "default encoding of the system: ", currencode
print "====================================================================="
import re
magicnum = "PAR1"
with io.open(filename, mode='rb') as f:
content = f.read()
res = [ magicnum + chunk + magicnum for chunk in filter(lambda s: s!="", re.split(magicnum, content)) ]
szcontent = len(res[2:])
for i in range(0,szcontent) :
si = str(i)
write_to_binfile("part-r-{}.gz.parquet".format(si.zfill(5)), res[i+2])
write_to_binfile("_common_metadata", res[0])
write_to_binfile("_metadata", res[1])
import os
os.system("mv {} {}.backup".format(filename, filename))
os.system("mkdir {}".format(filename))
os.system("mv _* {}".format(filename))
os.system("mv part* {}".format(filename))
Observations: the merged Parquet file must not be too big, since the Python code above loads it entirely into memory as one string (a few tens of megabytes are fine)! And it has to run on Linux/Unix, since the final system calls are Unix-based.
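On the size limitation: a hedged sketch of a variant (same alternating-magic assumption as above, and it still assumes 'PAR1' never occurs inside a chunk's body) that memory-maps the merged file instead of reading it into one string, so the OS pages data in on demand:
# Streaming-friendly variant (illustrative sketch): mmap the merged file
# and slice chunks out between consecutive 'PAR1' offsets.
import mmap
import re
import sys

magicnum = 'PAR1'

with open(sys.argv[1], 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Offsets of every 'PAR1'; chunk i lies between markers 2i and 2i+1.
    starts = [m.start() for m in re.finditer(magicnum, mm)]
    for i, (lo, hi) in enumerate(zip(starts[0::2], starts[1::2])):
        if i == 0:
            name = '_common_metadata'
        elif i == 1:
            name = '_metadata'
        else:
            name = 'part-r-{0:05d}.gz.parquet'.format(i - 2)
        with open(name, 'wb') as out:
            out.write(mm[lo:hi + len(magicnum)])  # slice keeps both magics
    mm.close()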
Cheers