I am using the publicly available CSV datasets from MovieLens. I created a partitioned dataset for ratings.csv:
kite-dataset create ratings --schema rating.avsc --partition-by year-month.json --format parquet
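For reference, a minimal rating.avsc along these lines would match the MovieLens ratings.csv columns (userId,movieId,rating,timestamp). This exact schema is an assumption, not taken from the question, but the timestamp field must be a long for the time-based partitioners below to work:
{
  "type" : "record",
  "name" : "Rating",
  "fields" : [
    { "name" : "userId", "type" : "int" },
    { "name" : "movieId", "type" : "int" },
    { "name" : "rating", "type" : "double" },
    { "name" : "timestamp", "type" : "long" }
  ]
}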
This is my year-month.json:
[ {
"name" : "year",
"source" : "timestamp",
"type" : "year"
}, {
"name" : "month",
"source" : "timestamp",
"type" : "month"
} ]
This is my CSV import command:
kite-dataset csv-import ratings.csv ratings
After the import completed, I ran this command to see which year and month partitions were actually created:
hadoop fs -ls /user/hive/warehouse/ratings/
What I noticed is that only a single year partition was created, with a single month partition inside it:
[cloudera@quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/
Found 3 items
drwxr-xr-x - cloudera supergroup 0 2016-06-12 18:49 /user/hive/warehouse/ratings/.metadata
drwxr-xr-x - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/.signals
drwxrwxrwx - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970
[cloudera@quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/year=1970/
Found 1 items
drwxrwxrwx - cloudera supergroup 0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970/month=01
What is the correct way to do this kind of partitioned import, so that all year and all month partitions get created?
Answer 0 (score: 0)
Append three zeros to the end of each timestamp. Kite's year and month partitioners read the source timestamp as epoch milliseconds, but MovieLens timestamps are in epoch seconds, so every value is interpreted as a moment in January 1970. Appending three zeros (i.e. multiplying by 1000) converts seconds to milliseconds.
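A quick sanity check of that interpretation, using a hypothetical but realistically sized timestamp (1112486027 is an illustration, not a value taken from the dataset):
# 1112486027 read as milliseconds is only ~1112486 seconds after the epoch
date -ud "@$((1112486027 / 1000))"   # prints a date of Jan 13 1970 (UTC)
# read correctly as seconds, the same value lands in April 2005
date -ud "@1112486027"               # prints a date in Apr 2005 (UTC)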
Use the following shell script to do this:
#!/bin/bash
# add the CSV header to both files
head -n 1 ratings.csv > ratings_1.csv
head -n 1 ratings.csv > ratings_2.csv
# output the first 10,000,000 rows to ratings_1.csv
# this includes the header, and uses tail to remove it
# append "000" to the timestamp (4th CSV column): seconds -> milliseconds
head -n 10000001 ratings.csv | tail -n +2 | awk -F, -v OFS=, '{ $4 = $4 "000"; print }' >> ratings_1.csv
# output the rest of the file to ratings_2.csv
# this starts at the line after the ratings_1 file stopped
tail -n +10000002 ratings.csv | awk -F, -v OFS=, '{ $4 = $4 "000"; print }' >> ratings_2.csv
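The two fixed files can then be imported into the same dataset, reusing the commands from the question, and the partition layout re-checked:
kite-dataset csv-import ratings_1.csv ratings
kite-dataset csv-import ratings_2.csv ratings
hadoop fs -ls /user/hive/warehouse/ratings/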
I ran into this issue as well, and it was resolved after adding the 3 zeros.