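I am loading a large CSV file (50,000 rows) with Pig. However, when I try to load all 50,000 rows at once, the output contains only 32,866 rows instead of the 50,000 in the file.
If I split the CSV file into 5 files of 10,000 rows each, I get the correct total of 50,000 rows.
Do you know why this happens?
Here is the statement I use to load the data: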
A1 = LOAD 'A_1.csv'
     USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
     AS (Id: long, PostTypeId: long, AcceptedAnswerId: long, ParentId: long,
         CreationDate: chararray, DeletionDate: chararray, Score: long, ViewCount: long,
         Body: chararray, OwnerUserId: long, OwnerDisplayName: chararray,
         LastEditorUserId: long, LastEditorDisplayName: chararray, LastEditDate: chararray,
         LastActivityDate: chararray, Title: chararray, Tags: chararray,
         AnswerCount: int, CommentCount: int, FavoriteCount: int,
         ClosedDate: chararray, CommunityOwnedDate: chararray);
Here is a sample of the data:
Id,PostTypeId,AcceptedAnswerId,ParentId,CreationDate,DeletionDate,Score,ViewCount,Body,OwnerUserId,OwnerDisplayName,LastEditorUserId,LastEditorDisplayName,LastEditDate,LastActivityDate,Title,Tags,AnswerCount,CommentCount,FavoriteCount,ClosedDate,CommunityOwnedDate
"927358","1","927386","","2009-05-29 18:09:14","","16763","6080937","<p>I accidentally committed wrong files to <a href=""http://en.wikipedia.org/wiki/Git_%28software%29"" rel=""noreferrer"">Git</a>, but I haven't pushed the commit to the server yet.</p>
<p>How can I undo those commits from the local repository? </p>
","89904","","2533071","","2018-02-16 21:58:57","2018-03-09 17:15:28","How to undo the most recent commits in Git","<git><git-commit><git-reset><git-revert>","68","6","5598","","2013-03-16 10:08:31"
"2003505","1","2003515","","2010-01-05 01:12:15","","12774","5330484","<p>I want to delete a branch both locally and on my remote project fork on <a href=""http://en.wikipedia.org/wiki/GitHub"" rel=""noreferrer"">GitHub</a>.</p>
<h3>Failed Attempts to Delete Remote Branch</h3>
<pre><code>$ git branch -d remotes/origin/bugfix
error: branch 'remotes/origin/bugfix' not found.
$ git branch -d origin/bugfix
error: branch 'origin/bugfix' not found.
$ git branch -rd origin/bugfix
Deleted remote branch origin/bugfix (was 2a14ef7).
$ git push
Everything up-to-date
$ git pull
From github.com:gituser/gitproject
* [new branch] bugfix -> origin/bugfix
Already up-to-date.
</code></pre>
<p>What do I need to do differently to successfully delete the
<code>remotes/origin/bugfix</code> branch both locally and on GitHub?</p>
","95592","","4694621","","2017-12-24 14:11:39","2018-03-05 06:47:25","How do I delete a Git branch both locally and remotely?","<git><git-branch><git-remote>","40","2","4172","",""
"5585779","1","5585800","","2011-04-07 18:27:54","","2345","4927450","<p>How can I convert a <code>String</code> to an <code>int</code> in Java?</p>
<p>My String contains only numbers, and I want to return the number it represents.</p>
<p>For example, given the string <code>""1234""</code> the result should be the number <code>1234</code>.</p>
","537967","","63550","user166390","2018-01-17 23:49:28","2018-02-21 22:33:28","How do I convert a String to an int in Java?","<java><string><int><type-conversion>","31","0","409","",""
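Because the Body field in the sample above spans several physical lines, the number of physical lines in the file and the number of logical CSV records are not the same thing. A minimal Python sketch of how the two counts could be compared outside Pig follows. The function name `count_lines_and_records` and the tiny inline sample are hypothetical, and the sketch assumes the file uses standard double-quote escaping as in the sample.

```python
import csv
import io


def count_lines_and_records(text, has_header=True):
    """Return (physical_lines, logical_records) for CSV text.

    A quoted field may contain newlines, so one logical record can
    occupy several physical lines.
    """
    physical_lines = len(text.splitlines())
    # newline="" keeps embedded newlines intact for the csv module.
    reader = csv.reader(io.StringIO(text, newline=""))
    records = sum(1 for _ in reader)
    if has_header and records > 0:
        records -= 1  # do not count the header row as a record
    return physical_lines, records


# Hypothetical miniature input shaped like the sample: the first record's
# Body spans three physical lines, the second uses "" to escape a quote.
sample = (
    'Id,Body,Score\n'
    '"1","<p>first line</p>\n<p>second line</p>\n","10"\n'
    '"2","<code>""1234""</code>","20"\n'
)

lines, records = count_lines_and_records(sample)
print(lines, records)
```

Running the same function on the contents of the real file would show whether 50,000 is the count of physical lines or of multi-line records, which is the number the Pig output should be compared against.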
I would greatly appreciate any help you can offer!