Spark - Splitting a CSV file with Scala

Time: 2017-05-02 17:32:54

Tags: scala apache-spark apache-spark-mllib

I have a CSV file with the following schema:

(Id, OwnerUserId, CreationDate, ClosedDate, Score, Title, Body)

I tried to split the data with each of the following:

val splitComma = file.map(x => x.split(","))
val splitComma = file.map(x => x.split (",(?![^<>]*</>)(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

Neither of them works. Here is a sample of my CSV file:

90,58,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for TortoiseSVN?,"<p>Are there any really good tutorials explaining <a href=""http://svnbook.red-bean.com/en/1.8/svn.branchmerge.html"" rel=""nofollow"">branching and merging</a> with Apache Subversion? </p>

<p>All the better if it's specific to TortoiseSVN client.</p>
"
120,83,2008-08-01T15:50:08Z,NA,21,ASP.NET Site Maps,"<p>Has anyone got experience creating <strong>SQL-based ASP.NET</strong> site-map providers?</p>

<p>I've got the default XML file <code>web.sitemap</code> working properly with my Menu and <strong>SiteMapPath</strong> controls, but I'll need a way for the users of my site to create and modify pages dynamically.</p>

<p>I need to tie page viewing permissions into the standard <code>ASP.NET</code> membership system as well.</p>
"
180,2089740,2008-08-01T18:42:19Z,NA,53,Function for creating color wheels,"<p>This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate <code>N</code> colors, that are as distinguishable as possible where <code>N</code> is a parameter.</p>
"

What is the best way to do this?

1 answer:

Answer 0 (score: 3)

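Spark cannot load a CSV with multi-line values (i.e. line breaks inside a cell): the underlying Hadoop InputFormat splits the file on newlines and ignores the CSV's enclosing double quotes, so there is not much Spark can do about it (see the discussion here).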
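
To make the failure concrete, here is a minimal sketch in plain Scala (no Spark needed) of what a newline-only split does to a quoted multi-line field. The object name `LineSplitDemo`, the shortened sample data, and the helper `hasUnbalancedQuotes` are made up for illustration; they are not part of Spark or Hadoop.

```scala
object LineSplitDemo {
  // Two logical CSV records (shortened, hypothetical data).
  // The quoted last field of the first record spans several physical lines.
  val raw: String =
    "90,58,\"<p>first</p>\n\n<p>second</p>\n\"\n" +
    "120,83,\"<p>single</p>\"\n"

  // Mimics a line-oriented reader: split on newlines only, with no knowledge of quotes.
  val physicalLines: Array[String] = raw.split("\n")

  // A line with an odd number of double quotes opens or closes a quoted field,
  // so it cannot be a complete record on its own.
  def hasUnbalancedQuotes(line: String): Boolean =
    line.count(_ == '"') % 2 == 1
}
```

The two logical records come out as five physical lines, and the first record is cut into fragments before any per-line `split` gets to run. This is why no regular expression applied inside `file.map(...)` can put the record back together.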

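Unfortunately, this means you will have to find some way to "clean" the data (for example, replacing the newlines with a placeholder) before writing it to disk or loading it with Spark.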
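
One possible shape for such a cleaning step is sketched below in plain Scala. It assumes the file uses standard CSV quoting (fields wrapped in double quotes, literal quotes doubled as `""`) and that the raw text can be processed outside Spark before loading. The names `CsvCleaning`, `escapeNewlines` and `splitFields`, as well as the default placeholder, are hypothetical choices for this example.

```scala
object CsvCleaning {
  // Replaces line breaks that occur inside double-quoted fields with `placeholder`,
  // leaving the newlines that terminate records untouched.
  def escapeNewlines(text: String, placeholder: String = " "): String = {
    val out = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < text.length) {
      val c = text.charAt(i)
      if (c == '"') {
        // A doubled quote ("") toggles twice, so the state stays correct.
        inQuotes = !inQuotes
        out.append(c)
      } else if (inQuotes && (c == '\n' || c == '\r')) {
        // Treat \r\n as a single line break.
        if (c == '\r' && i + 1 < text.length && text.charAt(i + 1) == '\n') i += 1
        out.append(placeholder)
      } else {
        out.append(c)
      }
      i += 1
    }
    out.toString
  }

  // Splits one cleaned record on commas that are outside quotes,
  // dropping the enclosing quotes and turning "" into a literal quote.
  def splitFields(line: String): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val cur = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (c == '"') {
        if (inQuotes && i + 1 < line.length && line.charAt(i + 1) == '"') {
          cur.append('"') // doubled quote inside a quoted field
          i += 1
        } else {
          inQuotes = !inQuotes
        }
      } else if (c == ',' && !inQuotes) {
        fields += cur.toString
        cur.clear()
      } else {
        cur.append(c)
      }
      i += 1
    }
    fields += cur.toString
    fields.result()
  }
}
```

Once the file has been rewritten with `escapeNewlines`, every physical line is one complete record, so something along the lines of `file.map(CsvCleaning.splitFields)` should then be able to replace the regex-based `split` from the question. The placeholder should be a string that never occurs in the data if the original line breaks need to be restored later.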