I need to read a text file into a Dataset[T] in Spark. The file is badly formatted: some of its fields are blank, which makes it hard to define a delimiter to split the strings on. I have been trying to read the data into an RDD, split each line, and convert it to a case class type, but not all of the fields are parsed correctly and I get the following error:
<div class="col-md-3 col-sm-3 col-cs-12 left-sidebar">
<div class="input-group searchbox">
<div class="input-group-btn">
<center><a href="find_friends.php"><button id="" class="btn btn-default search-icon" name="search_user" type="submit">Add new user</button></a></center>
</div>
</div>
<div class="left-chat">
<ul>
<li>
<div class='chat-left-img'> <img src='$user_profilepic'>
</div>
<div class='chat-left-details'>
<a href='home.php?user_name=$user_name'>$user_name</a>
<span style='font-size: 12px; color: #5D5C5C;'>(You)
</span><br>
</div>
</li>
</ul>
</div>
java.lang.NumberFormatException: empty String
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1842)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at scala.collection.immutable.StringLike.toDouble(StringLike.scala:321)
at scala.collection.immutable.StringLike.toDouble$(StringLike.scala:321)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:33)
at captify.test.spark.Stats$$anonfun$2.apply(Stats.scala:53)
at captify.test.spark.Stats$$anonfun$2.apply(Stats.scala:53)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:255)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:836)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:836)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
What can I do to parse this file correctly? My .txt file looks like this (anonymized random data, but the format is the same):
NEW50752085 84.0485 -76.3851 85.1 THE NAME OF AN OBJECT
DEM00752631 51.9581 -85.3315 98.5 THE NAME OF AN OBJECT
KI004867205 40.8518 15.9351 276.5 THE NAME OF AN OBJECT FHG 41196
Answer 0 (score: 1)
I am going to make a few assumptions in this answer that may not be correct, but based on the data and the error you have provided, I believe them to be.
Your NumberFormatException is caused by splitting on a single space here:
.map(_.split(" "))
Assuming the following line is delimited by single spaces:
NEW50752085 84.0485 -76.3851 85.1 THE NAME OF AN OBJECT
splitting on one space turns it into the following array:
Array(NEW50752085, "", "", "", 84.0485, "", "", "", -76.3851, "", "", "", 85.1, "", "", "", THE, NAME, OF, AN, OBJECT)
The second element of this array is an empty string, and that is the element you are trying to convert to a Double. It is what gives you the empty String NumberFormatException.
When you change it to split on four spaces instead (which, given my assumptions, may or may not be correct), you get the following:
Array(NEW50752085, 84.0485, -76.3851, 85.1, THE NAME OF AN OBJECT)
But now we run into another problem: this array only has five elements, and we want seven.
We can fix that by modifying your subsequent code:
val df = dataArray.map(record => {
  (record(0), record(1).toDouble, record(2).toDouble, record(3).toDouble, record(4),
    if (record.size > 5) record(5) else "",
    if (record.size > 6) record(6) else "")
}).map { case (c1, c2, c3, c4, c5, c6, c7) => caseClass(c1, c2, c3, c4, c5, c6, c7) }.toDF
df.show
+-----------+-------+--------+----+--------------------+---+-----+
| c1| c2| c3| c4| c5| c6| c7|
+-----------+-------+--------+----+--------------------+---+-----+
|NEW50752085|84.0485|-76.3851|85.1|THE NAME OF AN OB...| | |
|DEM00752631|51.9581|-85.3315|98.5|THE NAME OF AN OB...| | |
|KI004867205|40.8518| 15.9351|76.5|THE NAME OF AN OB...|FHG|41196|
+-----------+-------+--------+----+--------------------+---+-----+
Again, this only works if every element is delimited by the same number of spaces.
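Since the question asks for a Dataset[T] rather than a DataFrame, the same mapping can also end in .toDS. A minimal sketch, assuming a hypothetical case class (the real field names are not shown in the question), the dataArray RDD from the code above, and a SparkSession named spark:

// Hypothetical field names; adjust to match your actual case class
case class Record(id: String, lat: Double, lon: Double, value: Double,
                  name: String, code: String, number: String)

import spark.implicits._

val ds = dataArray.map { record =>
  Record(record(0), record(1).toDouble, record(2).toDouble, record(3).toDouble, record(4),
    if (record.size > 5) record(5) else "",
    if (record.size > 6) record(6) else "")
}.toDS()
// ds is a Dataset[Record], so downstream code works with typed fields instead of Rows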
Answer 1 (score: -1)
If your data does not have a well-defined format that Spark can read directly, your only option is to use a FileInputFormat.
That way you get to define the parsing flow for each line of your data, deciding how it is split and how edge cases are handled.
The best way to dig into this is by example. This one is quite solid: https://www.ae.be/blog-en/ingesting-data-spark-using-custom-hadoop-fileinputformat/
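For orientation, here is a rough sketch of how a custom Hadoop input format is wired into Spark. The format class below is hypothetical and simply delegates to LineRecordReader; the per-line splitting and edge-case handling described above would live in your own RecordReader (see the linked post for a complete example). The file path and the SparkSession named spark are placeholders:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, LineRecordReader}

// Skeleton input format: returns plain lines for now; replace LineRecordReader
// with a custom RecordReader to control how each raw record is parsed.
class CustomLineInputFormat extends FileInputFormat[LongWritable, Text] {
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[LongWritable, Text] =
    new LineRecordReader()
}

// Plugging it into Spark: yields an RDD of (byte offset, line) pairs
val lines = spark.sparkContext.newAPIHadoopFile(
  "/path/to/file.txt",
  classOf[CustomLineInputFormat],
  classOf[LongWritable],
  classOf[Text]
).map { case (_, text) => text.toString }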