Spark Scala exception java.net.MalformedURLException: no protocol

Time: 2018-02-05 20:37:51

Tags: scala apache-spark

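I have an RDD containing an edge list whose entries are comma-separated pairs of the form (source_URL, destination_URL). I need to extract the source host from source_URL. I tried the following code: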

val edges = links.flatMap { case (src, dst) =>
  if (!src.startsWith("http://") || !src.startsWith("https://")) {
    val src_url = "http://" + src
    val url = new java.net.URL(src_url)
    val uri = url.getHost
    scala.util.Try {
      Some(uri, dst)
    }.getOrElse(None)
  } else {
    val src_url = src
    val url = new java.net.URL(src_url)
    val uri = url.getHost
    scala.util.Try {
      Some(uri, dst)
    }.getOrElse(None)
  }
}

Sample input:

http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/content.php/coID/299/kundenmeinungen.html
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/weingueter
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/filter/cID/10/country/suedafrika.137.html

Required output:

www.belvini.de,http://www.belvini.de/content.php/coID/299/kundenmeinungen.html
www.belvini.de,http://www.belvini.de/weingueter
www.belvini.de,http://www.belvini.de/filter/cID/10/country/suedafrika.137.html

When I run the code, I get this exception:

 Job aborted due to stage failure: Task 935 in stage 3.0 failed 4 times, most recent failure: Lost task 935.3 in stage 3.0 (TID 1883, node27.ib, executor 248): 
java.net.MalformedURLException: For input string: "RC-a-shops.de"
at java.net.URL.<init>(URL.java:627)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)

The RDD has about 1 million edges, and I am running this on a cluster. Can someone suggest how to get rid of this exception?

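2 answers:

Answer 0: (score: 2)

Edit: the question was edited to include a well-formed URL in the MalformedURLException. My answer stands regardless. The documentation for URL indicates that it throws MalformedURLException only when the URL is invalid. More complete output would help in debugging this issue.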

MalformedURLException - if no protocol is specified, or an unknown protocol is found, or spec is null.
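The documented conditions can be checked directly. The following is a minimal sketch, not taken from the question: the object name `MalformedUrlCases`, the helper `failure`, and the sample strings are illustrative assumptions. The last sample is a guess at how a message such as `For input string: "RC-a-shops.de"` could arise, namely from a colon in the authority followed by text that is not a numeric port; the actual failing input in the question is not known.

```scala
import java.net.URL
import scala.util.Try

object MalformedUrlCases {
  // Hypothetical helper: returns the exception message if construction
  // fails, or None if the string is accepted by java.net.URL.
  def failure(spec: String): Option[String] =
    Try(new URL(spec)).failed.toOption.map(_.getMessage)

  def main(args: Array[String]): Unit = {
    // A well-formed URL with a known scheme is accepted.
    println(failure("http://www.belvini.de/weingueter"))
    // No scheme at all.
    println(failure("nlp-agm.php"))
    // A scheme with no registered handler.
    println(failure("foo://example.com"))
    // Assumed reproduction: non-numeric text where a port is expected.
    println(failure("http://example.com:RC-a-shops.de"))
  }
}
```

Printing the message for each rejected string, rather than only the host for accepted ones, makes it easier to see which kind of input is breaking the job.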

Your src does not appear to contain a URL protocol. You need something like

http://whatever.com/nlp-agm.php

not just nlp-agm.php

The URL format must be

<scheme>://<authority><path>?<query>#<fragment>

where <scheme> is required. If the scheme is invalid or not specified, new java.net.URL throws MalformedURLException. See: https://docs.oracle.com/javase/7/docs/api/java/net/URL.html#URL(java.lang.String)
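One way to keep a single bad record from aborting the job is to run the URL constructor itself inside `Try`, so that malformed input is dropped instead of thrown. The sketch below is only one possible approach; the names `SourceHost`, `withScheme`, `sourceHost`, and `edges` are assumptions made for illustration. It also uses `&&` in the prefix check, because `!a || !b` is true for every string (no string starts with both `http://` and `https://`), which means the question's code prepends `http://` to every source.

```scala
import java.net.URL
import scala.util.Try

object SourceHost {
  // Prepend "http://" only when neither scheme is already present.
  def withScheme(src: String): String =
    if (!src.startsWith("http://") && !src.startsWith("https://")) "http://" + src
    else src

  // The constructor call sits inside Try, so a MalformedURLException
  // becomes None. Null or empty hosts are discarded as well.
  def sourceHost(src: String): Option[String] =
    Try(new URL(withScheme(src.trim)).getHost).toOption
      .filter(h => h != null && h.nonEmpty)

  // Same shape as the flatMap in the question, shown on a plain Seq.
  def edges(links: Seq[(String, String)]): Seq[(String, String)] =
    links.flatMap { case (src, dst) => sourceHost(src).map(host => (host, dst)) }
}
```

Because `sourceHost` depends only on JDK classes, the same `flatMap` body should also work on an RDD of pairs. Note that silently dropping records hides how many were malformed; counting or logging the `None` cases may be worth adding.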

Answer 1: (score: 0)

The java.net.MalformedURLException: no protocol exception is also thrown when the string itself contains quotation marks:

new java.net.URL("\"http:www.example.com\"")
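If the input may carry literal quotation marks (for example from a quoted CSV field), stripping them before parsing avoids this case. This is a small sketch under that assumption; `QuotedUrl`, `stripQuotes`, `parses`, and `hostOf` are hypothetical names, and the sample URL is illustrative.

```scala
import java.net.URL
import scala.util.Try

object QuotedUrl {
  // Hypothetical helper: remove surrounding whitespace and one pair of
  // double quotes, if present.
  def stripQuotes(s: String): String =
    s.trim.stripPrefix("\"").stripSuffix("\"")

  // True when java.net.URL accepts the string as given.
  def parses(s: String): Boolean = Try(new URL(s)).isSuccess

  // Host of the URL after quote stripping, or None if it still fails.
  def hostOf(s: String): Option[String] =
    Try(new URL(stripQuotes(s)).getHost).toOption

  def main(args: Array[String]): Unit = {
    val quoted = "\"http://www.example.com\""
    println(parses(quoted)) // the leading quote prevents scheme detection
    println(hostOf(quoted))
  }
}
```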