I am trying to run a simple GraphFrame example with pyspark.
Spark version: 2.0
graphframes version: 0.2.0
I can import graphframes in Jupyter:
from graphframes import GraphFrame
GraphFrame
graphframes.graphframe.GraphFrame
When I try to create a GraphFrame object, I get this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-23-2bf19c66804d> in <module>()
----> 1 gr_links = GraphFrame(df_web_page, df_parent_child_link)
/Users/roopal/software/graphframes-release-0.2.0/python/graphframes/graphframe.pyc in __init__(self, v, e)
60 self._sc = self._sqlContext._sc
61 self._sc._jvm.org.apache.spark.ml.feature.Tokenizer()
---> 62 self._jvm_gf_api = _java_api(self._sc)
63 self._jvm_graph = self._jvm_gf_api.createGraph(v._jdf, e._jdf)
64
/Users/roopal/software/graphframes-release-0.2.0/python/graphframes/graphframe.pyc in _java_api(jsc)
32 def _java_api(jsc):
33 javaClassName = "org.graphframes.GraphFramePythonAPI"
---> 34 return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \
35 .newInstance()
36
/Users/roopal/software/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
931 answer = self.gateway_client.send_command(command)
932 return_value = get_return_value(
--> 933 answer, self.gateway_client, self.target_id, self.name)
934
935 for temp_arg in temp_args:
/Users/roopal/software/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
/Users/roopal/software/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
310 raise Py4JJavaError(
311 "An error occurred while calling {0}{1}{2}.\n".
--> 312 format(target_id, ".", name), value)
313 else:
314 raise Py4JError(
Py4JJavaError: An error occurred while calling o138.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
My guess is that the Python code is trying to load a Java class (from a jar) and cannot find it. Any suggestions on how to fix this?
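One way to check that guess is to look for the missing class inside the jars Spark can see. The following is a minimal sketch using only the Python 3 standard library; the helper name `jars_containing` is hypothetical, and the idea that the relevant jars live in one directory (for example the `jars` folder of the Spark install) is an assumption.

```python
import glob
import os
import zipfile

# The class named in the ClassNotFoundException, written as a path inside a jar.
CLASS_ENTRY = "org/graphframes/GraphFramePythonAPI.class"


def jars_containing(jar_dir, class_entry=CLASS_ENTRY):
    """Return the sorted jar paths under jar_dir that contain class_entry."""
    hits = []
    for jar_path in sorted(glob.glob(os.path.join(jar_dir, "*.jar"))):
        try:
            # A jar is a zip archive, so its entries can be listed directly.
            with zipfile.ZipFile(jar_path) as jar:
                if class_entry in jar.namelist():
                    hits.append(jar_path)
        except zipfile.BadZipFile:
            # Skip files that are not valid zip archives.
            continue
    return hits
```

An empty result for the directory you pass in would suggest that no jar there provides the class.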
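Answer 0 (score: 10)
Depending on your Spark version, all you have to do is download the graphframes jar that matches it from https://spark-packages.org/package/graphframes/graphframes.
Then copy the downloaded jar into your Spark jars directory: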
root@93d8398b53f2:/usr/local/spark/jars# wget http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar
Here is the small trick: start pyspark a first time with these arguments so that all of the graphframes jar dependencies get downloaded:
root@93d8398b53f2:/usr/local/spark/bin# pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --jars graphframes-0.3.0-spark2.0-s_2.11.jar
You should see this output:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found graphframes#graphframes;0.3.0-spark2.0-s_2.11 in spark-packages
found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central
found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central
found org.scala-lang#scala-reflect;2.11.0 in central
found org.slf4j#slf4j-api;1.7.7 in central
downloading http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.3.0-spark2.0-s_2.11/graphframes-0.3.0-spark2.0-s_2.11.jar ...
[SUCCESSFUL ] graphframes#graphframes;0.3.0-spark2.0-s_2.11!graphframes.jar (269ms)
downloading https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging-api_2.11/2.1.2/scala-logging-api_2.11-2.1.2.jar ...
[SUCCESSFUL ] com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2!scala-logging-api_2.11.jar (53ms)
downloading https://repo1.maven.org/maven2/com/typesafe/scala-logging/scala-logging-slf4j_2.11/2.1.2/scala-logging-slf4j_2.11-2.1.2.jar ...
[SUCCESSFUL ] com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2!scala-logging-slf4j_2.11.jar (66ms)
downloading https://repo1.maven.org/maven2/org/scala-lang/scala-reflect/2.11.0/scala-reflect-2.11.0.jar ...
[SUCCESSFUL ] org.scala-lang#scala-reflect;2.11.0!scala-reflect.jar (1409ms)
downloading https://repo1.maven.org/maven2/org/slf4j/slf4j-api/1.7.7/slf4j-api-1.7.7.jar ...
[SUCCESSFUL ] org.slf4j#slf4j-api;1.7.7!slf4j-api.jar (53ms)
:: resolution report :: resolve 6161ms :: artifacts dl 1877ms
:: modules in use:
com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from central in [default]
com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from central in [default]
graphframes#graphframes;0.3.0-spark2.0-s_2.11 from spark-packages in [default]
org.scala-lang#scala-reflect;2.11.0 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 5 | 5 | 5 | 0 || 5 | 5 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
5 artifacts copied, 0 already retrieved (4713kB/39ms)
Warning: Local jar /usr/local/spark-2.0.0-bin-hadoop2.7/bin/graphframes-0.3.0-spark2.0-s_2.11.jar does not exist, skipping.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/17 15:43:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/17 15:43:54 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.0.0
/_/
Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkSession available as 'spark'.
>>>
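This means all the required dependencies have been downloaded. The important detail here is that the Ivy default cache is set to /root/.ivy2/cache and that the jars are stored in /root/.ivy2/jars.
You can exit at this point. If you carry on and run Python code that calls GraphFrame, it raises this error: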
Py4JJavaError: An error occurred while calling o561.newInstance.
: java.lang.NoClassDefFoundError: Could not initialize class org.graphframes.GraphFrame.
Let's look at what is in the /root/.ivy2/jars directory:
root@93d8398b53f2:/usr/local/spark/bin# ls /root/.ivy2/jars/
com.typesafe.scala-logging_scala-logging-api_2.11-2.1.2.jar com.typesafe.scala-logging_scala-logging-slf4j_2.11-2.1.2.jar graphframes_graphframes-0.3.0-spark2.0-s_2.11.jar org.scala-lang_scala-reflect-2.11.0.jar org.slf4j_slf4j-api-1.7.7.jar
Now copy all the .jar files that appear in /root/.ivy2/jars into your Spark jars directory:
root@93d8398b53f2:/usr/local/spark/jars# cp /root/.ivy2/jars/* .
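If you would rather script this copy step, the following is a minimal Python sketch of the same idea. The function name `copy_ivy_jars` is hypothetical, and the two directories are passed in by the caller because the paths above are specific to this container.

```python
import glob
import os
import shutil


def copy_ivy_jars(ivy_jars_dir, spark_jars_dir):
    """Copy every .jar file from ivy_jars_dir into spark_jars_dir.

    Returns the copied file names in sorted order.
    """
    copied = []
    for src in sorted(glob.glob(os.path.join(ivy_jars_dir, "*.jar"))):
        # copy2 keeps the file name when the destination is a directory.
        shutil.copy2(src, spark_jars_dir)
        copied.append(os.path.basename(src))
    return copied
```

With the paths from this answer, the call would be `copy_ivy_jars("/root/.ivy2/jars", "/usr/local/spark/jars")`.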
Start pyspark a second time:
root@93d8398b53f2:/usr/local/spark/jars# pyspark --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --jars graphframes-0.3.0-spark2.0-s_2.11.jar
You should see this output:
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-2.0.0-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found graphframes#graphframes;0.3.0-spark2.0-s_2.11 in spark-packages
found com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 in central
found com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 in central
found org.scala-lang#scala-reflect;2.11.0 in central
found org.slf4j#slf4j-api;1.7.7 in central
:: resolution report :: resolve 748ms :: artifacts dl 27ms
:: modules in use:
com.typesafe.scala-logging#scala-logging-api_2.11;2.1.2 from central in [default]
com.typesafe.scala-logging#scala-logging-slf4j_2.11;2.1.2 from central in [default]
graphframes#graphframes;0.3.0-spark2.0-s_2.11 from spark-packages in [default]
org.scala-lang#scala-reflect;2.11.0 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 5 | 0 | 0 | 0 || 5 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 5 already retrieved (0kB/24ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/17 15:53:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/17 15:53:03 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.0.0
/_/
Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkSession available as 'spark'.
>>>
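You can now use GraphFrame:
>>> # Create a Vertex DataFrame with a unique ID column "id"
... v = sqlContext.createDataFrame([
... ("a", "Alice", 34),
... ("b", "Bob", 36),
... ("c", "Charlie", 30),
... ], ["id", "name", "age"])
>>> # Create an Edge DataFrame with "src" and "dst" columns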
... e = sqlContext.createDataFrame([
... ("a", "b", "friend"),
... ("b", "c", "follow"),
... ("c", "b", "follow"),
... ], ["src", "dst", "relationship"])
>>> # Create a GraphFrame
... from graphframes import *
>>> g = GraphFrame(v, e)
>>>
>>> # Query: Get in-degree of each vertex.
... g.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
| c| 1|
| b| 2|
+---+--------+
>>>
>>> # Query: Count the number of "follow" connections in the graph.
... g.edges.filter("relationship = 'follow'").count()
2
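>>> # Query: Run PageRank and show the results.
... results = g.pageRank(resetProbability=0.01, maxIter=20)
>>> results.vertices.select("id", "pagerank").show()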
16/11/17 16:03:45 WARN Executor: 1 block locks were not released by TID = 9059:
[rdd_337_0]
16/11/17 16:03:45 WARN Executor: 1 block locks were not released by TID = 9060:
[rdd_337_1]
+---+-------------------+
| id| pagerank|
+---+-------------------+
| a| 0.01|
| b| 0.2808611427228327|
| c|0.27995525261339177|
+---+-------------------+
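Answer 1 (score: 1)
The simplest way to use Jupyter with pyspark and graphframes is to start Jupyter from pyspark with the matching package.
Just open a terminal, set two environment variables, and start pyspark with the graphframes package: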
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
A further advantage is that you can reuse the same launch parameters later if you want to run the code through spark-submit.
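The same launch can be driven from Python. This is a minimal sketch that only builds the environment and the argument list; the helper name `jupyter_pyspark_launch` is hypothetical, and it assumes `pyspark` is on your PATH.

```python
import os


def jupyter_pyspark_launch(package, base_env=None):
    """Return (env, argv) for starting pyspark with Jupyter as the driver.

    package is a coordinate such as
    "graphframes:graphframes:0.6.0-spark2.3-s_2.11".
    """
    # Start from a copy so the caller's environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    argv = ["pyspark", "--packages", package]
    return env, argv
```

You could then pass the result to `subprocess.call(argv, env=env)` to start the notebook server.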
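Answer 2 (score: 0)
Make sure PYSPARK_SUBMIT_ARGS in your kernel.json (~/.ipython/kernels//kernel.json) has been updated to include "--packages graphframes:graphframes:0.2.0-spark2.0".
You may already have looked at the following link. It has more details on the Jupyter setup. Basically, pyspark must be given graphframes.jar.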
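As a rough illustration of that kernel.json change, here is a minimal Python sketch that sets the variable in the `env` section of a kernel spec. The function name `set_submit_args` is hypothetical, and the sketch assumes your kernel.json is plain JSON that you read and write yourself.

```python
import json


def set_submit_args(kernel_json_text, submit_args):
    """Return kernel.json text with env.PYSPARK_SUBMIT_ARGS set to submit_args."""
    spec = json.loads(kernel_json_text)
    # Create the "env" section if the kernel spec does not have one yet.
    env = spec.setdefault("env", {})
    env["PYSPARK_SUBMIT_ARGS"] = submit_args
    return json.dumps(spec, indent=2, sort_keys=True)
```

Other keys in the spec, such as `argv` and `display_name`, are left unchanged.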
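Answer 3 (score: 0)
A follow-up to @Gilles Essoki's solution. Make sure your environment has the right Spark version and Scala version.
graphframes:(latest version)-spark(your Spark version)-s_(your Scala version)
Once I had the right versions, I did not have to specify the jar file or copy it to Spark's default jars directory. Note: run the 'spark-shell' command to get the right versions for this setup.
%spark-shell
...
...
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
For my environment, I had to use the following command: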
%pyspark --packages graphframes:graphframes:0.3.0-spark1.6-s_2.10
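The naming pattern above can be sketched as a small Python helper. Both function names are hypothetical, and the sketch only formats the string: whether a given combination was actually published still has to be checked on the spark-packages page.

```python
def major_minor(version):
    """Reduce a version such as "1.6.0" to its major.minor part, "1.6"."""
    return ".".join(version.split(".")[:2])


def graphframes_coordinate(graphframes_version, spark_version, scala_version):
    """Build a --packages coordinate from the three version strings."""
    return "graphframes:graphframes:{0}-spark{1}-s_{2}".format(
        graphframes_version,
        major_minor(spark_version),
        major_minor(scala_version),
    )
```

With the versions printed by the spark-shell banner above, `graphframes_coordinate("0.3.0", "1.6.0", "2.10.5")` gives the coordinate used in this answer.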
Answer 4 (score: 0)
For PyCharm, go to the configuration and add an environment variable:
Name: PYSPARK_SUBMIT_ARGS
Value: --packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell
I found that it did not work for me without the trailing pyspark-shell.
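Outside PyCharm, the same variable can be set from Python itself. This is a minimal sketch with a hypothetical helper name; it assumes it runs before pyspark starts its JVM (that is, before any SparkContext or SparkSession is created), since the variable is read at launch.

```python
import os


def set_pyspark_submit_args(package, environ=None):
    """Set PYSPARK_SUBMIT_ARGS for the given package and return the value.

    The trailing "pyspark-shell" follows the value used in this answer.
    """
    environ = os.environ if environ is None else environ
    environ["PYSPARK_SUBMIT_ARGS"] = "--packages {0} pyspark-shell".format(package)
    return environ["PYSPARK_SUBMIT_ARGS"]
```

Call it at the top of the script, for example `set_pyspark_submit_args("graphframes:graphframes:0.2.0-spark2.0-s_2.11")`, and import and start pyspark afterwards.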