Spark Connector MongoDB - Python API

Date: 2016-10-12 14:46:39

Tags: python mongodb scala apache-spark pyspark

I'd like to pull data out of Mongo via Spark, especially with PySpark. I found the official MongoDB guide: https://docs.mongodb.com/spark-connector/python-api/

I have all the prerequisites:

  • Scala 2.11.8
  • Spark 1.6.2
  • MongoDB 3.0.8 (not on the same machine as Spark)

    $ pyspark --conf "spark.mongodb.input.uri=mongodb://mongo1:27019/xxx.xxx?readPreference=primaryPreferred" --packages org.mongodb.spark:mongo-spark-connector_2.11:1.1.0

Then I run this code:

df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
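
For reference, here is a fuller sketch of the same read with the connector options set programmatically rather than on the command line. This is a minimal example under assumptions: the host mongo1:27019 comes from the question, xxx.xxx is the question's placeholder for <database>.<collection>, and the connector jar still has to be supplied (e.g. via --packages):

    # Minimal PySpark 1.6.x sketch; assumes mongo-spark-connector is on the
    # classpath (e.g. launched with --packages). "xxx.xxx" is the question's
    # placeholder for <database>.<collection>.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    conf = (SparkConf()
            .setAppName("mongo-read")
            .set("spark.mongodb.input.uri",
                 "mongodb://mongo1:27019/xxx.xxx?readPreference=primaryPreferred"))
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
    df.printSchema()  # the connector infers the schema by sampling documents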

And PySpark shows me this:

16/10/12 16:40:51 INFO HiveContext: Initializing execution hive, version 1.2.1
16/10/12 16:40:51 INFO ClientWrapper: Inspected Hadoop version: 2.6.0
16/10/12 16:40:51 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0
16/10/12 16:40:51 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/10/12 16:40:51 INFO ObjectStore: ObjectStore, initialize called
16/10/12 16:40:51 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/10/12 16:40:51 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/10/12 16:40:51 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/10/12 16:40:51 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/10/12 16:40:53 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/10/12 16:40:53 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/10/12 16:40:53 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/10/12 16:40:54 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/10/12 16:40:54 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/10/12 16:40:54 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/10/12 16:40:54 INFO ObjectStore: Initialized ObjectStore
16/10/12 16:40:55 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/10/12 16:40:55 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/10/12 16:40:55 INFO HiveMetaStore: Added admin role in metastore
16/10/12 16:40:55 INFO HiveMetaStore: Added public role in metastore
16/10/12 16:40:55 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/10/12 16:40:55 INFO HiveMetaStore: 0: get_all_databases
16/10/12 16:40:55 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_all_databases   
16/10/12 16:40:55 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/10/12 16:40:55 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_functions: db=default pat=* 
16/10/12 16:40:55 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
16/10/12 16:40:55 INFO SessionState: Created local directory: /tmp/8733297b-e0d2-49cf-8557-62c8c4e7cc4a_resources
16/10/12 16:40:55 INFO SessionState: Created HDFS directory: /tmp/hive/root/8733297b-e0d2-49cf-8557-62c8c4e7cc4a
16/10/12 16:40:55 INFO SessionState: Created local directory: /tmp/root/8733297b-e0d2-49cf-8557-62c8c4e7cc4a
16/10/12 16:40:55 INFO SessionState: Created HDFS directory: /tmp/hive/root/8733297b-e0d2-49cf-8557-62c8c4e7cc4a/_tmp_space.db
16/10/12 16:40:55 INFO HiveContext: default warehouse location is /user/hive/warehouse
16/10/12 16:40:55 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
16/10/12 16:40:55 INFO ClientWrapper: Inspected Hadoop version: 2.6.0
16/10/12 16:40:55 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0
16/10/12 16:40:56 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/10/12 16:40:56 INFO ObjectStore: ObjectStore, initialize called
16/10/12 16:40:56 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/10/12 16:40:56 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/10/12 16:40:56 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/10/12 16:40:56 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/10/12 16:40:57 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/10/12 16:40:58 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/10/12 16:40:58 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/10/12 16:40:59 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/10/12 16:40:59 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/10/12 16:40:59 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
16/10/12 16:40:59 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/10/12 16:40:59 INFO ObjectStore: Initialized ObjectStore
16/10/12 16:40:59 INFO HiveMetaStore: Added admin role in metastore
16/10/12 16:40:59 INFO HiveMetaStore: Added public role in metastore
16/10/12 16:40:59 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/10/12 16:40:59 INFO HiveMetaStore: 0: get_all_databases
16/10/12 16:40:59 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_all_databases   
16/10/12 16:40:59 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/10/12 16:40:59 INFO audit: ugi=root  ip=unknown-ip-addr  cmd=get_functions: db=default pat=* 
16/10/12 16:40:59 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
16/10/12 16:40:59 INFO SessionState: Created local directory: /tmp/cc4f12a5-e5b2-4a22-a240-04e1ca3727ae_resources
16/10/12 16:40:59 INFO SessionState: Created HDFS directory: /tmp/hive/root/cc4f12a5-e5b2-4a22-a240-04e1ca3727ae
16/10/12 16:40:59 INFO SessionState: Created local directory: /tmp/root/cc4f12a5-e5b2-4a22-a240-04e1ca3727ae
16/10/12 16:40:59 INFO SessionState: Created HDFS directory: /tmp/hive/root/cc4f12a5-e5b2-4a22-a240-04e1ca3727ae/_tmp_space.db
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 139, in load
    return self._df(self._jreader.load())
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o24.load.
: java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
    at com.mongodb.spark.config.MongoCompanionConfig$class.getOptionsFromConf(MongoCompanionConfig.scala:209)
    at com.mongodb.spark.config.ReadConfig$.getOptionsFromConf(ReadConfig.scala:39)
    at com.mongodb.spark.config.MongoCompanionConfig$class.apply(MongoCompanionConfig.scala:101)
    at com.mongodb.spark.config.ReadConfig$.apply(ReadConfig.scala:39)
    at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:67)
    at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:50)
    at com.mongodb.spark.sql.DefaultSource.createRelation(DefaultSource.scala:36)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

I have already tried many of the possible options for pulling data out of Mongo via Spark. Any hints?

1 answer:

Answer 0 (score: 1):

This looks like the error I get when I use code compiled against a different version of Scala. Have you tried running it with --packages org.mongodb.spark:mongo-spark-connector_2.10:1.1.0?
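
As a sketch, the corrected launch only swaps the _2.11 connector artifact for the _2.10 one; the URI is unchanged from the question:

    $ pyspark --conf "spark.mongodb.input.uri=mongodb://mongo1:27019/xxx.xxx?readPreference=primaryPreferred" --packages org.mongodb.spark:mongo-spark-connector_2.10:1.1.0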

By default, Spark 1.6.x is compiled against Scala 2.10; you would have to build it for Scala 2.11 manually, like this:

./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package
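
Either way, you can confirm which Scala version your Spark build targets before choosing the connector artifact; the version banner printed by spark-submit includes it (the output below is illustrative):

    $ spark-submit --version
    # the banner ends with a line such as:
    # Using Scala version 2.10.5, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_...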