Tagging a Spark RDD

Date: 2016-10-18 06:56:08

Tags: scala apache-spark hbase rdd tagging

We have publication data that we want to tag with various categories. We have stored it in HBase, loaded it into a Spark RDD, and now want to tag it with Scala code. A sample of the HBase data looks like this:

PubEntity:Abstract                  timestamp=1476537886382, value=not found                                                                  
 PubEntity:Affiliations              timestamp=1476537886382, value=[]                                                                         
 PubEntity:Article_Title             timestamp=1476537886382, value=Formate assay in body fluids: application in methanol poisoning.           
 PubEntity:Author                    timestamp=1476537886382, value=[{'LastName': 'Makar', 'ForeName': 'A B', 'author_name': 'A B Makar', 'Init
                                     ials': 'AB', 'author_affiliation': 'not found'}, {'LastName': 'McMartin', 'ForeName': 'K E', 'author_name'
                                     : 'K E McMartin', 'Initials': 'KE', 'author_affiliation': 'not found'}, {'LastName': 'Palese', 'ForeName':
                                      'M', 'author_name': 'M Palese', 'Initials': 'M', 'author_affiliation': 'not found'}, {'LastName': 'Tephly
                                     ', 'ForeName': 'T R', 'author_name': 'T R Tephly', 'Initials': 'TR', 'author_affiliation': 'not found'}]  
 PubEntity:Journal_Title             timestamp=1476537886382, value=Biochemical medicine                                                       
 PubEntity:PMID                      timestamp=1476537886382, value=1                                                                          
 PubRemaining:Countries              timestamp=1476537886382, value=[]                                                                         
 PubRemaining:Created_At             timestamp=1476537886382, value=170812800.0                                                                
 PubRemaining:DOI                    timestamp=1476537886382, value=not found                                                                  
 PubRemaining:Date_Created           timestamp=1476537886382, value=19760116                                                                   
 PubRemaining:ISO_Abbreviation       timestamp=1476537886382, value=Biochem Med                                                                
 PubRemaining:ISSN                   timestamp=1476537886382, value=0006-2944                                                                  
 PubRemaining:Pub_Date               timestamp=1476537886382, value=01 Jun, 1975                                                               
 PubRemaining:Year                   timestamp=1476537886382, value=1975 

We have been able to read the HBase data into a Spark RDD by following the first answer to How to read from hbase using spark. Here is the code:

import org.apache.hadoop.hbase.client.{HBaseAdmin, Result}
import org.apache.hadoop.hbase.{ HBaseConfiguration, HTableDescriptor }
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable

import org.apache.spark._

object HBaseRead {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("HBaseRead").setMaster("local[2]").set("spark.driver.allowMultipleContexts", "true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    val tableName = "name_of_the_database"

    System.setProperty("user.name", "hdfs")
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    conf.set("hbase.master", "localhost:60000")
    conf.setInt("timeout", 120000)
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.set(TableInputFormat.INPUT_TABLE, tableName) 

    val admin = new HBaseAdmin(conf)
    if (!admin.isTableAvailable(tableName)) {
      val tableDesc = new HTableDescriptor(tableName)
      admin.createTable(tableDesc)
    }

    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
    println(" Number of Records found : " + hBaseRDD.count())
    println(hBaseRDD.first())
    sc.stop()
  }
}
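To see the stored words/numbers instead of vlen=..., the cells of each Result have to be decoded from bytes, e.g. inside hBaseRDD.map via result.getValue(Bytes.toBytes("PubEntity"), Bytes.toBytes("Article_Title")) and Bytes.toString from org.apache.hadoop.hbase.util. Stripped of the HBase and Spark dependencies, the decoding step is just UTF-8 conversion of the stored byte arrays; a minimal sketch (the Row type and sample data below are stand-ins for a real Result, not part of the HBase API):

```scala
import java.nio.charset.StandardCharsets

object DecodeCells {
  // A row stands in for an HBase Result: (family, qualifier) -> raw bytes.
  // In the real RDD you would obtain the bytes via result.getValue(...).
  type Row = Map[(String, String), Array[Byte]]

  // Decode one cell to a readable String (what Bytes.toString does for text values).
  def cellAsString(row: Row, family: String, qualifier: String): Option[String] =
    row.get((family, qualifier)).map(new String(_, StandardCharsets.UTF_8))

  def main(args: Array[String]): Unit = {
    val row: Row = Map(
      ("PubEntity", "Article_Title") ->
        "Formate assay in body fluids: application in methanol poisoning.".getBytes(StandardCharsets.UTF_8),
      ("PubRemaining", "Year") -> "1975".getBytes(StandardCharsets.UTF_8)
    )
    println(cellAsString(row, "PubEntity", "Article_Title").getOrElse("not found"))
    println(cellAsString(row, "PubRemaining", "Year").getOrElse("not found"))
  }
}
```

(As an aside, mvcc in the scan output refers to HBase's multiversion concurrency control sequence number; it is internal bookkeeping and does not affect the stored value.)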

After running this block in the Scala shell I got: defined module HBaseRead. I then ran HBaseRead.main(Array()), which printed the number of records found and the first record:

(31,keyvalues={1/PubEntity:Abstract/1476537886382/Put/vlen=9/mvcc=0, 1/PubEntity:Affiliations/1476537886382/Put/vlen=2/mvcc=0, 1/PubEntity:Article_Title/1476537886382/Put/vlen=64/mvcc=0, 1/PubEntity:Author/1476537886382/Put/vlen=497/mvcc=0, 1/PubEntity:Journal_Title/1476537886382/Put/vlen=20/mvcc=0, 1/PubEntity:PMID/1476537886382/Put/vlen=1/mvcc=0, 1/PubRemaining:Countries/1476537886382/Put/vlen=2/mvcc=0, 1/PubRemaining:Created_At/1476537886382/Put/vlen=11/mvcc=0, 1/PubRemaining:DOI/1476537886382/Put/vlen=9/mvcc=0, 1/PubRemaining:Date_Created/1476537886382/Put/vlen=8/mvcc=0, 1/PubRemaining:ISO_Abbreviation/1476537886382/Put/vlen=11/mvcc=0, 1/PubRemaining:ISSN/1476537886382/Put/vlen=9/mvcc=0, 1/PubRemaining:Pub_Date/1476537886382/Put/vlen=12/mvcc=0, 1/PubRemaining:Year/1476537886382/Put/vlen=4/mvcc=0})

In this output you will notice entries like vlen=12/mvcc=0. On inspecting the data I found that vlen is the byte length of each value, but I could not figure out what mvcc is for. We would like the output to show the actual words/numbers instead of vlen=4. Beyond that, we want to read these entries, search them for certain words and phrases, and tag them accordingly. All in Scala.
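For the tagging step, one common approach is plain keyword matching: check each decoded text field against per-category keyword lists and collect the categories that hit. A minimal, Spark-free sketch (the category names and keyword lists are made up for illustration; in the job this function would be applied to the decoded fields inside an RDD map):

```scala
object Tagger {
  // Hypothetical category -> keyword lists; replace with your own taxonomy.
  val categories: Map[String, Seq[String]] = Map(
    "toxicology"   -> Seq("poisoning", "toxin", "overdose"),
    "biochemistry" -> Seq("assay", "enzyme", "formate")
  )

  // Return every category whose keywords appear in the text (case-insensitive).
  def tag(text: String): Set[String] = {
    val lower = text.toLowerCase
    categories.collect {
      case (cat, words) if words.exists(lower.contains) => cat
    }.toSet
  }

  def main(args: Array[String]): Unit = {
    val title = "Formate assay in body fluids: application in methanol poisoning."
    println(Tagger.tag(title))  // this sample title matches both categories
  }
}
```

Substring matching is crude (it would tag "assayed" and "overdosed" too); for anything more serious, tokenize first or use word-boundary regexes.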

Links to any helpful online resources would be highly appreciated.

0 Answers:

No answers yet