Bulk loading in HBase Standalone

Time: 2016-05-31 17:02:28

Tags: python csv hbase

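I am unable to load data from a CSV file into standalone HBase. I am using Python with the HappyBase API, and I plan to use the MovieLens dataset. When I read the records with Python, they are displayed correctly, but no records are added to the HBase table. Please suggest what I should do. Please note four things: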

1) I referred to this gist for the code: https://gist.github.com/jarrettmeyer/26b3e1fcd423071a7a6d

2) I created a table in HBase with the following commands:

hbase(main):003:0> create_namespace "sample_data"
0 row(s) in 0.1330 seconds

hbase(main):009:0> create "sample_data:user", "data"
0 row(s) in 0.3390 seconds

=> Hbase::Table - sample_data:user

3) I wrote a Python program to read a CSV file that I generated with 100 records. The fields are: title|author|date|post

#!/usr/bin/env python

import csv 
import happybase
import time

batch_size = 100
host = "0.0.0.0"
file_path = "datatry.csv"
namespace = "sample_data"
row_count = 0
start_time = time.time()
table_name = "user"


def connect_to_hbase():
    conn = happybase.Connection(host = host,
        table_prefix = namespace,
        table_prefix_separator = "|")
    conn.open()
    table = conn.table(table_name)
    batch = table.batch(batch_size = batch_size)
    return conn, batch


def insert_row(batch, row):
    batch.put("data:tit":row[0],  "data:auth": row[1], "data:date": row[2], "data:post": row[3] )


def read_csv():
    csvfile = open(file_path, "r")
    csvreader = csv.reader(csvfile)
    return csvreader, csvfile


# After everything has been defined, run the script.
conn, batch = connect_to_hbase()
print "Connect to HBase. table name: %s, batch size: %i" % (table_name, batch_size)
csvreader, csvfile = read_csv()
print "Connected to file. name: %s" % (file_path)

try:
    # Loop through the rows. The first row contains column headers, so skip that
    # row. Insert all remaining rows into the database.
    for row in csvreader:
        row_count += 1
        if row_count == 1:
            pass
        else:
            insert_row(batch, row)

    # If there are any leftover rows in the batch, send them now.
    batch.send()
finally:
    # No matter what happens, close the file handle.
    csvfile.close()
    conn.close()

duration = time.time() - start_time
print "Done. row count: %i, duration: %.3f s" % (row_count, duration)

4) I get this syntax error:

[root@localhost Desktop]# python try_data.py
  File "try_data.py", line 60
    batch.put("data:tit":row[0],  "data:auth": row[1], "data:date": row[2], "data:post": row[3] )
                        ^
SyntaxError: invalid syntax
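A likely cause of this error: `"key": value` pairs are only valid inside a dict literal, so passing them bare as call arguments is a syntax error. HappyBase's `Batch.put` takes a row key followed by a dict that maps column names to values. The following is a minimal sketch of a corrected `insert_row`, not a confirmed fix. It assumes the first CSV field (the title) can serve as the row key, which is a guess based on the referenced gist; `row_to_put_args` is a hypothetical helper introduced only for illustration.

```python
def row_to_put_args(row):
    """Build the (row_key, data) pair that batch.put expects.

    Assumption: the first CSV field (title) is unique enough to act as the
    row key. Substitute another key if titles can repeat.
    """
    row_key = row[0]
    data = {
        "data:tit": row[0],
        "data:auth": row[1],
        "data:date": row[2],
        "data:post": row[3],
    }
    return row_key, data


def insert_row(batch, row):
    # batch is expected to be a happybase batch object (table.batch(...)).
    row_key, data = row_to_put_args(row)
    batch.put(row_key, data)
```

This matches the Python 2 setup used in the script. On Python 3, keys and values may need to be byte strings rather than `str`.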

Please advise.
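Separately from the syntax error, the table name may be worth checking. HappyBase prepends `table_prefix` and `table_prefix_separator` to every table name, while the HBase shell reports the table as `sample_data:user`. The sketch below only mirrors that string concatenation to compare the two separators. `prefixed_table_name` is a hypothetical helper, not part of HappyBase.

```python
def prefixed_table_name(prefix, separator, name):
    """Join prefix, separator and table name the way a prefixed lookup would."""
    return prefix + separator + name


# Name reported by the HBase shell when the table was created.
expected = "sample_data:user"

# Separator used in the script ("|") versus the namespace separator (":").
with_pipe = prefixed_table_name("sample_data", "|", "user")
with_colon = prefixed_table_name("sample_data", ":", "user")

print(with_pipe == expected)   # False: "sample_data|user" is a different name
print(with_colon == expected)  # True
```

If that reading is right, setting `table_prefix_separator = ":"` in `connect_to_hbase` would make the script address the table that was actually created.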

0 answers:

No answers