How to delete data from a RethinkDB database using its changefeed

Time: 2016-10-12 09:36:54

Tags: python rethinkdb

I am using a 'controller' for a database which continuously accumulates data, but only uses the most recent data, defined as less than 3 days old. Once the data is more than 3 days old, I want to dump it to a JSON file and remove it from the database.

To simulate this, I did the following. The 'controller' program is rethinkdb_monitor.py:

import json
import rethinkdb as r
import pytz
from datetime import datetime, timedelta

# The database and table are assumed to have been previously created
database_name = "sensor_db"
table_name = "sensor_data"

port_offset = 1         # To avoid interference of this testing program with the main program, all ports are initialized at an offset of 1 from the default ports using "rethinkdb --port_offset 1" at the command line.
conn = r.connect("localhost", 28015 + port_offset)

current_time = datetime.utcnow().replace(tzinfo=pytz.utc)   # Current time include timezone (assumed UTC)
retention_period = timedelta(days=3)                        # Period of time during which data is retained on the main server
expiry_time = current_time - retention_period               # Age of data which is removed from the main server

data_to_archive = r.db(database_name).table(table_name).filter(r.row['timestamp'] < expiry_time)
output_file = "archived_sensor_data.json"

with open(output_file, 'a') as f:
    for change in data_to_archive.changes().run(conn, time_format="raw"):        # The time_format="raw" option is passed to prevent a "RqlTzinfo object is not JSON serializable" error when dumping
        print(change)
        json.dump(change['new_val'], f)             # Since the main database we are reading from is append-only, the 'old_val' of the change is always None and we are interested in the 'new_val' only
        f.write("\n")                               # Separate entries by a new line

Before running this program, I started RethinkDB with

rethinkdb --port_offset 1

at the command line, and used the web interface at localhost:8081 to create a database named sensor_db containing a table named sensor_data (see the screenshot below).

[Screenshot: the sensor_db database with its sensor_data table in the RethinkDB web interface]
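
(As an aside, the same database and table setup could presumably also be done from Python rather than through the web interface; a minimal sketch, assuming the same port offset as above:)

import rethinkdb as r

port_offset = 1
conn = r.connect("localhost", 28015 + port_offset)

# Create the database and table only if they do not already exist
if "sensor_db" not in r.db_list().run(conn):
    r.db_create("sensor_db").run(conn)
if "sensor_data" not in r.db("sensor_db").table_list().run(conn):
    r.db("sensor_db").table_create("sensor_data").run(conn)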

After running rethinkdb_monitor.py and waiting for changes, I ran a script rethinkdb_add_data.py which generates synthetic data:

import random
import faker
from datetime import datetime, timedelta
import pytz
import rethinkdb as r

class RandomData(object):
    def __init__(self, seed=None):
        self._seed = seed
        self._random = random.Random()
        self._random.seed(seed)
        self.fake = faker.Faker()
        self.fake.random.seed(seed)

    def __getattr__(self, x):
        return getattr(self._random, x)

    def name(self):
        return self.fake.name()

    def datetime(self, start=None, end=None):
        if start is None:
            start = datetime(2000, 1, 1, tzinfo=pytz.utc)  # Jan 1st 2000
        if end is None:
            end = datetime.utcnow().replace(tzinfo=pytz.utc)

        if isinstance(end, datetime):
            dt = end - start
        elif isinstance(end, timedelta):
            dt = end
        assert isinstance(dt, timedelta)

        random_dt = timedelta(microseconds=self._random.randrange(int(dt.total_seconds() * (10 ** 6))))
        return start + random_dt

# Rethinkdb has been started at a port offset of 1 using the "--port_offset 1" argument.
port_offset = 1
conn = r.connect("localhost", 28015 + port_offset).repl()

rd = RandomData(seed=0)         # Instantiate and seed a random data generator

# The database and table have been previously created (e.g. through the web interface at localhost:8081)
database_name = "sensor_db"
table_name = "sensor_data"

# Generate random data with timestamps uniformly distributed over the past 6 days
random_data_time_interval = timedelta(days=6)
start_random_data = datetime.utcnow().replace(tzinfo=pytz.utc) - random_data_time_interval

for _ in range(5):
    entry = {"name": rd.name(), "timestamp": rd.datetime(start=start_random_data)}
    r.db(database_name).table(table_name).insert(entry).run()

After interrupting rethinkdb_monitor.py with Ctrl + C, the archived_sensor_data.json file contains the data to be archived:

{"timestamp": {"timezone": "+00:00", "$reql_type$": "TIME", "epoch_time": 1475963599.347}, "id": "be2b5fd7-28df-48ee-b744-99856643265a", "name": "Elizabeth Woods"}
{"timestamp": {"timezone": "+00:00", "$reql_type$": "TIME", "epoch_time": 1475879797.486}, "id": "36d69236-f710-481b-82b6-4a62a1aae36c", "name": "Susan Wagner"}

What I am still struggling with, however, is how to subsequently delete this data from the database. The delete command can apparently be called on a table or a selection, but the change obtained through the changefeed is just a dictionary.
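
(For comparison, a one-off, batch-style deletion on a selection, without any changefeed, would look roughly like this, assuming conn, database_name, table_name and expiry_time as defined in rethinkdb_monitor.py above:)

# Batch-style alternative: delete every document older than expiry_time in one query
deletion_result = (r.db(database_name).table(table_name)
                   .filter(r.row['timestamp'] < expiry_time)
                   .delete()
                   .run(conn))
print(deletion_result)      # e.g. {'deleted': 2, 'errors': 0, ...}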

How can I use the changefeed to continuously delete data from the database?

1 answer:

Answer 0 (score: 0)

I used the fact that each change contains the ID of the corresponding document in the database, and created a selection with this ID using get:

with open(output_file, 'a') as f:
    for change in data_to_archive.changes().run(conn, time_format="raw"):        # The time_format="raw" option is passed to prevent a "RqlTzinfo object is not JSON serializable" error when dumping
        print(change)
        if change['new_val'] is not None:               # If the change is not a deletion
            json.dump(change['new_val'], f)             # Since the main database we are reading from is append-only, the 'old_val' of the change is always None and we are interested in the 'new_val' only
            f.write("\n")                               # Separate entries by a new line
            ID_to_delete = change['new_val']['id']                # Get the ID of the data to be deleted from the database
            r.db(database_name).table(table_name).get(ID_to_delete).delete().run(conn)

The delete operations themselves are also registered as changes, but I have filtered them out with the if change['new_val'] is not None statement.
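
(To spell out why that filter works: in a RethinkDB changefeed, insertions and deletions have complementary shapes, with deletions carrying new_val as None. Illustrative values only:)

# Illustrative shapes of changefeed events (field values are placeholders):
insertion_change = {'old_val': None, 'new_val': {'id': '...', 'name': '...'}}
deletion_change = {'old_val': {'id': '...', 'name': '...'}, 'new_val': None}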