Scaling a database that receives streaming data with limited resources

Date: 2019-04-08 12:48:54

Tags: database mongodb database-design bigdata scalability

My use case is the following: I run about 60 websockets from 7 data sources in parallel that record stock tickers (i.e. time-series data). Currently, I'm writing the data into a MongoDB instance hosted on a Google Cloud VM, where every data source has its own collection and all collections live in the same database.
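For concreteness, the write path is roughly the following (a simplified sketch; the database name, collection names, and message fields here are illustrative, not my exact schema):

```python
# Simplified sketch of the write path: one websocket message becomes
# one document in the per-source collection. Names are illustrative.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["marketdata"]  # one database for all 7 sources

def on_message(source_name, msg):
    """Handle one tick from a websocket feed for the given source."""
    db[source_name].insert_one({
        "ticker_id": msg["symbol"],
        "price": msg["price"],
        "volume": msg["volume"],
        "datetime": datetime.now(timezone.utc),
    })
```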

However, the database has grown to 0.6 GB and ~10 million rows after only five days of data. I'm pretty new to such questions, but I have a feeling that this is not a viable long-term solution. I will never need all of the data at once, but I do need all of it in order to query by date / currency. However, as I understand it, such queries might become impossible once the dataset is bigger than my RAM. Is that true?

Moreover, this is a research project, but unfortunately I'm currently not able to use a university cluster, so I'm hosting the data on a private VM. This is subject to a budget constraint, however, and highly performant machines quickly become very expensive. That's why I'm questioning my design choice. Currently, I'm considering either switching to another kind of database (though I fear I'd run into the same issues again) or exporting the database to CSV once per week / month / whatever and then wiping it; a sketch of what I have in mind is below. That would be quite a hassle, though, and I'm also scared of losing data.
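Something like this is what I mean by export-and-wipe (a sketch only, reusing the illustrative schema from above; note that a crash between the export and the delete is exactly the data-loss scenario that worries me):

```python
# Sketch of a weekly "export to CSV, then wipe" job for one collection.
import csv
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["marketdata"]["ticks_source_a"]

# Export everything older than one week.
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

with open("ticks_source_a.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ticker_id", "price", "volume", "datetime"])
    for doc in coll.find({"datetime": {"$lt": cutoff}}):
        writer.writerow([doc["ticker_id"], doc["price"],
                         doc["volume"], doc["datetime"].isoformat()])

# Delete only what was exported, using the same cutoff.
coll.delete_many({"datetime": {"$lt": cutoff}})
```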

So my question is: how can I design this database such that I can still subset the data by one of the keys (either datetime or ticker_id) even when the database grows larger than my machine's RAM? Disk space is not an issue.
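To make the access patterns concrete, this is the kind of query I need to keep working, along with the indexes I assume would back it (a sketch using the illustrative schema above; the ticker value and dates are made up):

```python
# Sketch of the indexes and queries I mean. One index per access
# pattern keeps these queries from scanning the whole collection,
# even when it no longer fits in RAM.
from datetime import datetime

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["marketdata"]["ticks_source_a"]

coll.create_index([("datetime", ASCENDING)])
coll.create_index([("ticker_id", ASCENDING), ("datetime", ASCENDING)])

# Subset by ticker and time range, e.g. one ticker over one day.
cursor = coll.find({
    "ticker_id": "BTC-USD",
    "datetime": {"$gte": datetime(2019, 4, 1),
                 "$lt": datetime(2019, 4, 2)},
})
```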

1 Answer:

Answer 0 (score: 1)

Adding to the comments Alex Blex has already made about storage and performance:

Query response times (with nearly 10 million rows after just 5 days) will only get worse as the dataset grows. You could look into sharding, which breaks the collection into reasonably sized chunks spread across machines while still letting you query all of the data.
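A minimal sketch of what that could look like (this assumes a sharded cluster with a mongos router is already running, and reuses the illustrative names from the question; shardCollection also requires an index on the shard key, which MongoDB creates automatically for an empty collection):

```python
# Sketch: shard the tick data on (ticker_id, datetime) so that each
# ticker's documents stay together and time-range queries within a
# ticker touch few shards. Connect to the mongos router, not a shard.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # hypothetical host

client.admin.command("enableSharding", "marketdata")
client.admin.command(
    "shardCollection",
    "marketdata.ticks_source_a",
    key={"ticker_id": 1, "datetime": 1},
)
```

Leading the shard key with ticker_id rather than datetime avoids funnelling all new inserts (which are monotonically increasing in time) into a single chunk.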