Scaling a database that receives streaming data with limited resources

Date: 2019-04-08 12:48:54

Tags: database mongodb database-design bigdata scalability

My use case is the following: I run about 60 websockets from 7 data sources in parallel that record stock tickers (i.e. time-series data). Currently, I'm writing the data into a MongoDB instance hosted on a Google Cloud VM, where every data source has its own collection and all collections live in the same database.
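For concreteness, the write path is roughly the following (a simplified sketch; the database name, collection names, and message fields here are illustrative, not my exact schema):

```python
# Simplified sketch of the write path: one websocket message becomes
# one document in the per-source collection. Names are illustrative.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["marketdata"]  # one database for all 7 sources

def on_message(source_name, msg):
    """Handle one tick from a websocket feed for the given source."""
    db[source_name].insert_one({
        "ticker_id": msg["symbol"],
        "price": msg["price"],
        "volume": msg["volume"],
        "datetime": datetime.now(timezone.utc),
    })
```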

However, the database has grown to 0.6 GB and ~10 million rows after only five days of data. I'm pretty new to such questions, but I have a feeling that this is not a viable long-term solution. I will never need all of the data at once, but I do need all of it in order to query by date / currency. However, as I understand it, such queries might become impossible once the dataset is bigger than my RAM. Is that true?

Moreover, this is a research project, but unfortunately I'm currently not able to use a university cluster, so I'm hosting the data on a private VM. This is subject to a budget constraint, however, and highly performant machines quickly become very expensive. That's why I'm questioning my design choice. Currently, I'm considering either switching to another kind of database (though I fear I'd run into the same issues again) or exporting the database to CSV once per week / month / whatever and then wiping it; a sketch of what I have in mind is below. That would be quite a hassle, though, and I'm also scared of losing data.
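Something like this is what I mean by export-and-wipe (a sketch only, reusing the illustrative schema from above; note that a crash between the export and the delete is exactly the data-loss scenario that worries me):

```python
# Sketch of a weekly "export to CSV, then wipe" job for one collection.
import csv
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["marketdata"]["ticks_source_a"]

# Export everything older than one week.
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

with open("ticks_source_a.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ticker_id", "price", "volume", "datetime"])
    for doc in coll.find({"datetime": {"$lt": cutoff}}):
        writer.writerow([doc["ticker_id"], doc["price"],
                         doc["volume"], doc["datetime"].isoformat()])

# Delete only what was exported, using the same cutoff.
coll.delete_many({"datetime": {"$lt": cutoff}})
```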

So my question is: how can I design this database such that I can still subset the data by one of the keys (either datetime or ticker_id) even when the database grows larger than my machine's RAM? Disk space is not an issue.
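To make the access patterns concrete, this is the kind of query I need to keep working, along with the indexes I assume would back it (a sketch using the illustrative schema above; the ticker value and dates are made up):

```python
# Sketch of the indexes and queries I mean. One index per access
# pattern keeps these queries from scanning the whole collection,
# even when it no longer fits in RAM.
from datetime import datetime

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["marketdata"]["ticks_source_a"]

coll.create_index([("datetime", ASCENDING)])
coll.create_index([("ticker_id", ASCENDING), ("datetime", ASCENDING)])

# Subset by ticker and time range, e.g. one ticker over one day.
cursor = coll.find({
    "ticker_id": "BTC-USD",
    "datetime": {"$gte": datetime(2019, 4, 1),
                 "$lt": datetime(2019, 4, 2)},
})
```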

1 Answer:

Answer 0 (score: 1)

Adding to the comments Alex Blex has already made about storage and performance:

Query response times (with nearly 10 million rows after just 5 days) will only get worse as the dataset grows. You could look into sharding, which breaks the collection into reasonably sized chunks spread across machines while still letting you query all of the data.
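A minimal sketch of what that could look like (this assumes a sharded cluster with a mongos router is already running, and reuses the illustrative names from the question; shardCollection also requires an index on the shard key, which MongoDB creates automatically for an empty collection):

```python
# Sketch: shard the tick data on (ticker_id, datetime) so that each
# ticker's documents stay together and time-range queries within a
# ticker touch few shards. Connect to the mongos router, not a shard.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # hypothetical host

client.admin.command("enableSharding", "marketdata")
client.admin.command(
    "shardCollection",
    "marketdata.ticks_source_a",
    key={"ticker_id": 1, "datetime": 1},
)
```

Leading the shard key with ticker_id rather than datetime avoids funnelling all new inserts (which are monotonically increasing in time) into a single chunk.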