I need to store, and be able to query, very large amounts of time series data.
Properties of the data are as follows:
- number of series: around 12,000 (twelve thousand)
- number of data points, globally: around 500,000,000 per month (five hundred million)
- mixed value types: the majority of data points are floating point values, the rest are strings
- sampling period: variable between series as well as within a series
- data retention period: several years
- data archives need to be built in near realtime, but a reasonable delay (~1 hour) is acceptable
- past data can be rebuilt if needed, but at a high cost
- sometimes, but quite rarely, some past data needs to be updated
Properties of envisioned queries:
- most queries against the data will be timestamp-based, ranging from one day to several months/years; 90%+ will be queries on the most recent data
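For concreteness, this is roughly the query shape I have in mind, assuming a table-format HDFStore indexed by timestamp; the file name, the "points" key and the "series_id" column are just placeholders, not an existing schema:

```python
import pandas as pd

# Pull one series over a date range, filtering on disk rather than in memory.
# Assumes a table-format node indexed by timestamp, with "series_id" declared
# as a data column so it can appear in the where clause.
with pd.HDFStore("timeseries.h5", mode="r") as store:
    recent = store.select(
        "points",
        where="index >= '2014-01-01' & index < '2014-02-01' "
              "& series_id == 'sensor_0042'",
    )
```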
My initial thought was to use PyTables / Pandas instead of an SQL database.
Question: Assuming PyTables / Pandas is the "best" route, would it be better to split the data into several HDF files, each one spanning a given period of time, or to put everything in a single file that would then become huge? A sketch of the split-by-period option I was imagining follows below.
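This is only a sketch under my own assumptions: one file per month (e.g. data/ts_2014-03.h5), each holding a table-format node "points" indexed by timestamp with "series_id" as a data column; all names and the layout are hypothetical.

```python
import pandas as pd

def month_files(start, stop):
    """Paths of the monthly files overlapping the requested range."""
    return ["data/ts_%s.h5" % p for p in pd.period_range(start, stop, freq="M")]

def append_points(chunk, path):
    """Append a new chunk of data points to the month file it belongs to."""
    with pd.HDFStore(path, complevel=9, complib="blosc") as store:
        store.append("points", chunk, data_columns=["series_id"])

def query_range(start, stop, series_id):
    """Open only the relevant months and filter on disk with a where clause."""
    where = ("index >= %r & index <= %r & series_id == %r"
             % (start, stop, series_id))
    frames = [pd.read_hdf(path, "points", where=where)
              for path in month_files(start, stop)]
    return pd.concat(frames).sort_index()
```

The appeal would be that the 90%+ of queries hitting recent data only ever open one or two files, and old months can be archived or rebuilt independently; but I don't know whether that actually beats a single big, properly indexed file.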
And if that's not the best approach, how should I structure this data store, or what technologies should I be considering? I'm not the first to tackle storing large sets of time series data; what is the general approach to resolving this challenge?