I need to store, and be able to query, very large amounts of time series data.
Properties of the data are as follows:
- number of series: around 12,000 (twelve thousand)
- number of data points, globally: around 500,000,000 per month (five hundred million)
- mixed value types: the majority of data points are floating point values, the rest are strings
- sampling period: variable between series as well as within a series
- data retention period: several years
- data archives need to be built in near realtime, but a reasonable delay (~1 hour) is acceptable
- past data can be rebuilt if needed, but at a high cost
- sometimes, but quite rarely, some past data needs to be updated
Properties of envisioned queries:
- most queries against the data will be timestamp-based, ranging from one day to several months/years; 90%+ will be queries on the most recent data
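For concreteness, this is roughly the query shape I have in mind, assuming a table-format HDFStore indexed by timestamp; the file name, the "points" key and the "series_id" column are just placeholders, not an existing schema:

```python
import pandas as pd

# Pull one series over a date range, filtering on disk rather than in memory.
# Assumes a table-format node indexed by timestamp, with "series_id" declared
# as a data column so it can appear in the where clause.
with pd.HDFStore("timeseries.h5", mode="r") as store:
    recent = store.select(
        "points",
        where="index >= '2014-01-01' & index < '2014-02-01' "
              "& series_id == 'sensor_0042'",
    )
```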
My initial thought was to use PyTables / Pandas instead of an SQL database.
Question: Assuming PyTables / Pandas is the "best" route, would it be better to split the data into several HDF files, each one spanning a given period of time, or to put everything in a single file that would then become huge? A sketch of the split-by-period option I was imagining follows below.
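This is only a sketch under my own assumptions: one file per month (e.g. data/ts_2014-03.h5), each holding a table-format node "points" indexed by timestamp with "series_id" as a data column; all names and the layout are hypothetical.

```python
import pandas as pd

def month_files(start, stop):
    """Paths of the monthly files overlapping the requested range."""
    return ["data/ts_%s.h5" % p for p in pd.period_range(start, stop, freq="M")]

def append_points(chunk, path):
    """Append a new chunk of data points to the month file it belongs to."""
    with pd.HDFStore(path, complevel=9, complib="blosc") as store:
        store.append("points", chunk, data_columns=["series_id"])

def query_range(start, stop, series_id):
    """Open only the relevant months and filter on disk with a where clause."""
    where = ("index >= %r & index <= %r & series_id == %r"
             % (start, stop, series_id))
    frames = [pd.read_hdf(path, "points", where=where)
              for path in month_files(start, stop)]
    return pd.concat(frames).sort_index()
```

The appeal would be that the 90%+ of queries hitting recent data only ever open one or two files, and old months can be archived or rebuilt independently; but I don't know whether that actually beats a single big, properly indexed file.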
And if that's not the best approach, how should I structure this data store, or what technologies should I be considering? I'm not the first to tackle storing large sets of time series data; what is the general approach to resolving this challenge?