Friday, January 9, 2015

How to efficiently store big time series data?


I need to store, and be able to query, a very large amount of time series data.


Properties of the data are as follows:



  • number of series: around 12,000 (twelve thousand)

  • number of data points, globally: around 500,000,000 per month (five hundred million)

  • mixed value types: the majority of data points are floating point values, the rest are strings (see the sketch after this list)

  • sampling period: variable between series as well as within a series

  • data retention period: several years

  • data archives need to be built in near realtime, but a reasonable delay (~1 hour) is acceptable

  • past data can be rebuilt if needed, but at a high cost

  • sometimes, but quite rarely, some past data needs to be updated
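
To make these properties concrete, here is roughly how I picture one series in memory before it is written out: the irregular sampling is kept as a plain timestamp index, and string-valued points live in a separate table from the floats. All names below are made up for illustration.

    import numpy as np
    import pandas as pd

    # Illustrative only: "sensor_0042" would be a placeholder series name.
    # Sampling is irregular, so each point simply carries its own timestamp.
    timestamps = pd.to_datetime([
        "2015-01-09 12:00:00",
        "2015-01-09 12:00:37",
        "2015-01-09 12:02:05",
    ])
    float_points = pd.DataFrame({"value": np.random.randn(len(timestamps))},
                                index=timestamps)

    # String-valued points would live in their own table so that the float
    # tables stay homogeneous (and therefore compact and queryable) on disk.
    string_points = pd.DataFrame({"value": ["OK", "OK", "ALARM"]},
                                 index=timestamps)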


Properties of envisioned queries:



  • most of the queries against the data will be timestamp-range queries, spanning anywhere from one day to several months or years; 90%+ will target the most recent data (see the sketch below)
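
For illustration, the typical read I have in mind would look something like this, assuming the points end up in a PyTables-backed HDFStore in table format (the file name and key are placeholders):

    import pandas as pd

    # Illustrative query: pull one series over a date range without loading
    # the whole table, relying on the on-disk index of the 'table' format.
    with pd.HDFStore("timeseries.h5", mode="r") as store:
        recent = store.select(
            "float/sensor_0042",
            where="index >= '2015-01-01' & index < '2015-01-09'",
        )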




My initial thought was to use PyTables / Pandas instead of an SQL database.


Question: Assuming PyTables / Pandas is the "best" route, would it be better to split the data into several HDF files, each spanning a given period of time, or to put everything in a single file that would then become huge?
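
For what it is worth, the multi-file variant I am imagining would route each batch of points to the HDF file covering its month and, at query time, open only the files that overlap the requested range. A minimal sketch of that idea, with made-up file and key names:

    import os
    import pandas as pd

    def month_file(ts):
        # One HDF file per calendar month, e.g. ts_2015-01.h5 (naming is arbitrary).
        return "ts_{:%Y-%m}.h5".format(pd.Timestamp(ts))

    def append_batch(series_key, batch):
        # Append a DataFrame of new points to the file of the month they belong to
        # (the batch is assumed to fall within a single calendar month here).
        with pd.HDFStore(month_file(batch.index[0]),
                         complevel=9, complib="blosc") as store:
            store.append(series_key, batch, format="table", data_columns=True)

    def read_range(series_key, start, end):
        # Concatenate the pieces of one series from every monthly file
        # overlapping [start, end).
        parts = []
        for month in pd.period_range(start, end, freq="M"):
            path = "ts_{}.h5".format(month)
            if not os.path.exists(path):
                continue
            with pd.HDFStore(path, mode="r") as store:
                if series_key in store:
                    parts.append(store.select(
                        series_key,
                        where="index >= '{}' & index < '{}'".format(start, end),
                    ))
        return pd.concat(parts) if parts else pd.DataFrame()

Whether this routing layer actually buys anything over letting a single large file grow is exactly what I am unsure about.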


And if that's not the best approach, how should I structure this data store, or what technologies should I be considering? I'm not the first to tackle storing large sets of time series data; what is the general approach to solving this challenge?




