Hi all,
We're running tests against AWS S3 with large amounts of log file data, with the following properties:
- The time series data is stored in binary files (MDF 4.1), each 5-50 MB in size
- We need to process data spanning potentially hundreds of files and several GB of raw data
- From this raw data we need to extract specific information, e.g. compute statistics for parts of the data or plot a subset of the time series
- We already have Python libraries for working with MDF files and with S3 (a rough sketch of our current per-file workflow is below)
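For context, a minimal sketch of what we do per file today, using boto3 and asammdf (bucket, key, and channel name are placeholders). The point is that every file gets downloaded in full before we can touch it:

```python
import boto3
from asammdf import MDF

s3 = boto3.client("s3")

# Current approach: fetch the whole object, then parse it locally.
# Bucket, key, and channel name below are made-up placeholders.
s3.download_file("our-log-bucket", "vehicle-42/2023-05-01.mf4", "/tmp/log.mf4")

mdf = MDF("/tmp/log.mf4")
sig = mdf.get("EngineSpeed")  # one time-series channel
print(sig.timestamps[:5], sig.samples[:5])
```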
What we're missing is an open-source Python module/API suited to working with very large data sets, ideally with built-in S3 integration. For example, we'd like to request data from date X to date Y and then process only the data relevant to that period, without first downloading all the relevant files locally.
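To make the desired access pattern concrete, here is a rough sketch of what we'd like to be able to write, using s3fs to stream the objects instead of downloading them (bucket name, key layout, and channel name are made up; whether a parser like asammdf can read selectively from such a file object, rather than effectively pulling the whole file anyway, is exactly the part we're unsure about):

```python
import s3fs
import numpy as np
from asammdf import MDF

fs = s3fs.S3FileSystem()  # credentials from the usual AWS config/env

# Hypothetical layout: one object per recording, named by date, so a
# date range can at least be narrowed down via the key names first.
keys = fs.glob("our-log-bucket/logs/2023-05-*.mf4")

stats = []
for key in keys:
    # s3fs returns a seekable file-like object backed by ranged GETs,
    # so in principle only the accessed byte ranges are transferred.
    with fs.open(key, "rb") as f:
        mdf = MDF(f)  # assuming asammdf accepts a seekable file object
        sig = mdf.get("EngineSpeed")  # placeholder channel name
        stats.append((sig.samples.min(), sig.samples.mean(), sig.samples.max()))

print(np.array(stats))
```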
I know this is a rather broad question, but I'd be grateful for any pointers to guides or modules for handling this kind of data scenario in Python.
Thanks,
Martin