Author Topic: Python big data library (ideally S3 server compatible)?


Offline Martin F (Topic starter)

  • Regular Contributor
  • *
  • Posts: 149
  • Country: dk
Python big data library (ideally S3 server compatible)?
« on: April 08, 2019, 01:09:07 pm »
Hi all,

we're doing some tests on an AWS S3 server with large amounts of log file data with the following properties:
- The time series data is stored in binary files (MDF 4.1) in chunks of 5-50 MB
- We need to process data across potentially hundreds of files and several GB of raw data
- From this raw data we need to extract specific information, e.g. statistical details for parts of the data, or plot a subset of the time series
- We already have libraries for working with MDF files and with S3 in Python

What we're missing is an open source Python module/API suitable for working with very large data sets, ideally with built-in S3 integration. We'd like to be able to specify, for example, that we need data from date X to Y, and then process only the data relevant to that period, without having to download all the relevant files locally.
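To make it a bit more concrete, the access pattern we're hoping for is roughly the following (just a sketch - it assumes the s3fs package and a date-encoded key layout, neither of which we've settled on, and the bucket/key names are made up):

Code: [Select]
import s3fs

# Assumed key layout: my-log-bucket/logs/yyyy/mm/dd/<device>.mf4
fs = s3fs.S3FileSystem()

# List only the objects for the day(s) we care about
keys = fs.glob("my-log-bucket/logs/2019/04/08/*.mf4")

for key in keys:
    # s3fs file objects fetch byte ranges on demand, so in principle
    # we would not have to download each 5-50 MB file in full
    with fs.open(key, "rb") as f:
        header = f.read(64)   # e.g. peek at the MDF header block
        # ... hand the file object to our MDF library here ...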

I know this is a rather broad question, but I'd be happy for any pointers to guides/modules for working with this kind of data scenario in Python.

Thanks,
Martin
 

Offline Tadas

  • Contributor
  • Posts: 17
Re: Python big data library (ideally S3 server compatible)?
« Reply #1 on: April 08, 2019, 08:49:24 pm »
S3 is pretty good for stuff like this, especially if you plan to use other AWS services to process it (e.g. make sure your S3 bucket and, say, your EC2 instances are in the same region).

Since it's time series data, can you chunk it into fixed-size pieces (for example 10 minutes each)? Ideally each chunk should be an easy-to-handle size, so it's effortless to pull one for local processing, but not too small either, so that downloading an average working set doesn't mean pulling tens or hundreds of thousands of files (if the files are too small, the number of HTTP requests becomes the overhead). Something in the range of 1 to 100 MB should do fine.

Then it's a question of picking a good file naming strategy. For example, system logs are commonly put into something along the lines of yyyy/mm/dd/HH/(00|10|20|30|40|50).gz (given that a 10-minute log volume has a manageable size, as described above).
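A rough sketch of what that listing step looks like with boto3 (bucket name, prefix layout and time window here are just placeholders):

Code: [Select]
import boto3

s3 = boto3.client("s3")
bucket = "my-log-bucket"   # placeholder

# With keys laid out as yyyy/mm/dd/HH/MM.gz you can select a time
# window purely by listing prefixes, without touching the objects.
paginator = s3.get_paginator("list_objects_v2")
keys = []
for hour in range(8, 12):  # e.g. 08:00-11:59 on one day
    prefix = "2019/04/08/{:02d}/".format(hour)
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys += [obj["Key"] for obj in page.get("Contents", [])]

print(len(keys), "objects in the selected window")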

I'd still be pulling those files into the local file system (or memory) for processing. I don't think you can use S3 as your local file system conveniently.
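For the in-memory route, something like this would do (boto3 again, names are placeholders; decompression and MDF parsing are left to whatever libraries you already have):

Code: [Select]
import io
import boto3

s3 = boto3.client("s3")

def load_chunk(bucket, key):
    """Fetch one log chunk from S3 straight into memory."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return io.BytesIO(body)   # file-like object, no temp file on disk

buf = load_chunk("my-log-bucket", "2019/04/08/10/00.gz")
# ... gunzip / hand off to the MDF reader here ...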
 

Offline Martin F (Topic starter)

  • Regular Contributor
  • *
  • Posts: 149
  • Country: dk
Re: Python big data library (ideally S3 server compatible)?
« Reply #2 on: April 10, 2019, 10:31:24 am »
Hi again, thanks a lot.

The files can be a configurable size (1-512 MB per file).

Our challenge is that an end user may want to perform processing (e.g. calculating statistics) across a very large set of files - e.g. 100 files comprising 5 GB in total. Loading all these files and processing them in full would be rather tedious, as the user may only want to fetch a specific parameter (say Vehicle Speed) out of potentially 40+ parameters.

We looked at the S3 Select API, which performs SQL-like queries on the data. However, our log file format is not among the supported ones (CSV, JSON, Parquet). Rather, our log files are in a binary format and will typically be processed in Python using a library intended for that file format.
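For reference, this is roughly what the S3 Select call looks like for a supported format such as CSV - i.e. exactly what we cannot feed it with our binary MDF files (boto3, with made-up names and columns):

Code: [Select]
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-log-bucket",
    Key="converted/2019/04/08/signals.csv",
    ExpressionType="SQL",
    Expression="SELECT s.timestamp, s.vehicle_speed FROM S3Object s",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())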

I guess our challenge is on a couple of aspects:
1) Is there a simple way for a user to work with selected data locally - without fetching the entire bulk - when our log file format is non-standard? (See the range-request sketch below the list for the kind of partial fetch I mean.)
2) Should our data log files be "converted" or "moved" into some form of AWS database structure for easier processing? Is that possible for a custom data format?
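On 1), the only low-level mechanism we're aware of is an HTTP range request, e.g. via boto3 as below - but the byte offsets would have to come from the MDF block structure, which we would still need to locate first (all names and offsets here are made up):

Code: [Select]
import boto3

s3 = boto3.client("s3")

# Fetch only a byte range of one object instead of the whole 1-512 MB file.
# In practice the offsets would come from the MDF block index, e.g. the
# data block holding the Vehicle Speed channel.
resp = s3.get_object(
    Bucket="my-log-bucket",
    Key="logs/2019/04/08/engine.mf4",
    Range="bytes=1048576-2097151",   # a 1 MB slice
)
chunk = resp["Body"].read()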

Thanks again,
Martin
 

Offline olkipukki

  • Frequent Contributor
  • **
  • Posts: 790
  • Country: 00
Re: Python big data library (ideally S3 server compatible)?
« Reply #3 on: April 10, 2019, 11:09:12 am »

Quote from: Martin F on April 10, 2019, 10:31:24 am
2) Should our data log files be "converted" or "moved" into some form of AWS database structure for easier processing? Is that possible for a custom data format?


Have a look at AWS Glue, Athena, EMR...
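E.g. if you land the decoded signals in S3 as Parquet (Glue can do the conversion and catalogue them), Athena can query them straight from S3. A minimal sketch via boto3, where the database, table, columns and result bucket are all made up:

Code: [Select]
import boto3

athena = boto3.client("athena")

# Assumes the decoded signals were written to S3 as Parquet and
# registered in the Glue data catalog as logs.vehicle_signals.
resp = athena.start_query_execution(
    QueryString="""
        SELECT avg(vehicle_speed)
        FROM logs.vehicle_signals
        WHERE ts BETWEEN timestamp '2019-04-01' AND timestamp '2019-04-07'
    """,
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},
)
print(resp["QueryExecutionId"])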
 

