Data Cooker ETL

Apache Spark Application
for Cost Effective
Big Data Processing

Stop wasting time on 'no-code visual programming'. Define your ETL processes as code in a declarative language instead. Write it in your favorite code editor, store it in a Git repo, and deploy it via your favorite CI service.

Documentation

Custom SQL Dialect

ETL code feels familiar from the first SELECT

Language Specs

Extensible

Open Source and simple object model

Fork Me On GitHub

CLI with REPL

Batch execution. Fully interactive debugging

Documentation

Any Data Storage

Hadoop File Systems, S3-compatible*, JDBC*

* Get On GitHub

Low Cost of Ownership

Effective utilization of cluster resources

Ease of ETL code maintenance

Cost effective extension development

Simplicity of process testing and debugging

Connect with us

Ease of Implementation

Single FatJAR for all purposes

Permanent cluster not needed

We'll help with setup...

...and with connection to data sources

Connect with us

Lean ETL Methodology

You don't need anything on the cluster, except Spark

Cloud? Hardware? Everything!

Send your script to the cluster and get the result

...and have REPL for local debugging

Documentation

SQL Dialect Tailored for ETL Tasks

Everything you expect from SQL

Familiar, so very comfortable

TRANSFORM operator, for data transformation

Object-oriented type system with geometry support

Language Specs

Data Catalog Not Needed

So there are no schema maintenance costs

Everything is defined on the fly

...and only if really needed

Mutate your data with a light heart

Documentation

Data Set Formats

Column-based: Parquet, Delimited Text (CSV/TSV)

Text-based: PlainText

Structured: JSON

Geometric: GeoJSON, GPX

Documentation

Storage adapters

Hadoop File Systems

S3-compatible*

Any DBMS or NoSQL via JDBC*

* Adapters are pluggable and can be extended via the Java API

* Get On GitHub

Ready To Use Algorithms

Data Cooker Operations are like SQL stored procedures or UDFs, except that they are written against the low-level Spark RDD API. As a result, they execute much faster than either.

Date and time, geohashing, data series calculation, population statistics, geofencing, track data analysis — 22 Operations out of the box.
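To give a feel for what one of these operations computes, here is a minimal sketch of textbook geohash encoding in plain Python. It is not Data Cooker's implementation (the bundled Operations are written in Java against the Spark RDD API), and the function name `geohash_encode` and its parameters are hypothetical, chosen only for this illustration.

```python
# Standard geohash alphabet: base32 without the letters a, i, l, o.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a geohash string.

    Illustrative helper only; not part of Data Cooker ETL.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0          # accumulates 5 bits per output character
    bit_count = 0
    use_lon = True    # bits alternate, starting with longitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744))  # u4pruy
```

Nearby points share a geohash prefix, which is what makes the value useful as a grouping key for data series and geofencing tasks.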

Fork Me On GitHub

Data Set Transformations

Pluggable Transforms (written in Java, like Operations), combined with object-oriented SELECT capabilities, let you flexibly and easily convert any supported data format into another.

21 Transforms out of the box!

Fork Me On GitHub

Extensible Object Model

...we had two dozen operations, the same number of transforms, a bunch of storage adapters, and a^W Oops, wrong genre.

What we really want to convey: if what comes out of the box is not enough, you can implement your own. The code is open and the extension API is simple. Docs for your object model extensions are generated automatically, too.

Fork Me On GitHub

A Multitude of Execution Modes

Batch Local, Batch On-Cluster, Local Interactive, Server On-Cluster, Interactive Console Client... all with additional options.

Simply put, a single FatJAR includes everything needed for both testing and production environments. And because the Client and Server talk over a simple REST protocol, you can easily integrate Data Cooker ETL into any browser-based dashboard or notebook your data analysts prefer.

Documentation

Easy ETL Manageability

Variables are supported in all contexts of the language

Control flow with loops and branching

Parameter evaluation at run time

Smart data set partitioning at run time

Documentation

Get in Touch

You can implement it yourself, or you can subscribe

We're madly in love with Open Source and won't mind if you just fork the code and never call us back.

But we also have six years of real-world production experience in a serious geoinformatics analysis project that ran Data Cooker ETL in the Amazon cloud. We are happy to share that expertise, accumulated over hundreds of ETL processes executed many thousands of times, for a reasonable fee.

Connect with us
