
Data Cooker ETL

Apache Spark Application
for Cost Effective
Big Data Processing

Stop wasting time on 'no-code visual programming'. Define your ETL processes as code in a declarative language instead. Write it in your favorite code editor, store it in a git repo, and deploy it via your favorite CI service


Custom SQL Dialect

ETL code that is familiar from the first SELECT

Language Specs


Open Source and simple object model

Fork Me On GitHub


Batch execution. Fully interactive debugging

Any Storage

Any Data Storage

Hadoop File Systems, S3-compatible*, JDBC*

* Get On GitHub

Low Cost of Ownership

  • Effective utilization of cluster resources
  • Ease of ETL code maintenance
  • Cost effective extension development
  • Simplicity of process testing and debugging

Connect with us

Ease of Implementation

  • Single FatJAR for all purposes
  • Permanent cluster not needed
  • We'll help with the setup...
  • ...and with connecting to your data sources

Connect with us

Lean ETL Methodology

  • You don't need anything on the cluster except Spark
  • Cloud? On-premises hardware? Anything works!
  • Send your script to the cluster and get the result
  • ...or use the REPL for local debugging


SQL Dialect Tiered for ETL Tasks

  • Everything you expect from SQL
  • Familiar, and therefore comfortable
  • The TRANSFORM operator for converting data between formats
  • Object-oriented type system with geometry support
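For illustration only (the authoritative syntax lives in the Language Specs; every identifier and keyword below is a made-up assumption, not actual Data Cooker syntax), a script in such a dialect might read along these lines:

```sql
-- Hypothetical sketch, not the documented dialect:
-- load a data set, filter it with a plain SELECT,
-- then convert it with a TRANSFORM
CREATE DS signals FROM 'source_path';

SELECT user_id, lat, lon
  FROM signals
  WHERE accuracy < 50
  INTO filtered;

TRANSFORM filtered parquet;
```

The point is the shape, not the keywords: plain SQL where SQL suffices, plus a dedicated operator where SQL has no answer.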

Language Specs

Data Catalog Not Needed

  • So there are no schema maintenance costs
  • Everything is defined on the fly
  • ...and only if really needed
  • Mutate your data with a light heart


Data Set Formats

  • Column-based: Parquet, Delimited Text (CSV/TSV)
  • Text-based: PlainText
  • Structured: JSON
  • Geometric: GeoJSON, GPX


Storage adapters

  • Hadoop File Systems
  • S3-compatible*
  • Any DBMS or NoSQL via JDBC*

* Adapters are pluggable, and can be extended via Java API

* Get On GitHub

Ready To Use Algorithms

Data Cooker Operations are like SQL stored procedures or UDFs, except they're written against the low-level Spark RDD API. They execute unbelievably fast compared to their SQL counterparts.

Date and time, geohashing, data series calculation, population statistics, geofencing, track data analysis — 22 Operations out of the box.

Fork Me On GitHub

Data Set Transformations

Support for pluggable Transforms (which, like Operations, are written in Java), combined with object-oriented SELECT capabilities, lets you flexibly and easily convert any supported data format into another.

21 Transforms out of the box!

Fork Me On GitHub

Extensible Object Model

...we had two dozen operations, the same number of transforms, a bunch of storage adapters, and a^W Oops, this is the wrong genre.

What we really want to convey: if the out-of-the-box set isn't enough, you may implement your own! The code is open and the extension API is simple. Docs for object model extensions are even generated automagically.

Fork Me On GitHub

A Multitude of Execution Modes

Batch Local, Batch On-Cluster, Local Interactive, Server On-Cluster, Interactive Console Client... each with additional options.

Simply put, a single FatJAR includes everything necessary for both testing and production environments. And thanks to the simple REST protocol between Client and Server, you can easily integrate Data Cooker ETL into any browser-based dashboard or notebook your data analysts prefer.
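For example, a Batch On-Cluster run could be a single spark-submit of the FatJAR (the jar name, script name, and invocation shape below are illustrative assumptions, not the documented CLI):

```shell
# Illustrative sketch only: jar and script names are assumptions;
# consult the project docs for the real invocation
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  datacooker-etl.jar \
  my_process.tdl
```

The same jar then serves for local batch runs, the interactive REPL, and the on-cluster server.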


Easy ETL Manageability

  • Variables are supported in all contexts of the language
  • Control flow with loops and branching
  • Parameter evaluation at run time
  • Smart data set partitioning at run time
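As a purely hypothetical sketch of what these features enable (again, not the documented syntax), run-time parameters and branching could look something like:

```sql
-- Hypothetical sketch, not actual Data Cooker syntax
LET $period = '2024-06';

IF $period IS NOT NULL THEN
  SELECT * FROM events WHERE month = $period INTO report;
ELSE
  SELECT * FROM events INTO report;
END IF;
```

Variables set at submission time thus steer the whole process without editing the script itself.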


Make a Contact

You may implement everything yourself, but you could also subscribe

We're madly in love with Open Source and won't mind if you just fork the code and never call us back.

But we also have six years of real-world production experience from a serious geoinformatics analysis project that ran Data Cooker ETL in the Amazon cloud. We're happy to share the expertise we accumulated over hundreds of ETL processes executed many thousands of times, for a reasonable fee.

Connect with us