Transformation Pipeline
The Dataddo transformation pipeline is a framework for data transformation modeled on the concept of data processing pipelines. API responses from 3rd party services enter a multi-stage pipeline that transforms the response into a dataset ingestible by the Dataddo platform.

Using Universal Connectors in Dataddo, you can pull data from a wide range of sources or fine-tune the transformation of your existing ones.

Examples

The shape of your transformation pipeline depends on the transformation you want to perform. Here are some examples:

Example 1

  1. The URL of the API endpoint is https://data.nasa.gov/resource/tfkf-kniw.json.
  2. As this data is freely accessible, no authorization is needed.
  3. The HTTP Method needs to be set to GET as we want to fetch data.
  4. The HTTP Headers and Body can be left empty.

In this example, we fetch landslide data from NASA and set the Transformation Pipeline so that only the fields relevant to us are extracted.

As for the Transformation Pipeline, let's say we want to collect country name, event date, event title, fatality count, injury count, landslide size, location description, and source name. Then the script should look something like this:

[
    {
        "$unwind": "$data"
    },
    {
        "$project": {
            "country_name": {
                "$ifNull": [
                    "$data.country_name",
                    ""
                ]
            },
            "event_date": {
                "$ifNull": [
                    "$data.event_date",
                    ""
                ]
            },
            "event_title": {
                "$ifNull": [
                    "$data.event_title",
                    ""
                ]
            },
            "fatality_count": {
                "$ifNull": [
                    "$data.fatality_count",
                    0
                ]
            },
            "injury_count": {
                "$ifNull": [
                    "$data.injury_count",
                    0
                ]
            },
            "landslide_size": {
                "$ifNull": [
                    "$data.landslide_size",
                    ""
                ]
            },
            "location_description": {
                "$ifNull": [
                    "$data.location_description",
                    ""
                ]
            },
            "source_name": {
                "$ifNull": [
                    "$data.source_name",
                    ""
                ]
            }
        }
    }
]

Example 2

This example uses the same URL as Example 1, but this time we fetch administrative division and country data from NASA, setting the Transformation Pipeline as follows:

[
    {
        "$unwind": "$data"
    },
    {
        "$project": {
            "admin_division_name": {
                "$ifNull": [
                    "$data.admin_division_name",
                    ""
                ]
            },
            "admin_division_population": {
                "$ifNull": [
                    "$data.admin_division_population",
                    ""
                ]
            },
            "country_code": {
                "$ifNull": [
                    "$data.country_code",
                    ""
                ]
            },
            "country_name": {
                "$ifNull": [
                    "$data.country_name",
                    ""
                ]
            },
            "created_date": {
                "$ifNull": [
                    "$data.created_date",
                    ""
                ]
            }
        }
    }
]

Example 3

This transformation aims to fetch employees' first names, absence entitlements, and cost centers from Personio, using the official API endpoint:

https://api.personio.de/v1/company/employees?limit=200&offset=0

Unlike the previous examples, this endpoint requires account authorization before data can be pulled.

[
    {
        "$unwind": "$data"
    },
    {
        "$unwind": {
            "path": "$data.attributes.absence_entitlement.value",
            "preserveNullAndEmptyArrays": true
        }
    },
    {
        "$unwind": {
            "path": "$data.attributes.cost_centers.value",
            "preserveNullAndEmptyArrays": true
        }
    },
    {
        "$unset": "_id"
    },
    {
        "$project": {
            "email": {
                "$ifNull": [
                    "$data.attributes.email.value",
                    ""
                ]
            },
            "first_name": {
                "$ifNull": [
                    "$data.attributes.first_name.value",
                    ""
                ]
            },
            "absence_entitlemen_category": {
                "$ifNull": [
                    "$data.attributes.absence_entitlement.value.attributes.category",
                    ""
                ]
            },
            "absence_entitlement_entitlement": {
                "$ifNull": [
                    "$data.attributes.absence_entitlement.value.attributes.entitlement",
                    0
                ]
            },
            "cost_centers_id": {
                "$ifNull": [
                    "$data.attributes.cost_centers.value.attributes.id",
                    0
                ]
            },
            "cost_centers_name": {
                "$ifNull": [
                    "$data.attributes.cost_centers.value.attributes.name",
                    ""
                ]
            }
        }
    }
]

Stages

The Dataddo transformation pipeline consists of multiple stages. Each stage transforms the API response as it passes through the pipeline. Pipeline stages can appear multiple times in the pipeline.

| Stage | Description |
| --- | --- |
| $project | Reshapes each object in the stream, e.g. by adding new fields or removing existing ones. |
| $unwind | Deconstructs an array field from the API response into one object per element; each output object replaces the array with a single element value. For each input, outputs N objects, where N is the number of array elements (zero for an empty array). |
| $match | Filters the stream, allowing only matching objects to pass unmodified into the next pipeline stage. For each input object, outputs either one object (a match) or none (no match). |
| $group | Groups input objects by a specified identifier expression and applies the accumulator expression(s), if specified, to each group. Consumes all input objects and outputs one object per distinct group. The output objects contain only the identifier field and, if specified, the accumulated fields. |
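As an illustration of the $match and $group stages (reusing the field names from Example 1; the "large" filter value, the event_count output field, and the availability of the standard $sum accumulator are assumptions), large landslides could be counted per country like this:

```json
[
    {
        "$unwind": "$data"
    },
    {
        "$match": {
            "data.landslide_size": "large"
        }
    },
    {
        "$group": {
            "_id": "$data.country_name",
            "event_count": {
                "$sum": 1
            }
        }
    }
]
```

Note that $match filters on a plain field path, while $group references the grouping field as an expression with a leading $.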

Expressions

Some pipeline stages take a pipeline expression as the operand. Pipeline expressions specify the transformation to apply to the input objects. Expressions have a standard JSON object structure and can contain other expressions.

Pipeline expressions can only operate on the current object in the pipeline and cannot refer to data from other objects: expression operations provide in-memory transformation of objects.
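Because expressions are plain JSON objects, they can be nested inside one another. As a sketch using the field names from Example 1 (the casualty_count output field is hypothetical), an $add expression can wrap two $ifNull expressions so that missing values default to 0 before being summed:

```json
{
    "$project": {
        "casualty_count": {
            "$add": [
                { "$ifNull": ["$data.fatality_count", 0] },
                { "$ifNull": ["$data.injury_count", 0] }
            ]
        }
    }
}
```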

Expression operators

Arithmetic

Arithmetic expressions perform mathematical operations on numbers.

| Operator | Description |
| --- | --- |
| $abs | Returns the absolute value of a number. |
| $add | Adds numbers to return the sum. |
| $subtract | Returns the result of subtracting the second value from the first. |
| $ceil | Returns the smallest integer greater than or equal to the specified number. |
| $floor | Returns the largest integer less than or equal to the specified number. |
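As a sketch inside a $project stage (field names borrowed from Example 1; the casualty_difference output field is hypothetical), $subtract and $abs can be combined to compute the absolute difference between two metrics:

```json
{
    "$project": {
        "fatality_count": { "$ifNull": ["$data.fatality_count", 0] },
        "injury_count": { "$ifNull": ["$data.injury_count", 0] },
        "casualty_difference": {
            "$abs": {
                "$subtract": ["$data.fatality_count", "$data.injury_count"]
            }
        }
    }
}
```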

Boolean

Boolean expressions evaluate their argument expressions as booleans and return a boolean as a result.

| Operator | Description |
| --- | --- |
| $and | Returns true only when all its argument expressions evaluate to true. Accepts any number of argument expressions. |
| $not | Returns the boolean value that is the opposite of its argument expression. Accepts a single argument expression. |
| $or | Returns true when any of its argument expressions evaluates to true. Accepts any number of argument expressions. |
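For instance (field names from Example 1; the is_severe output field and the "large" threshold are assumptions for illustration), $and can combine two conditions into a single computed boolean field in a $project stage:

```json
{
    "$project": {
        "event_title": { "$ifNull": ["$data.event_title", ""] },
        "is_severe": {
            "$and": [
                { "$gt": ["$data.fatality_count", 0] },
                { "$eq": ["$data.landslide_size", "large"] }
            ]
        }
    }
}
```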

Comparison

| Operator | Description |
| --- | --- |
| $cmp | Returns 0 if the two values are equivalent, 1 if the first value is greater than the second, and -1 if the first value is less than the second. |
| $eq | Returns true if the values are equivalent. |
| $gt | Returns true if the first value is greater than the second. |
| $gte | Returns true if the first value is greater than or equal to the second. |
| $lt | Returns true if the first value is less than the second. |
| $lte | Returns true if the first value is less than or equal to the second. |
| $ne | Returns true if the values are not equivalent. |
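As a sketch (field names from Example 1; the has_casualties and injuries_vs_threshold output fields and the threshold value 10 are hypothetical), comparison operators can derive flag-style fields in a $project stage:

```json
{
    "$project": {
        "country_name": { "$ifNull": ["$data.country_name", ""] },
        "has_casualties": { "$gte": ["$data.fatality_count", 1] },
        "injuries_vs_threshold": { "$cmp": ["$data.injury_count", 10] }
    }
}
```

Here has_casualties is a boolean, while injuries_vs_threshold takes one of the three values -1, 0, or 1.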

Free consultation available!

Feel free to contact us. We are happy to help you with your transformation pipeline setup.

