How to Write a Transformation Pipeline Script
This article provides an overview of how to construct a data transformation script, leveraging a JSON-like syntax akin to MongoDB's aggregation framework. In Dataddo, transformations are defined through a series of stages within a JSON array, with each stage representing a specific data operation.
To create an effective transformation script, you can follow these structured steps:
Step 1: Understand Your Data
Step 2: Define Your Objectives and Choose Your Operators
Step 3: Create Transformation Stages
Step 4: Assemble the Pipeline
Step 5: Final Script
For practical applications and specific use cases, refer to the following sections:
- Specific Transformation Pipeline Example with an Explanation
- Transformation Pipeline Practical Examples
Step 1: Understand Your Data
Before choosing operators or defining stages, you must understand the structure and content of your data. Let's assume your dataset looks something like this:
{
"field1": [{"subfield1": "value1"}, {"subfield1": "value2"}], // An array of objects
"field2": "some text", // A simple string
"column1": "123", // A numeric string
"column2": 456 // A numeric value
}
Step 2: Define Your Objectives and Choose Your Operators
Decide what you want the transformed data to look like. In our example, you might want to:
| Objective | Operator |
|---|---|
| Flatten field1 so each subfield1 value becomes part of a separate document. | Flatten an array: Use $unwind to deconstruct field1. |
| Transform field2 into all uppercase letters. | Transform a string: Use $toUpper to convert field2 to uppercase. |
| Convert column1 to an integer type. | Type casting: Use $toInt to change column1 from a string to an integer. |
| Create a new field that is the sum of column1 and column2. | Calculate a sum: Use $add to compute the sum of column1 and column2. |
For an in-depth explanation of these operators and their functions, refer to the list of transformation pipeline expression operators.
Step 3: Create Transformation Stages
Create your transformation stages. In this example, we will have four stages:
- Unwind Stage: Deconstructs an array field from the input documents to output a document for each element.
- Projection Stage: Used to include, exclude, or add new fields.
- Add Fields Stage (Renaming): Adds new fields or renames existing fields.
- Second Projection Stage (Cleanup): Removes fields that we no longer need in the output.
For an in-depth explanation of these stages and their functions, refer to the list of transformation pipeline stages.
1. Unwind Stage
$unwind is often used to deconstruct an array field from the input documents, outputting one document for each element. If field1 is an array, each of its elements becomes a separate document.
{
"$unwind": {
"path": "$field1",
"preserveNullAndEmptyArrays": true
}
}
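To make this concrete, applying this stage to the sample document from Step 1 would split it into two documents, one per element of field1 (a sketch of the intermediate result):
{ "field1": { "subfield1": "value1" }, "field2": "some text", "column1": "123", "column2": 456 }
{ "field1": { "subfield1": "value2" }, "field2": "some text", "column1": "123", "column2": 456 }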
2. Projection Stage
The $project stage is used to include, exclude, or add new fields. Here we are also transforming field2 to uppercase and casting column1 to an integer.
In $project, you specify key-value pairs, where the key becomes the label of your field and the value is the path or expression that supplies the field's value.
{
"$project": {
"newField1": "$field1.subfield1",
"transformedField2": {
"$toUpper": "$field2"
},
"newColumn1": {
"$toInt": "$column1"
},
"sumColumn": {
"$add": [
{ "$toInt": "$column1" },
"$column2"
]
},
"_id": 0
}
}
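Applied to the first of those unwound documents, this projection would produce roughly the following; sumColumn is 123 + 456 = 579 because column1 is cast to an integer before the addition:
{
"newField1": "value1",
"transformedField2": "SOME TEXT",
"newColumn1": 123,
"sumColumn": 579
}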
3. Add Fields Stage (Renaming)
The $addFields stage adds new fields or renames existing ones. In this example, it's used to rename newField1 to renamedField1 and sumColumn to totalSum. Because $addFields keeps the original fields, the renamed copies exist alongside newField1 and sumColumn until the next stage removes them.
{
"$addFields": {
"renamedField1": "$newField1",
"totalSum": "$sumColumn"
}
}
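Since $addFields only adds fields, the document now carries both the originals and the renamed copies, for example:
{
"newField1": "value1",
"transformedField2": "SOME TEXT",
"newColumn1": 123,
"sumColumn": 579,
"renamedField1": "value1",
"totalSum": 579
}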
4. Second Projection Stage (Cleanup)
The second $project stage removes the fields we no longer need in the output by setting them to 0, which excludes them.
{
"$project": {
"newField1": 0,
"sumColumn": 0
}
}
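After this cleanup, the output for the first sample document would look like this:
{
"transformedField2": "SOME TEXT",
"newColumn1": 123,
"renamedField1": "value1",
"totalSum": 579
}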
Step 4: Assemble the Pipeline
Combine all the stages into one array, in other words, enclose them in square brackets [ ], to form your pipeline.
[
{ /* Unwind Stage */ },
{ /* Projection Stage */ },
{ /* Add Fields Stage */ },
{ /* Second Projection Stage */ }
]
Step 5: Final Script
Recapitulation: In this example, we perform some basic operations on the fields field1, field2, column1, and column2. The basic operations are:
- Unwinding an array field
- Projecting certain fields
- Renaming fields
- Casting a string field to an integer
- Adding a new field with a computed value
[
// Stage 1: Unwind an array to normalize nested data
{
"$unwind": {
"path": "$field1", // Assumes field1 is an array
"preserveNullAndEmptyArrays": true // Optional parameter to keep items that are null or an empty array
}
},
// Stage 2: Project (select or transform) specific fields
{
"$project": {
"newField1": "$field1",
"transformedField2": {
"$toUpper": "$field2" // Example of transforming a string to uppercase
},
"newColumn1": {
"$toInt": "$column1" // Casting string to integer
},
"sumColumn": {
"$add": ["$column1", "$column2"] // Adding two fields together
},
"_id": 0 // Exclude the _id field from the output
}
},
// Stage 3: Rename fields for clarity
{
"$addFields": {
"renamedField1": "$newField1",
"totalSum": "$sumColumn"
}
},
// Stage 4: Remove fields no longer needed
{
"$project": {
"newField1": 0,
"sumColumn": 0
}
}
// Additional stages can be added as necessary...
]
Specific Transformation Pipeline Example with an Explanation
- API Endpoint URL:
https://data.nasa.gov/resource/tfkf-kniw.json
- Authorization: Not required.
- HTTP Method:
GET
- HTTP Headers and Body: Not required.
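The exact input shape depends on how the source wraps the response, but the script below assumes each incoming document carries an automatically added _id plus a data array of event records, roughly like this (illustrative values only):
{
"_id": "abc123", // added automatically, removed by $unset below
"data": [
{
"event_id": "EVT-001", // illustrative value
"event_date": "2023-01-15", // illustrative value
"source_link": { "url": "https://example.com/article" } // illustrative value
}
]
}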
Transformation script:
[
{
"$unset": "_id"
},
{
"$unwind": "$data"
},
{
"$project": {
"source_link_url": {
"$ifNull": [
"$data.source_link.url",
""
]
},
"event_date": {
"$dateToString": {
"date": {
"$toDate": {
"$ifNull": [
"$data.event_date",
"1970-01-01"
]
}
},
"format": "%Y-%m-%d %H:%M:%S"
}
},
"event_id": {
"$ifNull": [
"$data.event_id",
""
]
}
}
}
]
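For an input shaped like the sketch above, each element of data comes out as one flat row, for example:
{
"source_link_url": "https://example.com/article", // "" when source_link.url is missing
"event_date": "2023-01-15 00:00:00", // "1970-01-01 00:00:00" when event_date is missing
"event_id": "EVT-001" // "" when event_id is missing
}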
$unset:
- Purpose: Removes the _id field from all documents.
- Reason: The _id field is automatically added but may be removed for anonymization or if it's not needed for further processing.
$unwind:
- Purpose: Deconstructs an array field named data from the input documents, transforming each element of the array into a separate document.
- Reason: Useful for operations where array elements need to be processed as individual documents.
$project:
- Purpose: Reshapes each document by including the following new fields:
  - source_link_url: Sets the value to data.source_link.url, or "" if it's null or doesn't exist.
  - event_date: Converts data.event_date to a formatted date string, or to "1970-01-01 00:00:00" if it's null or doesn't exist.
  - event_id: Sets the value to data.event_id, or "" if it's null or doesn't exist.
- Reason: Ensures that the specified fields are present in a consistent format, providing defaults where necessary. This standardization can simplify future data handling tasks.
Non-Objects and Non-Array Fields
For fields that are neither objects nor arrays, the script can look like this:
{ "$project": {
"country_name": {
"$ifNull": [
"$data.country_name",
""
]
}
}
}
However, when used in this example, you will get the following error:
cannot append value to column ID country_name: unhandled data type of slice. Please modify the transformation to parse this, column can't contain non-scalar data (objects, arrays)
Nested Objects or Array Fields
In the case of nested objects or array fields, the script needs to be modified. In this example, the field source_link is actually an object that contains the field url. As such, we need to flatten it.
{ "$project": {
"source_link_url": {
"$ifNull": [
"$data.source_link.url",
""
]
}
}
}
For this script to work, make sure you first unwind the data using:
{
"$unwind": "$data"
},
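Putting the two pieces together, a minimal pipeline for this field combines the unwind with the projection:
[
{
"$unwind": "$data"
},
{
"$project": {
"source_link_url": {
"$ifNull": [
"$data.source_link.url",
""
]
}
}
}
]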