The goal of the project is to enable a data science team to independently build and operate data pipelines that implement complex ETL/ELT processes with the Argo Workflows engine. Development initially takes place on a cloud infrastructure, and the experience gained there is then carried over into a cloud-native solution. PTA is responsible both for installing and operating an interim solution based on a single-node system and for the planned migration to a Google Cloud Platform (GCP) cloud service.
PTA supports the customer in installing all relevant system components (Docker, Minikube, Argo Workflows, MinIO) as a test system. Argo Workflows requires a Kubernetes distribution, so Docker and the single-node Kubernetes distribution 'Minikube' are installed on a dedicated virtual machine (VM). MinIO is also installed on the VM as a cloud-compatible object store, which allows data to be exchanged between the processing units (steps) of a workflow. Together with the customer, PTA also develops production data pipelines and ETL/ELT processes using Argo Workflows. After Argo Workflows has been tested successfully, PTA is responsible for migrating the workflow engine to Google Cloud using the Google Kubernetes Engine (GKE).
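The data exchange between steps described above can be sketched as an Argo Workflows manifest in which one step writes a file and the next step reads it. This is a minimal illustration, not the customer's actual pipeline: the template names, container images, and file path are hypothetical, and it assumes that the cluster's artifact repository has already been configured to point at the MinIO instance. The manifest is built as a plain Python dictionary and printed as JSON.

```python
import json


def build_artifact_workflow() -> dict:
    """Return a workflow manifest in which one step hands a file to the next."""
    # Step 1 (hypothetical): writes a small CSV file and declares it as an
    # output artifact, which Argo uploads to the configured artifact repository.
    produce = {
        "name": "produce",
        "container": {
            "image": "python:3.11-slim",  # assumed image, for illustration only
            "command": ["python", "-c"],
            "args": ["open('/tmp/data.csv', 'w').write('id,value\\n1,42\\n')"],
        },
        "outputs": {"artifacts": [{"name": "dataset", "path": "/tmp/data.csv"}]},
    }
    # Step 2 (hypothetical): receives the artifact at the given path and prints it.
    consume = {
        "name": "consume",
        "inputs": {"artifacts": [{"name": "dataset", "path": "/tmp/data.csv"}]},
        "container": {
            "image": "alpine:3.19",  # assumed image, for illustration only
            "command": ["cat"],
            "args": ["/tmp/data.csv"],
        },
    }
    # Entrypoint: run the two steps in sequence and wire the artifact through.
    pipeline = {
        "name": "pipeline",
        "steps": [
            [{"name": "produce", "template": "produce"}],
            [
                {
                    "name": "consume",
                    "template": "consume",
                    "arguments": {
                        "artifacts": [
                            {
                                "name": "dataset",
                                "from": "{{steps.produce.outputs.artifacts.dataset}}",
                            }
                        ]
                    },
                }
            ],
        ],
    }
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "artifact-exchange-"},
        "spec": {"entrypoint": "pipeline", "templates": [pipeline, produce, consume]},
    }


if __name__ == "__main__":
    print(json.dumps(build_artifact_workflow(), indent=2))
```

The printed JSON can be saved to a file and created on the cluster, for example with `kubectl create -f`. Because the steps only refer to the artifact by name, the same manifest should not need to change when the object store behind the artifact repository is replaced during a later migration.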
Requests from the business departments often require complex data pipelines that are difficult to implement with common ETL/ELT tools such as Oracle Data Integrator. Orchestrating data science procedures into complex workflows demands a degree of development flexibility that classic ETL/ELT tools rarely offer. For this reason, the customer has chosen 'Argo Workflows'. Argo Workflows is a container-native, open-source workflow engine for orchestrating parallel jobs on Kubernetes. Each step of a process flow (workflow) runs as its own container, so the steps of a single workflow can use different versions of libraries or different technologies. Multi-step workflows can be modeled either as a sequence of steps or, when the dependencies between steps need to be captured, as a directed acyclic graph (DAG).
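To make the DAG model concrete, the following sketch builds a small diamond-shaped workflow in which every task runs in its own container image. The task names, images, and `echo` commands are hypothetical placeholders chosen for illustration. The helper `execution_order` is not part of Argo; it only uses Python's standard `graphlib` module (Python 3.9 or later) to show one order in which the dependencies could be satisfied, and it raises an error if the graph contains a cycle.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical tasks: name -> (container image, names of tasks it depends on).
# Different images stand in for steps that need different library versions.
TASKS = {
    "extract": ("python:3.9-slim", []),
    "clean": ("python:3.11-slim", ["extract"]),
    "features": ("r-base:4.3.2", ["extract"]),
    "report": ("python:3.11-slim", ["clean", "features"]),
}


def build_dag_workflow(tasks: dict) -> dict:
    """Return an Argo Workflows manifest with one container template per task."""
    templates = []
    dag_tasks = []
    for name, (image, deps) in tasks.items():
        # One container template per task; the command is a placeholder.
        templates.append(
            {
                "name": name,
                "container": {
                    "image": image,
                    "command": ["echo"],
                    "args": [f"running {name}"],
                },
            }
        )
        task = {"name": name, "template": name}
        if deps:
            # A task starts only after all listed dependencies have completed.
            task["dependencies"] = list(deps)
        dag_tasks.append(task)
    # The entrypoint template holds the DAG itself.
    templates.append({"name": "main", "dag": {"tasks": dag_tasks}})
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "dag-example-"},
        "spec": {"entrypoint": "main", "templates": templates},
    }


def execution_order(tasks: dict) -> list:
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    graph = {name: deps for name, (_, deps) in tasks.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(TASKS))
    print(json.dumps(build_dag_workflow(TASKS), indent=2))
```

In this sketch `clean` and `features` both depend only on `extract`, so the engine is free to run them in parallel, while `report` waits for both.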