The INDIGO PaaS Orchestrator will be extended to support EGI Check-in as an identity provider for the PaaS services that require the creation of a client. The Orchestrator has already been modified to accept different providers, but a plugin is still needed to support the entire workflow for EGI Check-in.
The INDIGO PaaS Orchestrator will also be enhanced to support a monitoring and metering system.
● Monitoring and Ranking system with integrated AI
In the default configuration, the PaaS Orchestrator determines the provider to which the deployment creation request is submitted based on an ordered list of providers, selected according to the group the user belongs to. This list is provided by the Cloud Provider Ranker service, which applies a ranking algorithm using a limited set of metrics related to deployments and to the Service Level Agreements defined for the providers. The INDIGO PaaS Orchestrator submits the deployment request to the first provider in the list and, in case of failure, moves to the next provider until the list is exhausted.
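This failover strategy can be sketched in a few lines of Python. The function and provider names below are hypothetical stand-ins, not the Orchestrator's actual interface:

```python
# Sketch of the default failover strategy: try each provider in ranked
# order and stop at the first successful submission.

def submit_with_failover(ranked_providers, submit):
    """Try providers in order; return (provider, result) on first success.

    `submit` is a callable that raises an exception on failure
    (a hypothetical stand-in for the real submission interface).
    """
    errors = {}
    for provider in ranked_providers:
        try:
            return provider, submit(provider)
        except Exception as exc:  # this provider failed; record and continue
            errors[provider] = exc
    raise RuntimeError(f"all providers failed: {errors}")

# Example: the first provider fails, the second succeeds.
def fake_submit(provider):
    if provider == "provider-A":
        raise RuntimeError("quota exceeded")
    return f"deployment created on {provider}"

chosen, result = submit_with_failover(["provider-A", "provider-B"], fake_submit)
# chosen == "provider-B"
```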
The new AI-ranker service aims to improve the current ranking system and optimize resource usage by adopting Artificial Intelligence (AI) techniques. In particular, we want the AI-ranker, using an appropriate set of metrics and AI algorithms, to provide the Orchestrator with a ranked list of providers that minimizes deployment errors and the time required to create a deployment.
In this context, significant preparatory work was carried out to identify the most relevant metrics, as well as the sources from which these metrics can be obtained. The main sources we identified are: Orchestrator logs, Orchestrator Dashboard DB, and monitoring service.
Some of the metrics we identified are: user resource demands, user group resource quotas, and the current resource usage by user group for the allowed providers (in terms of CPU, RAM, volumes, and floating IPs).
The subsequent preparation of the dataset allowed us to study the use case and to identify and compare various AI techniques. The proposed approach involves the creation of two models: a classification model for deployment success/failure and a regression model for the deployment creation time. Combining the outputs of the two models yields the ordered list of providers that the Orchestrator can use for submitting deployments. Currently, the models that yielded the best scores are the RandomForest Classifier and Regressor. Moreover, we computed the feature importance scores, which allowed us to remove some of the least informative features.
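The combination of the two model outputs can be illustrated with a small sketch. The scoring rule below (sort by success probability first, then by predicted time) is an assumption for illustration, not the project's actual formula:

```python
# Hedged sketch: combine a success probability (from the classifier) and a
# predicted creation time in seconds (from the regressor) into one ranked
# provider list. Provider names and the scoring rule are illustrative.

def rank_providers(predictions):
    """predictions: {provider: (p_success, predicted_time_s)} -> ranked list.

    Sort by descending success probability, breaking ties with
    ascending predicted creation time.
    """
    return sorted(predictions,
                  key=lambda p: (-predictions[p][0], predictions[p][1]))

preds = {
    "provider-A": (0.95, 120.0),
    "provider-B": (0.95, 80.0),
    "provider-C": (0.60, 30.0),
}
ranking = rank_providers(preds)
# ranking == ["provider-B", "provider-A", "provider-C"]
```

In a real pipeline the two tuples per provider would come from the trained classifier's `predict_proba` output and the regressor's prediction; the ranking step itself stays this simple.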
We then started to work on the automation of metrics collection, using Kafka as the main technology. To reduce the number of sources to contact, we moved the useful information stored in the Orchestrator Dashboard DB directly into the Orchestrator logs. This work will allow us to build a dataset for the AI workflows simply by contacting Kafka.
We decided to use MLflow[1], which is also used by itwinai (the AI component from WP6), to store Machine Learning (ML) models and as the core technology to perform training and inference. We are currently developing the following components, which make up the AI-ranker:
- AI-ranker-registry: an ML model registry that allows trained ML models and related metrics to be stored, compared, and annotated with metadata.
- AI-ranker-training: an ML training script that is triggered whenever a new message arrives on the Kafka training queue. When training completes, AI-ranker-training saves the trained ML model into the AI-ranker-registry.
- AI-ranker-inference: an ML inference script that is triggered whenever a new message arrives on the Kafka inference queue. The ML model used for inference is taken from the AI-ranker-registry.
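The message-driven flow across the three components can be sketched as follows. Everything here is a hypothetical stand-in: the queues replace Kafka topics, the dictionary replaces the MLflow-backed registry, and the "model" is a trivial per-provider mean:

```python
# Illustrative sketch of the queue-triggered flow: a message on the training
# queue triggers training and stores the model in the registry; a message on
# the inference queue loads the latest model and returns a provider ranking.

import queue

registry = {}                    # stand-in for the AI-ranker-registry
training_queue = queue.Queue()   # stand-in for the Kafka training topic
inference_queue = queue.Queue()  # stand-in for the Kafka inference topic

def handle_training(msg):
    # "Train" a trivial model: the mean creation time per provider.
    model = {p: sum(ts) / len(ts) for p, ts in msg["samples"].items()}
    registry["latest"] = model   # AI-ranker-training -> AI-ranker-registry

def handle_inference(msg):
    model = registry["latest"]   # AI-ranker-inference <- AI-ranker-registry
    # Rank candidate providers by predicted creation time (ascending).
    return sorted(msg["candidates"], key=lambda p: model.get(p, float("inf")))

training_queue.put({"samples": {"A": [100, 140], "B": [60, 80]}})
handle_training(training_queue.get())

inference_queue.put({"candidates": ["A", "B"]})
ranking = handle_inference(inference_queue.get())
# ranking == ["B", "A"]
```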
We plan to release the entire AI-ranker service by July 2025.
● Integration with InterLink offload approach
The PaaS layer will be enhanced to allow the deployment of Kubernetes-based services with an embedded Virtual Kubelet setup, so that PaaS-deployed Kubernetes services are transparently configured to offload workloads via interLink.
● Data Streamer
Since the workflow manager used by the PaaS Orchestrator is quite complex, we are evaluating the use of Kafka as a data streamer between the multiple components that will be involved in the new architecture. The main advantages of this approach are the high availability and robustness of the service. Moreover, we expect to integrate new services easily by extending the number of topics.
The first use cases for this tool foresee:
- The collection of the results of the tests executed by rally probes on the providers for monitoring purposes
- The collection of the PaaS-Orchestrator deployments logs to build the training set that will be used by the new AI-Ranker service.
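The two use cases map naturally onto separate topics. A minimal in-memory stand-in (no real broker; topic names and message fields are hypothetical) shows why adding a service reduces to adding a topic:

```python
# In-memory stand-in for the planned topic layout. A real deployment would
# use a Kafka producer/consumer; here a dict of lists plays the broker.
from collections import defaultdict

topics = defaultdict(list)

def publish(topic, message):
    topics[topic].append(message)

# Use case 1: rally probe results collected for monitoring.
publish("monitoring.rally-results",
        {"provider": "provider-A", "passed": True})

# Use case 2: Orchestrator deployment logs feeding the AI-Ranker training set.
publish("orchestrator.deployment-logs",
        {"uuid": "dep-1", "status": "CREATE_COMPLETE"})
```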
We plan to release a stable version of this service by July 2025.