The INDIGO PaaS Orchestrator will be extended to support EGI Check-in as an identity provider for the PaaS services that require the creation of a client. The Orchestrator has already been modified to accept different providers, but a plugin is still needed to support the entire workflow for EGI Check-in.
The INDIGO PaaS Orchestrator will also be enhanced to support a monitoring and metering system.
● Monitoring and Ranking system with integrated AI
In the default configuration, the PaaS Orchestrator determines the provider to which the deployment creation request is submitted based on an ordered list of providers, selected according to the group the user belongs to. This list is provided by the Cloud Provider Ranker service, which applies a ranking algorithm using a limited set of metrics related to deployments and to the Service Level Agreements defined for the providers. The INDIGO PaaS Orchestrator submits the deployment request to the first provider in the list and, in case of failure, moves to the next provider until the list is exhausted.
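This failover strategy can be sketched in a few lines of Python. The function and provider names below are hypothetical stand-ins, not the Orchestrator's actual interface:

```python
# Sketch of the default failover strategy: try each provider in ranked
# order and stop at the first successful submission.

def submit_with_failover(ranked_providers, submit):
    """Try providers in order; return (provider, result) on first success.

    `submit` is a callable that raises an exception on failure
    (a hypothetical stand-in for the real submission interface).
    """
    errors = {}
    for provider in ranked_providers:
        try:
            return provider, submit(provider)
        except Exception as exc:  # this provider failed; record and continue
            errors[provider] = exc
    raise RuntimeError(f"all providers failed: {errors}")

# Example: the first provider fails, the second succeeds.
def fake_submit(provider):
    if provider == "provider-A":
        raise RuntimeError("quota exceeded")
    return f"deployment created on {provider}"

chosen, result = submit_with_failover(["provider-A", "provider-B"], fake_submit)
# chosen == "provider-B"
```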
The new AI-ranker service aims to improve the current ranking system and optimize resource usage by adopting Artificial Intelligence (AI) techniques. In particular, we want the AI-ranker, using an appropriate set of metrics and AI algorithms, to provide the Orchestrator with a ranked list of providers that minimizes deployment errors and the time required to create a deployment.
In this context, significant preparatory work was carried out to identify the most relevant metrics, as well as the sources from which these metrics can be obtained. The main sources we identified are: Orchestrator logs, Orchestrator Dashboard DB, and monitoring service.
Some of the metrics we identified are: user resource demands, user group resource quotas, and the current resource usage by user group for the allowed providers (in terms of CPU, RAM, volumes, and floating IPs).
The subsequent preparation of the dataset allowed us to study the use case and to identify and compare various AI techniques. The proposed approach involves the creation of two models: a classification model for deployment success/failure and a regression model for the deployment creation time. Combining the outputs of the two models yields the ordered list of providers that the Orchestrator can use for submitting deployments. Currently, the models that yielded the best scores are the RandomForest Classifier and Regressor. Moreover, we computed the feature importance scores, which allowed us to remove some of the least informative features.
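The combination of the two model outputs can be illustrated with a small sketch. The scoring rule below (sort by success probability first, then by predicted time) is an assumption for illustration, not the project's actual formula:

```python
# Hedged sketch: combine a success probability (from the classifier) and a
# predicted creation time in seconds (from the regressor) into one ranked
# provider list. Provider names and the scoring rule are illustrative.

def rank_providers(predictions):
    """predictions: {provider: (p_success, predicted_time_s)} -> ranked list.

    Sort by descending success probability, breaking ties with
    ascending predicted creation time.
    """
    return sorted(predictions,
                  key=lambda p: (-predictions[p][0], predictions[p][1]))

preds = {
    "provider-A": (0.95, 120.0),
    "provider-B": (0.95, 80.0),
    "provider-C": (0.60, 30.0),
}
ranking = rank_providers(preds)
# ranking == ["provider-B", "provider-A", "provider-C"]
```

In a real pipeline the two tuples per provider would come from the trained classifier's `predict_proba` output and the regressor's prediction; the ranking step itself stays this simple.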
We then started to work on the automation of metrics collection, using Kafka as the main technology. To reduce the number of sources to contact, we moved the useful information stored in the Orchestrator Dashboard DB directly into the Orchestrator logs. This work will allow us to build a dataset for the AI workflows simply by contacting Kafka.
We decided to use MLflow[1], which is also used by itwinai (the AI component from WP6), to store Machine Learning (ML) models and as the core technology to perform training and inference. We are currently developing the following components, which make up the AI-ranker:
- AI-ranker-registry: an ML model registry that allows trained ML models and related metrics to be stored, compared, and annotated with metadata.
- AI-ranker-training: an ML training script that is triggered whenever a new message arrives on the Kafka training queue. When training completes, AI-ranker-training saves the trained ML model into the AI-ranker-registry.
- AI-ranker-inference: an ML inference script that is triggered whenever a new message arrives on the Kafka inference queue. The ML model used for inference is taken from the AI-ranker-registry.
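The message-driven flow across the three components can be sketched as follows. Everything here is a hypothetical stand-in: the queues replace Kafka topics, the dictionary replaces the MLflow-backed registry, and the "model" is a trivial per-provider mean:

```python
# Illustrative sketch of the queue-triggered flow: a message on the training
# queue triggers training and stores the model in the registry; a message on
# the inference queue loads the latest model and returns a provider ranking.

import queue

registry = {}                    # stand-in for the AI-ranker-registry
training_queue = queue.Queue()   # stand-in for the Kafka training topic
inference_queue = queue.Queue()  # stand-in for the Kafka inference topic

def handle_training(msg):
    # "Train" a trivial model: the mean creation time per provider.
    model = {p: sum(ts) / len(ts) for p, ts in msg["samples"].items()}
    registry["latest"] = model   # AI-ranker-training -> AI-ranker-registry

def handle_inference(msg):
    model = registry["latest"]   # AI-ranker-inference <- AI-ranker-registry
    # Rank candidate providers by predicted creation time (ascending).
    return sorted(msg["candidates"], key=lambda p: model.get(p, float("inf")))

training_queue.put({"samples": {"A": [100, 140], "B": [60, 80]}})
handle_training(training_queue.get())

inference_queue.put({"candidates": ["A", "B"]})
ranking = handle_inference(inference_queue.get())
# ranking == ["B", "A"]
```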
We plan to release the entire AI-ranker service by July 2025.
● Integration with InterLink offload approach
The PaaS layer will be enhanced to allow the deployment of Kubernetes-based services with an embedded Virtual Kubelet setup, so that PaaS-deployed Kubernetes services are transparently configured to offload workloads via interLink.
● Data Streamer
Since the workflow manager used by the PaaS Orchestrator is quite complex, we are evaluating the use of Kafka as a data streamer between the multiple components that will be involved in the new architecture. The main advantages of this approach are the high availability and robustness of the service. Moreover, we expect to integrate new services easily by extending the number of topics.
The first use cases for this tool foresee:
- The collection of the results of the tests executed by rally probes on the providers for monitoring purposes
- The collection of the PaaS-Orchestrator deployments logs to build the training set that will be used by the new AI-Ranker service.
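The two use cases map naturally onto separate topics. A minimal in-memory stand-in (no real broker; topic names and message fields are hypothetical) shows why adding a service reduces to adding a topic:

```python
# In-memory stand-in for the planned topic layout. A real deployment would
# use a Kafka producer/consumer; here a dict of lists plays the broker.
from collections import defaultdict

topics = defaultdict(list)

def publish(topic, message):
    topics[topic].append(message)

# Use case 1: rally probe results collected for monitoring.
publish("monitoring.rally-results",
        {"provider": "provider-A", "passed": True})

# Use case 2: Orchestrator deployment logs feeding the AI-Ranker training set.
publish("orchestrator.deployment-logs",
        {"uuid": "dep-1", "status": "CREATE_COMPLETE"})
```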
We plan to release a stable version of this service by July 2025.