The goal is to simplify the integration and scaling of big data and AI workflows onto the hybrid cloud, the company said.
IBM Wednesday announced CodeFlare, an open-source, serverless framework designed to simplify the integration and efficient scaling of big data and AI workflows onto the hybrid cloud. CodeFlare is built on top of an emerging open-source distributed computing framework for machine learning applications known as Ray.
IBM said CodeFlare extends the capabilities of Ray by adding specific elements to make scaling workflows easier.
With data and machine learning analytics proliferating into just about every industry, tasks are becoming far more complex, IBM noted. While it is important to have larger datasets and more systems designed for AI research, as these workflows become more involved, researchers are spending more time configuring their setups than getting data science done.
To create a machine learning model today, researchers and developers first have to train and optimize it, IBM said. This might involve data cleaning, feature extraction and model optimization. CodeFlare aims to simplify this process using a Python-based interface, which IBM refers to as a pipeline, that makes it easier to integrate, parallelize and share data.
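As a rough illustration of the three steps named above, the sketch below chains data cleaning, feature extraction and model optimization in plain Python and runs the candidate settings in parallel. It uses only the standard library and is not CodeFlare's or Ray's API; the toy data and every function name here are assumptions made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical toy records: (raw feature, label); None marks a missing value.
RAW = [(1.0, 2.1), (2.0, 3.9), (None, 5.0), (3.0, 6.2), (4.0, 7.8)]


def clean(rows):
    """Data cleaning: drop rows with a missing feature."""
    return [(x, y) for x, y in rows if x is not None]


def extract_features(rows, power):
    """Feature extraction: raise the raw feature to a given power."""
    return [(x ** power, y) for x, y in rows]


def fit_and_score(rows):
    """Fit y = slope * x by least squares; return the mean squared error."""
    slope = sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)
    return mean((y - slope * x) ** 2 for x, y in rows)


def run_pipeline(power):
    """One end-to-end pipeline run for a single candidate setting."""
    return power, fit_and_score(extract_features(clean(RAW), power))


def best_setting(powers):
    """Model optimization: run candidates in parallel, keep the lowest error."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_pipeline, powers))
    return min(results, key=lambda result: result[1])


if __name__ == "__main__":
    print(best_setting([1, 2, 3]))
```

Each call to run_pipeline is independent of the others, which is what lets the candidates be farmed out to parallel workers. A distributed framework would presumably spread the same kind of work across machines rather than threads.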
The company said the goal of its new framework is to unify pipeline workflows across multiple platforms without requiring data scientists to learn a new workflow language.
CodeFlare pipelines run on IBM’s new serverless platform, IBM Cloud Code Engine, and on Red Hat OpenShift. This allows users to deploy CodeFlare almost anywhere, extending the benefits of serverless to data scientists and AI researchers, IBM said.
This also makes it easier to integrate and bridge with other cloud-native ecosystems, the company said, by providing adapters to event triggers, such as the arrival of a new file, and by loading and partitioning data from a wide range of sources, such as cloud object storage, data lakes and distributed file systems.
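To make the two ideas in that description concrete, the sketch below shows a file-arrival trigger, implemented as a simple directory check, and a helper that partitions loaded records into near-equal chunks. It is a standard-library illustration only; new_files and partition are hypothetical names, not CodeFlare adapters, and a local directory stands in for object storage or a data lake.

```python
import os
import tempfile


def new_files(directory, seen):
    """Return file names in `directory` not yet in `seen`, and record them."""
    arrived = sorted(set(os.listdir(directory)) - seen)
    seen.update(arrived)
    return arrived


def partition(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as inbox:
        seen = set()
        # Simulate the arrival of a new file with seven records.
        with open(os.path.join(inbox, "batch.csv"), "w") as handle:
            handle.write("\n".join(str(n) for n in range(7)))
        for name in new_files(inbox, seen):
            with open(os.path.join(inbox, name)) as handle:
                records = handle.read().splitlines()
            for chunk in partition(records, 3):
                print(name, chunk)
```

Checking a directory is the simplest stand-in for an event trigger. A production system would more likely subscribe to notifications from the storage service than scan for changes.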
CodeFlare “goes beyond isolated tasks to seamlessly integrate and scale end-to-end pipelines with a data-scientist-friendly interface–like Python–instead of using containers,” said Priya Nagpurkar, director, hybrid cloud platform at IBM Research. “CodeFlare can provide a simpler way to integrate and scale full pipelines, while offering a unified runtime and programming interface.”
The framework augments the functionality of distributed computing and ML libraries like Dask and scikit-learn, among others, with a distributed implementation of workflows based on Python, according to Nagpurkar.
Potential use cases
CodeFlare has the potential to address the emergence of converged workflows, where AI, data analytics and modeling are woven together to provide much faster time-to-value than traditional approaches, Nagpurkar said.
“These workflows are emerging in a wide range of enterprise domains. For example, drug discovery, where these complex pipelines are applied to adjust treatment protocols, manufacturing and supply chain optimization, where process modeling and simulation are coupled together to perform significantly better than existing heuristics,” she said.
Another potential use case is in semiconductor design, “where complex ML pipelines are used in the identification of chip defects without slowing down production,” Nagpurkar said.
Benefits for developers
CodeFlare could mean developers won’t have to duplicate their efforts or struggle to figure out what colleagues have done in the past to get a certain pipeline to run, IBM said. “With CodeFlare, we aim to give data scientists richer tools and APIs that they can use with more consistency, allowing them to focus more on their actual research than the configuration and deployment complexity,” IBM said.
The company said it expects the framework to save developers significant time and effort in creating pipelines deployed to the hybrid cloud.
Already, when one user applied the framework to analyze and optimize approximately 100,000 pipelines for training machine learning models, CodeFlare cut the time it took to execute each pipeline from four hours to 15 minutes, according to IBM.
Other users have seen CodeFlare “shave off months of developer time and allow them to tackle larger data problems than before,” the company said.
CodeFlare is being open-sourced, and IBM is providing a series of technical blog posts on how it works and what users need to start using it.