Distributed Machine Learning with SynapseML (MMLSpark)

Video Transcript:

How do you scale your machine learning solution over large amounts of data? Have you ever heard about SynapseML, previously known as Microsoft Machine Learning for Apache Spark? Then let's go!

Hello everyone, this is MG, and here I am with another video, in which I'm going to talk about a massively scalable machine learning library built on Apache Spark, named SynapseML. It enables you to train production-ready machine learning models at large scale, and it is very well integrated with the machine learning ecosystem and its libraries, namely MLflow. So let's check it out. Before we start, make sure you subscribe and hit the bell icon so you get notified about the next video. Thank you!

All right, let's start with what SynapseML is. It used to be named MMLSpark, which stands for Microsoft Machine Learning for Apache Spark, and now it's called SynapseML. First of all, SynapseML is a library; it doesn't mean you have to use the Azure Synapse service to be able to use it. If you don't know Azure Synapse, it's not only a data warehousing solution on Azure but also lets you run advanced analytics on top of your data warehouse within the Azure ecosystem; you can check it out, it's a totally separate service in Azure. Here we're talking about a library that you can use not only in Synapse but also in, say, Databricks, or in any Spark environment you have.

So the question is: why should I use the SynapseML library, what is it, and what are the benefits? SynapseML is a machine learning library that is massively scalable. You can have a cluster running Spark with hundreds of machines and use SynapseML on it to build production-ready models. And when I say models, I mean anything from simple classification or regression to image analysis, speech-to-text, translation, or anomaly detection; for almost any ML challenge you're facing, you can leverage SynapseML in a scalable manner over large amounts of data.

The next question: is this a new API I have to learn, where I need to understand the anatomy of SynapseML and how to code against it? Well, not really, because it is deeply and cleanly integrated with the existing Spark ML APIs. You know that with Spark ML you can already build distributed machine learning solutions, right? SynapseML provides additional capabilities on top of that (we're going to talk about them), and the team has even contributed back to the MLflow integration with Spark, which we'll also cover, specifically for the latest release at the time I'm recording this video, which is version 0.10. You can use a variety of languages with it: Python, Scala, Java, R, and now even .NET (C#), and so on; you can check out the list of all supported languages in the documentation. SynapseML is also integrated with a wide variety of ML technologies and libraries, for example LightGBM, or ONNX, which we talked about before; it's already integrated with that too.
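In code, getting started is mostly a matter of pointing Spark at the package, because SynapseML follows the standard Spark ML estimator and transformer APIs. Here is a minimal sketch, assuming a Spark 3.x cluster and DataFrames train_df and test_df that already have a vector-valued "features" column and a numeric "label" column (those names, and the package coordinates I took from the 0.10-era SynapseML docs, are assumptions, so adjust them to your setup):

    from pyspark.sql import SparkSession

    # Pull in SynapseML as a Spark package (coordinates per the 0.10-era docs;
    # newer releases may use different versions or repositories).
    spark = (
        SparkSession.builder.appName("synapseml-demo")
        .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.10.0")
        .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
        .getOrCreate()
    )

    from synapse.ml.lightgbm import LightGBMRegressor

    # LightGBMRegressor behaves like any Spark ML estimator: configure, fit, transform.
    model = (
        LightGBMRegressor(numIterations=100, learningRate=0.1)
        .setLabelCol("label")        # placeholder column name
        .setFeaturesCol("features")  # must be a Spark ML vector column
        .fit(train_df)
    )
    predictions = model.transform(test_df)

Because the estimator follows the Spark ML contract, it drops into existing pipelines and cross-validators without any special handling.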
Recently they also added a couple more capabilities that I'm going to talk about. First of all, OpenAI, which is also integrated with Azure: a variety of powerful large language models, such as transformer-based GPT models, are now integrated with SynapseML. That means I can leverage these language models from OpenAI in a scalable manner, over large amounts of data, through SynapseML.

Next, Cognitive Services. What are Cognitive Services? They are Azure-based services that let you add pre-trained models to your solution through an API. For example, you have some text and you want to do sentiment analysis: you don't need to train a model; you can just call an Azure Cognitive Services API, and it will return the sentiment analysis results to you. There are many Cognitive Services for text, images, translation, and so on, and all those pre-trained models on Azure are now integrated with SynapseML. That means you can scale the application of these pre-trained Cognitive Services models over large amounts of data using SynapseML. There are many demos and example notebooks in the documentation; I'm going to add the links in the video description so you can check them out.
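To make that concrete, here is roughly what a scalable sentiment-analysis call looks like; a minimal sketch assuming you have a Cognitive Services key in cognitive_key, a resource in the eastus region, and an input DataFrame df with "text" and "language" columns (all of those are my assumptions, not details from the video):

    from synapse.ml.cognitive import TextSentiment

    # The transformer calls the pre-trained service for every row of the
    # DataFrame, distributed across the partitions of the Spark cluster.
    sentiment = (
        TextSentiment()
        .setSubscriptionKey(cognitive_key)  # assumed: your Cognitive Services key
        .setLocation("eastus")              # assumed: region of your resource
        .setTextCol("text")
        .setLanguageCol("language")
        .setOutputCol("sentiment")
        .setErrorCol("error")
    )

    sentiment.transform(df).select("text", "sentiment").show(truncate=False)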
And last but not least, SynapseML is now integrated with MLflow. If you use SynapseML to train a model, you can now use MLflow with your SynapseML models to save them, load them, and even deploy them; this support is itself a contribution from the SynapseML team back to MLflow, so that MLflow can now handle SynapseML models.

They have also set up Binder as an environment for live demos. What does that mean? Imagine you have a GitHub repository with lots of code; instead of the code just sitting there, you can now execute it, say, in a Jupyter notebook, without worrying about setting up an environment or provisioning a server; there's a server on the backend, so you can launch the code straight from GitHub and run it. That's exactly what we're going to do with one of their notebook examples, to see how SynapseML works.

Before we jump to the code, here is the main SynapseML website; if you quickly search for it, you'll find the link. There are very nice examples here of how you can use, as I mentioned, Cognitive Services, for example for sentiment analysis, quickly through SynapseML, in a scalable manner over large amounts of data, on your input DataFrame with your selected language, and so on. You can leverage deep learning in a scalable manner using SynapseML. Also, if you have responsible AI challenges, say you want to address the explainability of your trained models over large amounts of data, you can use SynapseML together with interpretability packages, such as SHAP here, to explain how your model uses its features. This is actually a very cool capability; let me know if you're interested in a separate video or session just on doing responsible AI at larger scale through SynapseML, which I think would be great. It is also integrated with LightGBM, and even OpenCV: you can run image-based transformations using OpenCV, but at larger scale, because you're running them through SynapseML. This is great. We've already talked about the key components and capabilities of SynapseML listed here, so I'll pass over this part.

Now let's actually go to the Binder demo environment and run one of the notebooks there. For doing so, I want to go through this article, which I'll add to the video description; it summarizes some of the great recent capabilities of SynapseML in the latest release as of the time I'm recording this video, which is August 21st, 2022. If you scroll down, there is a link inviting you to check out their Binder site to get started with SynapseML without worrying about setting up a Spark environment, installing any infrastructure, or even needing any Azure resources. So I clicked on it; let's actually do so, and you'll see that Binder is starting the Microsoft SynapseML repository, and all the code in that GitHub repository becomes an interactive Jupyter notebook that I can run on a server. You have to wait a couple of minutes to give the server time to start; I already did this, so I can quickly go through the code. You can see the server is now ready, and I have the notebooks from the SynapseML documentation: there are examples of how to use, say, Cognitive Services with SynapseML, or how to use responsible AI in a scalable manner with SynapseML. I chose the regression notebook and ran it just before recording this video, so let me bring it up.

There you go. I clicked on "regression," and there is an example of creating a regression model to predict flight delays. I simply ran the code, and I can quickly show you the results I got. First, of course, I needed to create the Spark session through PySpark (I'm using Python on the Spark API), so here is my session in the spark object. Let me go all the way down and skip past the output of the code execution. Then I needed to import SynapseML; the documentation shows how to install it, and it's fairly easy, but because I'm using the Binder environment, I don't need to install anything, it's already there, so I just import it. Now here is the place where I import my training dataset; it comes from a Parquet file in blob storage, which is public, so I have access. Executing this code, I can see the schema of my dataset and the first 10 rows of this flight-delay data in a pandas DataFrame. You can see the features I have: month, day of month, day of week, the carrier name, the origin and destination, all the way to the delay each flight had.

So let's see how we can train a model using SynapseML. I'm randomly splitting the data 75/25; you know why we're doing this, of course: we need a training dataset and a test dataset. Then here are the packages needed for training my model: from SynapseML I'm importing TrainRegressor, since I just want to train a simple regression model to predict the flight delay, plus some other components from PySpark that we'll talk about as we use them. The first thing we do is convert some of the columns, say this one, to categorical. That's why I list the names of the columns I want to treat as categorical, and I build categorical versions of both the training and the test dataset, because the conversion has to be applied to both. For doing so, I fit an indexer on each of the specified categorical input columns, give the output a temporary name, transform my training data, drop the original column, and rename the temporary column back to the original name; then I do the same for the test dataset. So it's just a very simple transformation of the data.
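Pieced together, those prep steps look roughly like this. The blob-storage path and the categorical column names come from the public SynapseML flight-delay notebook rather than from my screen, so treat them as assumptions:

    from pyspark.ml.feature import StringIndexer

    # Public flight-delay dataset used by the SynapseML regression notebook
    # (path per the docs at the time; it may have moved since).
    flight_delay = spark.read.parquet(
        "wasbs://publicwasb@mmlspark.blob.core.windows.net/On_Time_Performance_2012_9.parquet"
    )

    # Random 75/25 split into training and test sets.
    train, test = flight_delay.randomSplit([0.75, 0.25])

    # Fit a StringIndexer on the training data for each categorical column, then
    # apply the same mapping to both sets: transform, drop the original column,
    # and rename the temporary indexed column back to the original name.
    cat_cols = ["Carrier", "DepTimeBlk", "ArrTimeBlk"]  # assumed column names
    train_cat, test_cat = train, test
    for c in cat_cols:
        indexer = StringIndexer(inputCol=c, outputCol=c + "_tmp").fit(train)
        train_cat = indexer.transform(train_cat).drop(c).withColumnRenamed(c + "_tmp", c)
        test_cat = indexer.transform(test_cat).drop(c).withColumnRenamed(c + "_tmp", c)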
Next comes the place where I define my regressor, with its specific parameters; these would of course be different for a different algorithm type, but here it's a linear regression, and these are the parameters I'm specifying. I pass that configuration to TrainRegressor, which comes from SynapseML, together with the column I'm going to predict; that's my label, which is the delay. I fit this on my training dataset and run the code, and then comes the place where I can score my test data to see how my model is doing. Here are some configurations for where I store the model: if it's on Synapse, that's one path; if I'm using SynapseML within Databricks, here's how to store it on the Databricks file system; but here I can just store the model in my current working directory. I then write and save the model, load it back from the given directory, and run the transformation over the test data, which is what actually performs the prediction: the transform call does the prediction using the model I just loaded. Showing the first 10 rows, you can see the predicted flight delays.

What can we do next? We can, of course, calculate metrics for the trained model based on the scored dataset, to see how the model performs, with some statistical analysis. For doing so, I import ComputeModelStatistics from SynapseML, which is scalable; that's great, because even with large data I can still compute these metrics. I ran it over the scored data, which I have here, and converted the result to pandas for a nice way to visualize it. You can see it's a regression, so you get the R-squared score, mean absolute error, and so on.

You can also calculate metrics per instance, I mean per row of predicted values: for example, the L1 and L2 loss, which are based on the difference between the predicted value and the true value for each scored record. For doing so, you import ComputePerInstanceStatistics from SynapseML's train module, call it over your scored data, tell it you want the L1 and L2 loss, and show the first 10 rows. You can see that for each row, the L1 and L2 loss is calculated from the predicted value and the true value. This per-instance computation again scales, because you're using SynapseML and it leverages Spark.

So that was just one example of SynapseML in action, and a very simple regression model at that, to keep it fairly clear and high-level: what SynapseML is, how it can be used, and what the benefits are. Of course, based on your own specific machine learning project, you can check the documentation for which other SynapseML capabilities could be a great fit, beyond what we summarized and discussed at a high level here. That's it, and I hope you enjoyed this video. You are not born a winner and you are not born a loser; you are born a chooser. Have fun, my friends, and we'll see you next week!
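To recap the walkthrough in one place, here is a condensed sketch of the training, scoring, and evaluation flow from the notebook. The label column ArrDelay, the regression parameters, the save path, and the loss column names are taken from the public SynapseML flight-delay notebook rather than transcribed from the video, so verify them against your copy (for example with scored.printSchema()):

    from pyspark.ml.regression import LinearRegression
    from synapse.ml.train import (
        TrainRegressor,
        TrainedRegressorModel,
        ComputeModelStatistics,
        ComputePerInstanceStatistics,
    )

    # Wrap a plain Spark ML algorithm in SynapseML's TrainRegressor, which takes
    # care of assembling the feature vector and wiring up the label column.
    lr = LinearRegression(regParam=0.1, elasticNetParam=0.3)  # assumed parameters
    model = TrainRegressor(model=lr, labelCol="ArrDelay").fit(train_cat)

    # Save the model, load it back, and score the test set; transform() is where
    # the prediction actually happens.
    model_path = "/tmp/flightDelayModel.mml"  # local path; use dbfs:/... on Databricks
    model.write().overwrite().save(model_path)
    loaded = TrainedRegressorModel.load(model_path)
    scored = loaded.transform(test_cat)

    # Aggregate regression metrics (R-squared, MAE, RMSE, ...), computed in a
    # distributed way over the scored DataFrame.
    print(ComputeModelStatistics().transform(scored).toPandas())

    # Per-row statistics such as the L1/L2 loss between prediction and ground
    # truth; column names follow the docs notebook, so double-check them.
    (ComputePerInstanceStatistics()
        .transform(scored)
        .select("ArrDelay", "L1_loss", "L2_loss")
        .show(10))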