***

# Modelling Notebook 

***

# Creating A Binary Classification Model 



## Prerequisites

To run this notebook you need:

- A suitable conda environment. This notebook can be run on [General Machine Learning](https://docs.oracle.com/en-us/iaas/data-science/using/conda-gml-fam.htm) for CPU on Python 3.8 (version 1.0) **if you update ADS to the latest version**.
- Authorisation to work with the OCI Data Science Service. We will do this through the resource principal. Additional details can be found [here](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/cli/authentication.html#).
- The smoking dataset used in this notebook, which can be found [here](https://www.kaggle.com/datasets/kukuroo3/body-signal-of-smoking).

***

### First we will load in the required libraries 


In [None]:
# Ensure Oracle ADS is up to date
#!pip install -U oracle-ads

In [None]:
# Required for data exploration and cleaning
import pandas as pd
import numpy
import ads 
from ads.dataset.factory import DatasetFactory


# Required for creating and evaluating models 
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from ads.common.model import ADSModel
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData


# Required for saving and deploying models 
from ads.model.framework.sklearn_model import SklearnModel
import tempfile
import json
from shutil import rmtree
from ads.model.model_metadata import UseCaseType

***

## Authenticate  
Authentication to the OCI Data Science service is required. Here we are using the resource principal; you could use an API key instead. Details can be found [here](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/configuration/authentication.html#configuration-file).

In [None]:
# Authenticate with OCI Data Science Service
ads.set_auth(auth="resource_principal")

***

## Load Data
The data has been uploaded to this notebook session and stored in a directory called "Data". The data is read in as a pandas DataFrame. However, some of the ADS methods used in this notebook require the data to be an ADSDataset. More information about ADSDatasets can be found [here](https://accelerated-data-science.readthedocs.io/en/latest/user_guide/loading_data/connect.html#convert-pandas-dataframe-to-adsdataset).

In [None]:
# Read in the CSV file from the OCI notebook session
df = pd.read_csv('Data/smoking.csv')

# Convert the data set to an ADSDataset, required for the "show_in_notebook" function
smoking_ds = DatasetFactory.open(df, target="smoking").set_positive_class(1)

***

## Correlation plots
Oracle ADS has three built-in correlation methods:
* **Pearson correlation** - for continuous numerical variables, *e.g. df.ads.pearson()*
* **Correlation ratio** - to compare categorical variables to continuous variables, *e.g. df.ads.correlation_ratio()*
* **Cramer's V**  - to measure the amount of association between two categorical variables, *e.g. df.ads.cramersv()*

Each of these has an associated plot function to visualise the correlations, for example *df.ads.pearson_plot()*. In these examples, “df” can be a pandas DataFrame or an ADS dataset.
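
To make the three statistics concrete, here is a minimal plain-Python sketch of what each one measures. It is not the ADS implementation, and ADS may apply corrections or conventions that differ from these textbook forms. The helper names (`pearson`, `correlation_ratio`, `cramers_v`) and the toy data are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict


def pearson(x, y):
    """Pearson correlation between two continuous variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_ratio(categories, values):
    """Correlation ratio (eta): square root of the share of variance
    in `values` that is explained by the category."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    grand_mean = sum(values) / len(values)
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total)


def cramers_v(a, b):
    """Cramer's V (without bias correction) between two categorical variables."""
    n = len(a)
    pair_counts = Counter(zip(a, b))
    a_counts, b_counts = Counter(a), Counter(b)
    chi2 = 0.0
    for ca in a_counts:
        for cb in b_counts:
            expected = a_counts[ca] * b_counts[cb] / n
            chi2 += (pair_counts[(ca, cb)] - expected) ** 2 / expected
    k = min(len(a_counts), len(b_counts)) - 1
    return math.sqrt(chi2 / (n * k))


# Toy data (illustrative values only, not taken from the smoking dataset).
height = [150, 160, 170, 180]
weight = [50, 60, 70, 80]
group = ["a", "a", "b", "b"]
value = [1, 1, 3, 3]
sex = ["m", "m", "f", "f"]
smoker = ["y", "y", "n", "n"]

print(pearson(height, weight))
print(correlation_ratio(group, value))
print(cramers_v(sex, smoker))
```

Each toy pair is perfectly associated, so all three statistics sit at their maximum of 1; unrelated variables would give values near 0.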

In [None]:
smoking_ds.ads.pearson_plot()
type(smoking_ds)

***

## Show in notebook

The ADS *show_in_notebook* method creates a preview of all the basic information about the data set. You can apply it to an ADSDataset, but not directly to a pandas DataFrame. More information about this can be found [here](https://accelerated-data-science.readthedocs.io/en/latest/ads.dataset.html?highlight=show_in_notebook#ads.dataset.dataset.ADSDataset.show_in_notebook).

In [None]:
# Overview of Data 
smoking_ds.show_in_notebook()

***

## Suggest recommendations

The suggest_recommendations function highlights issues with the data and suggests changes to apply to the dataset that would make it more suitable for modelling.


In [None]:
smoking_ds.suggest_recommendations()

***

## Auto transform

Auto transform applies all the recommended changes from suggest_recommendations. This function returns a transformed dataset, created by performing all the recommendations at once.

In [None]:
transformed_smoking_ds = smoking_ds.auto_transform(fix_imbalance=False)

***

## Visualize Transforms

If you have used auto_transform to perform the transformations, you can use the visualize_transforms() function to view them. This function only works with the automated transformations and does not capture any custom transformations that you may have applied to the dataset.

In [None]:
transformed_smoking_ds.visualize_transforms()

***

## Creating Training and Test Datasets

Split the data into a training and a test data set. Here we are taking 15% of the data as the test data set and 85% as the training data set.

In [None]:
# Split the data into a training and test data set, here we are taking 15% of the data as the test data set and 85% as the training data set. 
train, test = transformed_smoking_ds.train_test_split(test_size=0.15)

In [None]:
# Splitting train and test X and y out for clarity
X_train = train.X
y_train = train.y
X_test = test.X
y_test = test.y

In [None]:
print(X_train.shape)
print(X_test.shape)
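
Conceptually, an 85/15 split shuffles the row positions and holds out a fraction of them for testing. The sketch below shows that idea in plain Python. It is not how ADS implements `train_test_split`, which may differ in rounding, shuffling, or stratification; the helper name `split_indices` and the row count of 200 are assumptions for illustration.

```python
import random


def split_indices(n_rows, test_size=0.15, seed=0):
    """Shuffle row positions and hold out a fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


# Hypothetical dataset of 200 rows.
train_idx, test_idx = split_indices(200)
print(len(train_idx), len(test_idx))
```

With 200 rows this keeps 170 for training and 30 for testing, and no row appears in both sets.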

***

## Modelling 
Here we are using *sklearn* to train a Logistic Regression and a Random Forest Classifier Model.


In [None]:
lr_clf = LogisticRegression(random_state=0, solver='lbfgs',
                    multi_class='multinomial').fit(X_train, y_train)

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100,
                                random_state=42).fit(X_train, y_train) 


***
    
## Evaluate Models using `ADSEvaluator`

ADS supports evaluating:

- regression
- binary classification
- multiclass classification

The table below shows common evaluator metrics. `ADS` also supports adding your own custom evaluation functions.


In [None]:
# Converting the models to ADS Model formats
bin_lr_model = ADSModel.from_estimator(lr_clf, classes=[0,1])
bin_rf_model = ADSModel.from_estimator(rf_clf, classes=[0,1])

In [None]:
# Creating the ADS evaluator 
evaluator = ADSEvaluator(
    ADSData(X_test, y_test),
    models=[bin_lr_model, bin_rf_model],
    training_data=ADSData(X_train, y_train),
)

In [None]:
# Printing out all the model evaluator metrics in a table
print(evaluator.metrics)
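
As a hedged illustration of a custom evaluation function, the sketch below defines balanced accuracy as a plain function of true and predicted labels. The function name `balanced_accuracy` and the toy labels are assumptions; the registration call mentioned in the final comment is recalled from the ADS documentation and should be checked against your installed version.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy labels (illustrative values only).
example_true = [1, 1, 1, 0, 0, 0, 0, 0]
example_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = balanced_accuracy(example_true, example_pred)
print(round(score, 4))

# Assumption: a function of this shape could be registered on the evaluator
# with something like
#   evaluator.add_metrics([balanced_accuracy], ["Balanced accuracy"])
# Verify the exact method and signature in the ADS documentation first.
```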

***
    
## `ADSEvaluator` and `show_in_notebook`

You can then use this ADS evaluator with *show_in_notebook* to visualise a range of evaluation plots. These include the precision-recall, ROC, lift, and gain plots, and normalised confusion matrices. Each model given to the function is plotted together, allowing comparison between models.



In [None]:
# Plotting common evaluator metrics 
evaluator.show_in_notebook()



### Precision-Recall curve
The precision-recall curve shows the tradeoff between precision and recall for different thresholds, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. A large area under the curve represents both high recall and high precision.
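
The tradeoff can be seen by computing precision and recall at a few thresholds by hand. This is a minimal sketch on toy data; the helper name `precision_recall_at`, the labels, and the scores are assumptions for illustration, not output from the models in this notebook.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are labelled positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    # Convention used here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall


# Toy labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

for threshold in (0.2, 0.5, 0.8):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the threshold from 0.2 to 0.8 moves precision from 0.60 to 1.00 while recall falls from 1.00 to 0.33.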


### ROC Curve
An ROC curve (Receiver Operating Characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. It is generated by plotting the true positive rate (sensitivity) against the false positive rate (1 - specificity). The ROC curve summarises the performance of the model, with a larger area under the curve indicating a better performing model. The area under the ROC curve can be read as the probability that a randomly chosen positive example is given a higher score than a randomly chosen negative example. This measure can be useful for class-imbalanced datasets because, unlike accuracy, it reflects both sensitivity and specificity.
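
As a concrete check of how the area under the ROC curve behaves, the sketch below computes it by comparing the score of every positive example with the score of every negative example and counting how often the positive one is ranked higher (ties count as half). The function name `auc_by_ranking` and the toy data are assumptions for illustration; this pairwise form is practical only for small samples.

```python
def auc_by_ranking(y_true, scores):
    """Area under the ROC curve as the share of (positive, negative) pairs
    in which the positive example has the higher score."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities (illustrative values only).
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
auc = auc_by_ranking(y_true, scores)
print(round(auc, 3))
```

Here 8 of the 9 positive-negative pairs are ordered correctly, giving an area of about 0.889; a model that ranks at random would score about 0.5.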


### Lift Chart 
A lift chart graphically represents the improvement that a model provides when compared against a random guess, and measures the change in terms of a lift score.
Lift is the ratio of the number of positive observations up to decile *i* using the model to the expected number of positives up to that decile based on a random model. A lift chart plots the lift on the vertical axis against the corresponding decile on the horizontal axis.

$$ Lift = {Cumulative\_number\_of\_positive\_observations\_up\_to\_decile\_i\_using\_ML\_model \over Cumulative\_number\_of\_positive\_observations\_up\_to\_decile\_i\_using\_random\_model} $$


### Gain Chart 
Gain is the ratio of the cumulative number of positive observations up to a decile to the total number of positive observations in the data. A gain chart plots the gain on the vertical axis against the decile on the horizontal axis.

$$ Gain = {Cumulative\_number\_of\_positive\_observations\_up\_to\_decile\_i \over Total\_number\_of\_positive\_observations\_in\_the\_data} $$
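
The lift and gain definitions can be worked through on a small example. The sketch below ranks observations by predicted score, then computes gain and lift for each decile. The helper name `gain_and_lift` and the 20 toy observations are assumptions for illustration; the plots produced by ADS may bin the data differently.

```python
def gain_and_lift(y_true, scores, n_bins=10):
    """Return (decile, gain, lift) tuples for observations ranked by score."""
    ranked = [
        t for _, t in sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    ]
    total_pos = sum(ranked)
    n = len(ranked)
    rows = []
    for i in range(1, n_bins + 1):
        cutoff = round(n * i / n_bins)           # observations up to decile i
        cum_pos = sum(ranked[:cutoff])           # positives found by the model
        gain = cum_pos / total_pos
        expected_random = total_pos * i / n_bins  # positives a random model finds
        lift = cum_pos / expected_random
        rows.append((i, gain, lift))
    return rows


# 20 toy observations with strictly decreasing scores (illustrative only).
scores = [(20 - i) / 20 for i in range(20)]
labels = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

rows = gain_and_lift(labels, scores)
for decile, gain, lift in rows:
    print(f"decile {decile:2d}: gain={gain:.2f}, lift={lift:.2f}")
```

In this example the top decile captures 25% of all positives, a lift of 2.5 over random selection. By the last decile the gain reaches 1 and the lift falls to 1, as it must for any model.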

### Normalised Confusion Matrix 
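In a normalised confusion matrix, the counts in each row are divided by the row total, so every row sums to 1.00. Each row therefore represents 100% of the elements in a particular class, and each cell shows the proportion of that class assigned to each predicted label. The metrics defined in the rest of this section are computed from the four cells of the binary confusion matrix: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).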
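
A minimal sketch of row normalisation, assuming rows correspond to the true class and columns to the predicted class. The function name `normalised_confusion_matrix` and the toy labels are assumptions for illustration.

```python
def normalised_confusion_matrix(y_true, y_pred, classes=(0, 1)):
    """Confusion matrix with each row (true class) scaled to sum to 1."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    matrix = []
    for t in classes:
        row_total = sum(counts[t].values())
        matrix.append([counts[t][p] / row_total for p in classes])
    return matrix


# Toy labels: 8 true negatives-class rows and 4 positive-class rows.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

matrix = normalised_confusion_matrix(y_true, y_pred)
for row in matrix:
    print(row)
```

Although the two classes have different sizes (8 and 4 examples), both rows read 75% correct and 25% incorrect after normalisation, which makes per-class performance directly comparable.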


$$
  Sensitivity=Recall=\frac{TP}{TP+FN}  
$$
i.e. Out of all the positive examples, how many are predicted as positive?
    
$$
  Specificity=\frac{TN}{TN+FP}  
$$
i.e. Out of all the "negative" class examples, how many were correctly predicted negative?

$$
  Precision=\frac{TP}{TP+FP}  
$$
i.e. Out of all the examples that were predicted as positive, how many are really positive?
    
$$
  Accuracy=\frac{TN+TP}{TN+TP+FN+FP}  
$$
i.e. Proportion of correctly predicted values
    
$$
  F_1=\frac{2TP}{2TP+FP+FN}  
$$
This is the F1-score, the most commonly used F-score. It combines precision and recall as their harmonic mean. It doesn't include the true negative count, so it's useful if you have a class-imbalanced data set and you care about predicting as many true positive results as possible.
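
The formulas in this section can be evaluated directly from the four confusion matrix counts. The sketch below does so for a set of made-up counts; the function name `binary_metrics` and the numbers are assumptions for illustration.

```python
def binary_metrics(tp, tn, fp, fn):
    """Evaluate the metrics defined above from raw confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tn + tp) / (tn + tp + fn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Made-up counts for 100 test examples.
metrics_example = binary_metrics(tp=30, tn=50, fp=10, fn=10)
for name, value in metrics_example.items():
    print(f"{name}: {value:.3f}")

# The F1-score equals the harmonic mean of precision and recall.
p, r = metrics_example["precision"], metrics_example["sensitivity"]
harmonic_mean = 2 * p * r / (p + r)
```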


***

# Saving and Deploying Models 

ADS can be used to quickly prepare, save, and deploy a model.

## Preparing a Model

The first step is to prepare the model. This involves creating a model artefact that contains the following items:
* A serialised model;
* `runtime.yaml` -  information about the model and required conda environment;
* `score.py` - used by the model deployment server to load in the model and create predictions;
* `input_schema.json` - Example input (optional);
* `output_schema.json` - Example output (optional);
* Any other artefacts required.


In [None]:
# Create temp dir
artefact_dir = tempfile.mkdtemp()

# Here I have specified a conda environment published to an OCI storage bucket, 
# the inference_conda_env and training_conda_env would need to be changed accordingly
sklearn_model = SklearnModel(estimator=rf_clf, artifact_dir=artefact_dir)
sklearn_model.prepare(
    inference_conda_env="oci://<Bucket_Name>@<Namespace>/conda_environments/cpu/RM_DS_ENV/1/rm_ds_envv1",
    training_conda_env="oci://<Bucket_Name>@<Namespace>/conda_environments/cpu/RM_DS_ENV/1/rm_ds_envv1",
    use_case_type=UseCaseType.BINARY_CLASSIFICATION,
    X_sample=X_train.head(5),
    y_sample=y_train.head(5),
    force_overwrite=True,
)



The `.summary_status()` method shows us what steps are left to complete in our data science workflow. We can see that we have completed the prepare step, which created the files needed for the model. The next step that is "Available" but not "Done" is `verify()`.

In [None]:
sklearn_model.summary_status()

***

## Verify the model

The verify method invokes the ``predict`` function defined inside ``score.py`` in the artifact_dir. Here we are testing the ``score.py`` file with a sample of the test data to create predictions.
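
To give a feel for what is being tested, the sketch below shows the general shape of a `score.py`: one function that loads the model and one `predict` function that returns the predictions. It is not the file that ADS generates, which will differ in its details. The `ThresholdModel` stand-in, the `load_model` name, and the input format are assumptions for illustration.

```python
import json


class ThresholdModel:
    """Stand-in for a trained estimator (hypothetical, for illustration)."""

    def predict(self, rows):
        # Predict class 1 when the first feature exceeds 0.5.
        return [1 if row[0] > 0.5 else 0 for row in rows]


def load_model():
    """In a real score.py this would deserialise the saved model file."""
    return ThresholdModel()


def predict(data, model=None):
    """Return predictions in a JSON-serialisable dictionary."""
    if model is None:
        model = load_model()
    # Accept either a JSON string or an already-parsed list of rows.
    rows = json.loads(data) if isinstance(data, str) else data
    return {"prediction": model.predict(rows)}


# Two hypothetical input rows, passed as a JSON string.
result = predict("[[0.9, 1.0], [0.2, 3.0]]")
print(result)
```

Running the prediction path locally like this is a cheap way to catch serialisation or input-format problems before the model is deployed.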

In [None]:
sklearn_model.verify(X_test[:10])

***

## Save the model

We can then save this model to the OCI model catalog. Give your model a unique name! This will fail if there is already a model with the same name saved to the catalog.

In [None]:
sklearn_model.save(display_name="RF_Smoking_Model")

If you go to the OCI Console -> Analytics & AI -> Data Science -> Models, you will be able to see your model saved there.

***

## Deploy

When the model is in the model catalog, you can use the model's `.deploy()` method to deploy it and it will return a `ModelDeployment` object. In the cell below we have kept all the default settings except the display name.

You can also deploy the model from the OCI console, by clicking on the 3 dots next to your saved model in the OCI model catalog. (You do not need to do both).

In [None]:
deploy = sklearn_model.deploy(
    display_name="Random Forest Model For Smoking Classification",
)

Running `summary_status()` again, we can see that we now have an "ACTIVE" deployment of our created model.

In [None]:
sklearn_model.summary_status()

***

## Predict

Since the model is deployed, we can now use it to create predictions using the `.predict()` method.

In [None]:
ExampleDataToPredict = X_test.head(20)
sklearn_model.predict(data=ExampleDataToPredict)

***

##  Invoking the model from an HTTP endpoint

In the blog post "Introduction to Oracle Data Science Service Part 3" there was an example of how to call a deployed model from its HTTP endpoint. In that example I supplied input data in JSON format; the payload was obtained from the cell below.

In [None]:
ExampleDataToPredict.head(1).to_json()
#ExampleDataToPredict[:5].to_json()
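
For reference, the default pandas `DataFrame.to_json()` output is column-oriented: each column name maps to a dictionary of row index to value. The sketch below builds a payload of that shape with the standard library only. The column names, values, and row index are hypothetical; the real ones come from `ExampleDataToPredict`.

```python
import json

# Hypothetical single row keyed by column name, with a hypothetical row index.
row_index = 17
row = {"age": 40, "height(cm)": 170, "weight(kg)": 70}

# Column-oriented layout: {column: {row index as a string: value}}.
payload = json.dumps(
    {column: {str(row_index): value} for column, value in row.items()}
)
print(payload)

# The receiving side can decode the payload back into a dictionary.
decoded = json.loads(payload)
```

Row index labels become strings in the JSON text, so code that consumes the payload should not rely on them being integers.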

***

##  Loading in a saved model from the OCI Model Catalog

The `.predict()` and `.verify()` methods above worked because we created and deployed the model in this notebook. If we wanted to call this model from another notebook, or a different data scientist wanted to call it from their own session, they would first need to specify which saved or deployed model they would like to use.


To load in a saved model from the OCI Model Catalog:

In [None]:
# Change the OCID to the SAVED model OCID
saved_model = SklearnModel.from_model_catalog(
    "ocid1.datasciencemodel.oc1.xxx.xxxxx",
    model_file_name="model.joblib",
    artifact_dir="rf-download-test",
)

Now you have the saved model in this notebook session, and you can use it to generate predictions using `.verify()`. (Note that `.predict()` only works on deployed models, while `.verify()` works on models in the notebook session you are running.)

In [None]:
# To create predictions from a model that isn't deployed, use verify.
saved_model.verify(ExampleDataToPredict)["prediction"]

***

##  Referencing a model deployment from a different notebook 
To create predictions from a deployed model, you need to specify the deployed model you wish to use. You will need to supply the OCID of the **deployment**. 

In [None]:
# Find the DEPLOYMENT OCID and change it here.
deployed_model = SklearnModel.from_model_deployment(
    "ocid1.datasciencemodeldeployment.oc1.xxx.xxxxx",
    model_file_name="model.joblib",
    artifact_dir="deployed-download-test",
)

Since this deployment is active, you can call `.predict()` on the model object to send a request to the deployed endpoint and get predictions back.

In [None]:
deployed_model.predict(ExampleDataToPredict)

**Remember not to leave models deployed unless they are regularly called, since you will be charged for the number of hours the model is deployed.**