What is Machine Learning?

Materials from this workshop:

Colab Notebook (code)

Train a Machine Learning Model using Scikit-Learn

The code below loads the Penguins dataset and immediately splits it: 30% of the observations go into a test dataframe, and the remaining 70% go into a train dataframe.

Your task is to use these two datasets to train a Decision Tree algorithm using Scikit-Learn step-by-step.

  species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
0  Adelie       2            39.1           18.7              181.0       3750.0    0
1  Adelie       2            39.5           17.4              186.0       3800.0    1
2  Adelie       2            40.3           18.0              195.0       3250.0    1
3  Adelie       2            36.7           19.3              193.0       3450.0    1
4  Adelie       2            39.3           20.6              190.0       3650.0    0

(103, 7)

(239, 7)
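
The loading and splitting code is not reproduced on this page. As a rough sketch only, a 30/70 split like the one described could be written with Scikit-Learn's `train_test_split`. The `penguins` dataframe here is a hypothetical stand-in filled with random placeholder values (342 rows, 7 columns), and `random_state=42` is an arbitrary choice, not necessarily what the workshop used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Penguins data: 342 rows, 7 columns.
# The values are random placeholders, not real measurements.
rng = np.random.default_rng(0)
n_rows = 342
penguins = pd.DataFrame({
    "species": rng.choice(["Adelie", "Chinstrap", "Gentoo"], size=n_rows),
    "island": rng.integers(0, 3, size=n_rows),
    "bill_length_mm": rng.normal(44.0, 5.0, size=n_rows),
    "bill_depth_mm": rng.normal(17.0, 2.0, size=n_rows),
    "flipper_length_mm": rng.normal(200.0, 14.0, size=n_rows),
    "body_mass_g": rng.normal(4200.0, 800.0, size=n_rows),
    "sex": rng.integers(0, 2, size=n_rows),
})

# Hold out 30% of the rows for testing; the rest become the training set.
train, test = train_test_split(penguins, test_size=0.3, random_state=42)

print(test.shape)   # (rows, columns) of the test dataframe
print(train.shape)  # (rows, columns) of the train dataframe
```

With 342 rows and `test_size=0.3`, Scikit-Learn rounds the test set up to 103 rows and leaves 239 for training, which is consistent with the two shapes printed above.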

Divide the `train` dataset into `X_train` and `y_train`

Select the species column from the train dataframe and save the result to the variable y_train.
Select the remaining columns from the train dataframe and save the result to the variable X_train.

Please note the lowercase y and capital X in y_train and X_train, respectively.
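
One possible way to do this step with pandas is sketched below. The small `train` dataframe is a hypothetical stand-in built from a few rows of the preview table so that the snippet runs on its own. In the exercise you would use the real `train` dataframe instead.

```python
import pandas as pd

# Hypothetical stand-in for `train`, using rows from the preview table.
train = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie"],
    "island": [2, 2, 2],
    "bill_length_mm": [39.1, 39.5, 40.3],
    "bill_depth_mm": [18.7, 17.4, 18.0],
    "flipper_length_mm": [181.0, 186.0, 195.0],
    "body_mass_g": [3750.0, 3800.0, 3250.0],
    "sex": [0, 1, 1],
})

# The target (what we want to predict) is the species column.
y_train = train["species"]

# The features are every remaining column.
X_train = train.drop(columns=["species"])

print(X_train.shape, y_train.shape)
```

`drop(columns=...)` returns a new dataframe and leaves `train` itself unchanged, so the species column is still available afterwards.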

Divide the `test` dataset into `X_test` and `y_test`

Select the species column from the test dataframe and save the result to the variable y_test.
Select the remaining columns from the test dataframe and save the result to the variable X_test.

Please note the lowercase y and capital X in y_test and X_test, respectively.
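
Because the same split is needed for both dataframes, it can be wrapped in a small function. The sketch below is one way to do that. The helper name `split_features_target` is made up for illustration, and the `test` dataframe is a hypothetical stand-in built from two rows of the preview table.

```python
import pandas as pd

def split_features_target(df, target="species"):
    """Return (X, y): every column except `target`, and the `target` column."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y

# Hypothetical stand-in for `test`, using rows from the preview table.
test = pd.DataFrame({
    "species": ["Adelie", "Adelie"],
    "island": [2, 2],
    "bill_length_mm": [36.7, 39.3],
    "bill_depth_mm": [19.3, 20.6],
    "flipper_length_mm": [193.0, 190.0],
    "body_mass_g": [3450.0, 3650.0],
    "sex": [1, 0],
})

X_test, y_test = split_features_target(test)
print(X_test.columns.tolist())
```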

Train (fit) a `DecisionTreeClassifier` model

Refer back to the code that we wrote together during the workshop to do the following:

Import the DecisionTreeClassifier from sklearn.tree.
Store a DecisionTreeClassifier instance in a variable called model.
Use the model's .fit() method to train the algorithm. You'll need to pass the algorithm your training data, X_train and y_train, in that specific order.

This three-step process will be nearly identical for any Scikit-Learn algorithm that we choose to use.
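
The three steps can be sketched as follows. The training data here is a tiny made-up example with two features and two species, included only so that the snippet runs on its own. The `random_state=0` argument is an optional assumption that makes the result repeatable.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # step 1: import

# Made-up training data for illustration (not the real Penguins rows).
X_train = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5, 50.0, 45.2],
    "flipper_length_mm": [181.0, 186.0, 195.0, 210.0, 220.0, 215.0],
})
y_train = pd.Series(
    ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    name="species",
)

# Step 2: create the model object.
model = DecisionTreeClassifier(random_state=0)

# Step 3: train it. Features come first, then the target.
model.fit(X_train, y_train)

print(model.classes_)  # the species labels the model learned
```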

Check the model's accuracy using the `X_test` and `y_test` data.

You can do this manually using the .predict() method and then calculating the percentage of correct predictions, or
You can use the model's .score() method (this is much easier).

Please save the model's accuracy on the test data to a variable called dt_accuracy.
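
Both approaches are sketched below on a tiny made-up dataset, so the numbers are illustrative only. For a classifier, `.score()` returns the fraction of correct predictions, so the two values should agree.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up train and test data for illustration (not the real Penguins rows).
X_train = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5, 50.0, 45.2],
    "flipper_length_mm": [181.0, 186.0, 195.0, 210.0, 220.0, 215.0],
})
y_train = pd.Series(["Adelie"] * 3 + ["Gentoo"] * 3, name="species")

X_test = pd.DataFrame({
    "bill_length_mm": [36.7, 39.3, 47.6, 49.1],
    "flipper_length_mm": [193.0, 190.0, 215.0, 222.0],
})
y_test = pd.Series(["Adelie", "Adelie", "Gentoo", "Gentoo"], name="species")

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Option 1: predict, then compute the share of correct predictions by hand.
predictions = model.predict(X_test)
manual_accuracy = np.mean(predictions == y_test.to_numpy())

# Option 2: let the model compute the same quantity.
dt_accuracy = model.score(X_test, y_test)

print(manual_accuracy, dt_accuracy)
```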

STRETCH GOAL CHALLENGE!

Stretch Goal Challenges are "extra credit" tasks that weren't necessarily taught during our live workshop. They ask you to do your own research and go a little bit above and beyond on your own.

Can you...

Train a different machine learning algorithm from Scikit-Learn on the same data? Try using a RandomForestClassifier.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

You'll follow exactly the same steps that we did for the Decision Tree Classifier.

Save your RandomForestClassifier's final accuracy to a variable called rf_accuracy.
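
As a starting point, the same import, create, fit, and score pattern might look like this for a random forest. The data is a tiny made-up example, and `n_estimators=100` and `random_state=0` are assumed settings rather than values from the workshop.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up train and test data for illustration (not the real Penguins rows).
X_train = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 46.5, 50.0, 45.2],
    "flipper_length_mm": [181.0, 186.0, 195.0, 210.0, 220.0, 215.0],
})
y_train = pd.Series(["Adelie"] * 3 + ["Gentoo"] * 3, name="species")

X_test = pd.DataFrame({
    "bill_length_mm": [36.7, 39.3, 47.6, 49.1],
    "flipper_length_mm": [193.0, 190.0, 215.0, 222.0],
})
y_test = pd.Series(["Adelie", "Adelie", "Gentoo", "Gentoo"], name="species")

# Same three steps as before, with a different estimator class.
rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
rf_model.fit(X_train, y_train)

# Accuracy on the held-out test data.
rf_accuracy = rf_model.score(X_test, y_test)
print(rf_accuracy)
```

Only the import and the class name change. The `.fit()` and `.score()` calls take the same arguments as they did for the decision tree.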
