Train a Machine Learning Model using Scikit-Learn

Below is code that will load the Penguins dataset and immediately split 30% of the observations into a test dataframe, and the remaining 70% of the observations into a train dataframe.

Your task is to use these two datasets to train a Decision Tree algorithm using Scikit-Learn step-by-step.

  species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
0  Adelie       2            39.1           18.7              181.0       3750.0    0
1  Adelie       2            39.5           17.4              186.0       3800.0    1
2  Adelie       2            40.3           18.0              195.0       3250.0    1
3  Adelie       2            36.7           19.3              193.0       3450.0    1
4  Adelie       2            39.3           20.6              190.0       3650.0    0
(103, 7) (239, 7)

Divide the train dataset into X_train and y_train

  • Select the species column from the train dataframe and save the result to the variable y_train.
  • Select the remaining columns from the train dataframe and save the result to the variable X_train.

Please note the lowercase y and capital X in y_train and X_train, respectively.
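As a rough illustration of this step, the sketch below separates a target column from the feature columns. It uses a tiny stand-in dataframe, not the workshop's real train dataframe, and its values are only illustrative.

```python
import pandas as pd

# Toy stand-in for the workshop's train dataframe (illustrative values only).
train = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "bill_length_mm": [39.1, 46.5, 49.0, 36.7],
    "bill_depth_mm": [18.7, 14.5, 19.5, 19.3],
    "flipper_length_mm": [181.0, 215.0, 200.0, 193.0],
})

# Target: the single column we want the model to predict.
y_train = train["species"]

# Features: every column except the target.
X_train = train.drop(columns=["species"])

print(X_train.shape, y_train.shape)
```

Using drop(columns=...) returns a new dataframe and leaves train itself unchanged, so the cell can safely be re-run.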


Divide the test dataset into X_test and y_test

  • Select the species column from the test dataframe and save the result to the variable y_test.
  • Select the remaining columns from the test dataframe and save the result to the variable X_test.

Please note the lowercase y and capital X in y_test and X_test, respectively.
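Because the test split repeats the same two selections, one option is to wrap them in a small helper. The function name split_features_target is hypothetical (it is not part of the workshop code), and the dataframe here is a toy stand-in for the real test dataframe.

```python
import pandas as pd

def split_features_target(df, target="species"):
    """Return (X, y): all columns except `target`, and the `target` column."""
    return df.drop(columns=[target]), df[target]

# Toy stand-in for the workshop's test dataframe (illustrative values only).
test = pd.DataFrame({
    "species": ["Gentoo", "Adelie"],
    "bill_length_mm": [46.5, 39.5],
    "body_mass_g": [5000.0, 3800.0],
})

X_test, y_test = split_features_target(test)
print(list(X_test.columns), y_test.tolist())
```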


Train (fit) a DecisionTreeClassifier model

Refer back to the code that we wrote together during the workshop to do the following:

  • Import the DecisionTreeClassifier from sklearn.tree
  • Store a DecisionTreeClassifier in a variable called model.
  • Use the model's .fit() method to train the algorithm. You'll need to pass the algorithm your training data X_train and y_train, in that specific order.

This three-step process (import, instantiate, fit) will be nearly identical for any Scikit-Learn algorithm that we choose to use.
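The import, instantiate, fit pattern might look like the sketch below. The training data is a toy stand-in with illustrative values, and random_state=0 is an optional addition that only makes repeated runs reproducible.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # step 1: import

# Toy stand-in for X_train and y_train (illustrative values only).
X_train = pd.DataFrame({
    "bill_length_mm": [39.1, 36.7, 46.5, 47.3, 49.0, 50.5],
    "flipper_length_mm": [181.0, 193.0, 215.0, 220.0, 200.0, 198.0],
})
y_train = pd.Series(
    ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"]
)

# Step 2: store an instance of the classifier in a variable.
model = DecisionTreeClassifier(random_state=0)

# Step 3: fit on the training data, features first, then the target.
model.fit(X_train, y_train)

# After fitting, the model knows which labels it can predict.
print(model.classes_)
```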


Check the model's accuracy using the X_test and y_test data.

  • You can do this manually using the .predict() method and then calculating the percentage of correct predictions, or
  • You can use the model's .score() method (this is much easier).

Please save the model's accuracy on the test data to a variable called dt_accuracy.
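Both approaches give the same number, which the sketch below tries to show on toy data. All values are illustrative stand-ins for the workshop's dataframes, and manual_accuracy is a name introduced here only for the comparison.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for the train and test splits (illustrative values only).
X_train = pd.DataFrame({
    "bill_length_mm": [39.1, 36.7, 46.5, 47.3, 49.0, 50.5],
    "flipper_length_mm": [181.0, 193.0, 215.0, 220.0, 200.0, 198.0],
})
y_train = pd.Series(
    ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"]
)
X_test = pd.DataFrame({
    "bill_length_mm": [38.0, 46.0, 49.5],
    "flipper_length_mm": [185.0, 217.0, 199.0],
})
y_test = pd.Series(["Adelie", "Gentoo", "Chinstrap"])

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Manual route: predict, then take the fraction of predictions that match.
predictions = model.predict(X_test)
manual_accuracy = (predictions == y_test.to_numpy()).mean()

# Shortcut: .score() predicts and compares in a single call.
dt_accuracy = model.score(X_test, y_test)

print(manual_accuracy, dt_accuracy)
```

For classifiers, .score() returns mean accuracy as a fraction between 0 and 1, so multiply by 100 if you want a percentage.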


STRETCH GOAL CHALLENGE!

Stretch Goal Challenges are "extra credit" tasks that weren't necessarily taught to you during our live workshop, but that ask you to use your own brain and research to go a little bit above and beyond on your own.

Can you...

Train a different machine learning algorithm from Scikit-Learn on the same data? Try using a RandomForestClassifier.

You'll follow exactly the same steps that we did for the Decision Tree Classifier.

Save your RandomForestClassifier's final accuracy to a variable called rf_accuracy.
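One way this could look is sketched below. The only structural difference from the decision tree is the import, since RandomForestClassifier lives in sklearn.ensemble. The data is again a toy stand-in with illustrative values, and random_state=0 is an optional addition for reproducibility.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the train and test splits (illustrative values only).
X_train = pd.DataFrame({
    "bill_length_mm": [39.1, 36.7, 46.5, 47.3, 49.0, 50.5],
    "flipper_length_mm": [181.0, 193.0, 215.0, 220.0, 200.0, 198.0],
})
y_train = pd.Series(
    ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"]
)
X_test = pd.DataFrame({
    "bill_length_mm": [38.0, 46.0, 49.5],
    "flipper_length_mm": [185.0, 217.0, 199.0],
})
y_test = pd.Series(["Adelie", "Gentoo", "Chinstrap"])

# Same instantiate-then-fit pattern as the decision tree.
rf_model = RandomForestClassifier(random_state=0)
rf_model.fit(X_train, y_train)

# Mean accuracy on the held-out test data.
rf_accuracy = rf_model.score(X_test, y_test)
print(rf_accuracy)
```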
