
Let's remind ourselves: why are we learning Python?

Why are we learning Python? Because we can use Python to do powerful data science. So let's take a pause from the minutiae of Python syntax to talk about data science and machine learning at a high level, to orient ourselves with what we've already learned and with where we're headed.

Machine learning is all about using data to make useful predictions. By the way, you're not going to hear me say "AI" or "Artificial Intelligence" very often, because to me, AI (in the kindest interpretation of the term) is just really good machine learning. Maybe even so good that it can do some things that humans can do. Hopefully, even better than humans can do them.

But they're pretty much the same thing; it's just that, to me, the term "AI" brings along a lot of extra connotations that I'm not as concerned about at this very moment. I'm talking everything from the hot-tech AI-startup-hype connotations to the Terminator-murder-robot connotations. It feels to me like the world lets us leave a lot of that extra stuff alone and just talk about the core technology when we use the term "machine learning." So, if you're here because you want to learn about "AI," well, you are learning about it; I'm just not going to call it that.

What have we learned so far?

With that out of the way, what have we learned so far? We've learned that Python is our go-to language for doing data science, and that data scientists use notebooks heavily in their work. And we've learned about storing numbers in properly named variables. We've definitely learned more than that, but those are the major highlights.

So, why are we pausing here? Well, we just finished learning about numeric data types in Python (integers and floats). And by the way, I love that I can use even that basic jargon with you now that we've established it.

The reason I've decided to pause now is that machine learning is all about numbers. You might be saying, "Well, duh, Ryan." But machine learning might be even more about numbers than you currently realize. In order to explain what I mean, let's look at a specific example of machine learning that I think everyone will be able to relate to. While we do that, we'll establish a high-level framework for how machine learning modeling works, and we'll learn some important data science terminology along the way.

What is a Machine Learning Model?

So what is machine learning all about? And more specifically, what is a machine learning "model"? To introduce the idea of a machine learning model, I want us to imagine a physical machine as an analogy for what predictive models are all about.

A machine (generally speaking) takes in inputs, does some stuff on the inside, and then spits out outputs. For example, let's imagine a machine that makes bread.

We give the machine inputs like flour, water, yeast, and salt, and we want it to spit out a freshly baked loaf of bread on the other end. If we were mechanical engineers, we would dive into the internals of the machine and give it different capabilities: a mixer that can mix up the ingredients, a container for the dough, an oven to cook the bread in, etc.

Bread Making Machine

With a physical bread-making machine, the machine has to be built specifically to make bread. If you make a bread-making machine, you're not going to get a bicycle as an output, even if you change the inputs to bicycle-related inputs. Physical machines have to be built in extremely specific ways to perform very specific functions.

This is not the case with machine learning models. Predictive models are computational machines where we call the inputs "data" and the outputs "predictions." In the simplest sense, that's all a model really is: a computational machine that turns data inputs into prediction outputs.

Data Inputs -> Prediction Outputs

But the amazing thing about machine learning models is that they build themselves (to a large extent). They're not specifically programmed to do any one thing. And this is where the "learning" part of "machine learning" comes in. Instead of being programmed to complete a specific task, machine learning models modify their internal function over many iterations of trial and error until they get good at translating the given inputs to the expected outputs.

To visualize this a little more concretely, I want us to imagine together how a bread-making machine would function if it worked like a machine learning algorithm. Maybe something like this:

You give the machine appropriate bread-making ingredients (that's important), and then you say, "Machine, please make me bread that is like this," and you show it a beautiful loaf of bread. The machine runs for a while and then spits out a disgusting black lump of something totally inedible on the other end. You say, "Bad machine, that's not very close to what I asked for. Try again, and this time, please make something that is more like this," and you show it the good loaf again.

The machine tries again, and the second time, it spits out another inedible lump, but this time it's not quite as black. You say, "Hey, that's ever so slightly closer to what I wanted, but still bad. Please make me bread like this," and you show it the good bread yet again.

It churns for a while, and then another inedible lump comes out that is a little more loaf-shaped and slightly less black than the previous run.

This process is repeated thousands, maybe even millions of times until eventually, through what at the end feels like sheer dumb luck or maybe even brute force, the bread machine spits out a slightly over-proved, under-baked loaf of respectable-enough bread. It's not perfect, but it definitely, and unmistakably, qualifies as a loaf of edible bread. Success!

This is machine learning. After we started this process of repeated trial-and-error feedback, we never went inside the machine to tinker with its internals. The machine handled all of that on its own. When all's said and done, we might not even be quite sure what exactly is happening inside the machine that enables it to accomplish this. But we don't really care too much either, because none of that matters if we're not getting the output that we want. If the output isn't a loaf of bread, then it doesn't matter what's going on inside the machine.

Does this seem like kind of a minor miracle? Well, in some ways it is, and in other ways it's a lot of very boring, simple guess-and-check iteration that, when taken together, adds up to complex and interesting, even seemingly intelligent, behavior.

A classic Machine Learning meme says:

changing random stuff fast enough is machine learning
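To make the meme concrete, here's a toy sketch in plain Python (no machine learning library involved, and the numbers are invented purely for illustration) of guess-and-check learning. The "machine" has a single internal knob; we randomly nudge it and keep each nudge only when the output gets closer to the target we asked for:

```python
import random

random.seed(0)  # make the run repeatable

TARGET = 42.0   # the "perfect loaf" we show the machine
knob = 0.0      # the machine's single internal setting

def error(setting):
    # How far is the machine's output from what we asked for?
    return abs(setting - TARGET)

for attempt in range(10_000):
    nudge = random.uniform(-1, 1)          # change random stuff...
    if error(knob + nudge) < error(knob):  # ...check whether the output improved...
        knob += nudge                      # ...and keep the change only if it did

print(round(knob, 2))  # should land very close to 42.0
```

Real machine learning algorithms are much smarter about which "nudges" to try, but the keep-whatever-improves loop is the same basic spirit.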

The importance of good data

Even with a magic self-rearranging bread-making machine, we still had to give it the ingredients for bread. We couldn't feed it gravel and horse manure and expect beautiful bread out the other end; it's not like it can rearrange subatomic particles or something like that. Machine learning models are similar. A common saying in the industry is "garbage in, garbage out." The data our model is provided has to relate to the predictions we want to make in some kind of patterned, correlated, discernible way. That's really what a machine learning model's job is, and it's where the term "model" comes from: its job is to figure out, or "model," how the inputs relate or map to the corresponding outputs, so that we can repeatably produce the right outputs when given good inputs.

But what I've said is still a bit generic and vague. And by the way, how does all of this relate to the Python we've learned so far, particularly integers and floats?

Well, let's look at a specific example of machine learning together to make this a bit more concrete, and while we do so, we'll establish some important terminology that we'll be using throughout our modeling journey.

Modeling Housing

Let's imagine that we want to model some housing data. Specifically, we want to use the characteristics of single-family homes to estimate how much they might sell for: their "sale price."

Let's start with the inputs –the data. We need that before we can do anything else.

Tabular Data

Now, as I go about collecting data, I need a place to store it. A common way of doing this is creating what's called "tabular data." This means data that's stored in a table: a grid of rows and columns. There are lots of different names for this very common data format: CSV files, flat files, spreadsheets, tables, database tables, dataframes, etc. For this example I've just opened up a Google Sheets spreadsheet, and we're going to invent some fake data together.

The first thing we're going to ask ourselves is, "What do I want to predict?" This can be a tough question to answer sometimes, but the answer is going to affect a lot of our decision making and will even help dictate what kinds of machine learning algorithms we can try to use to model this data. (We'll talk about that more in a minute.)

Luckily, in our case, it's pretty straightforward. We want to predict the sale price of homes, so let's make a column called sale_price, and we'll imagine that we've collected the sale prices of... say... 5 different homes. I'll just make up some numbers here in the hundreds of thousands of dollars or so. Now, when we're collecting our data in a table, each row of the table will represent a different "record" or "observation." In our case, our unit of observation is a house, a single property. So each row will hold the information of a specific house, and each column will hold a specific characteristic of those houses.

So let's add some new columns here. Let's add a column for the square footage of the home, the acreage of the property, and... the number of bedrooms. Sorry about using square footage and acreage instead of square meters and hectares or whatever; they're measuring the same things, just with different units.

Now I'll fill these in with some fake data. Don't be alarmed about the fakeness of the data here. It really is okay to work with fake data; it's trying to pass fake data off as if it were real data that will get you in trouble.

As I make up this data, I want you to ask yourself, "What data type would I use to represent these different measurements?" How about the number of bedrooms (num_bedrooms)? Well, that naturally lends itself to being represented by integer values. It doesn't make much sense to have less than a whole bedroom, does it?

What about lot size in acres (lot_size_acres)? I might want some extra granularity here, since (at least where I live) most homes aren't on anywhere near a full-acre lot. (There are about 2.47 acres in a hectare, by the way.) So I'll need to record the acreage values using floats.

What about the sale price and the square footage (sale_price & square_ft)? Well, it's up to me. Do I care to have a measurement so precise that I'm measuring the sale price to the nearest cent, or the square footage to the nearest tenth, hundredth, or thousandth of a square foot? That level of specificity is probably more hassle than it's worth, so we'll just round those to the nearest whole number as well.
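If we were building this table in Python instead of a spreadsheet, it might look something like the sketch below. The numbers are invented (just like in the spreadsheet), and I'm using pandas, a popular Python library for tabular data, purely for illustration:

```python
import pandas as pd

# Five fake single-family homes: each row is one house (our unit of
# observation), and each column is one characteristic of that house.
homes = pd.DataFrame({
    "sale_price":     [350_000, 410_000, 285_000, 525_000, 390_000],  # whole dollars
    "square_ft":      [1_800, 2_200, 1_450, 3_100, 2_000],            # whole feet
    "lot_size_acres": [0.25, 0.40, 0.18, 0.75, 0.33],                 # needs decimals
    "num_bedrooms":   [3, 4, 2, 5, 3],                                # whole rooms
})

print(homes.dtypes)  # pandas infers int64 or float64 from the values we chose
```

Notice how our unit choices show up directly in the data types: lot_size_acres comes out as a float column, while the other columns come out as integers.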

Alright, so we've already seen both floats and integers in action, just in how we've chosen to record our data: the units we've chosen to use and the attributes of our unit of observation (houses) that we've selected.

Can you see how many decisions we've already had to make, both implicitly and explicitly, just to collect our data? And I mean, we haven't even really collected it (it's fake data, after all), but there are still a lot of considerations to make. Navigating all of these different considerations as they pop up, and doing it wisely, is a big part of any data professional's job.

Tabular Data

Machine Learning Algorithms

Alright, now that we've set the stage, here comes a boatload of new terminology. I'm going to drown you in it, but after I've introduced it all, I'll call your attention to which of these terms are the most important to have absolutely memorized and which ones you just need to have a passing familiarity with.

A machine learning model does its job of connecting inputs to outputs through something called an "algorithm." An algorithm is a series of instructions, or you might say a "procedure," for what we should do to the inputs in order to turn them into the outputs; in our case, to turn our housing data into sale_price predictions.

When you think about giving a computer a series of steps or a set of instructions to be executed, how might we do that? Well, the way we give instructions to a computer is through code. So from a high level, an algorithm is a procedure that transforms our input data into output predictions, but from a practical perspective, an algorithm is just a computer program that takes in our data, does some computations, and returns predictions to us.
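To make "an algorithm is just a computer program" concrete, here's a deliberately naive, hand-written prediction procedure. The weights ($200 per square foot, and so on) are completely made up for illustration; the whole point of machine learning is that a model would learn numbers like these from the data instead of having a human hard-code them:

```python
def predict_sale_price(square_ft, lot_size_acres, num_bedrooms):
    # A hand-coded "algorithm": turn feature inputs into a prediction output.
    # These weights are invented; machine learning's job is to find good ones.
    return 200 * square_ft + 50_000 * lot_size_acres + 10_000 * num_bedrooms

print(predict_sale_price(square_ft=1_800, lot_size_acres=0.25, num_bedrooms=3))
# 402500.0
```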

One of the most basic and most important jobs a data scientist has is creating a –sort of– shortlist of algorithms that might be good at a specific modeling task, and evaluating them to see which one makes the best predictions. That's a pretty good summary of what we're building up to throughout this course with our Python knowledge.

I want to go over some of the most common categories of machine learning algorithms with you, but before that, we need to establish a little more vocabulary regarding our dataset.

The "Target" Variable.

In data science, the thing that we want to predict gets a special name: it's called the "target" or "target variable." In our case, what do we want to predict? sale_price. So our target variable is sale_price.

Target variable and Attributes

However, there are a lot of other names for this variable that are synonymous with "target" that you need to be aware of. Some of these terms are used a little bit more heavily in certain fields than others:

Synonyms for "target" variable

  • y variable
  • dependent variable
  • output variable
  • response variable
  • predicted variable
  • experimental variable
  • measured variable
  • regressand
  • outcome variable
  • explained variable
  • label (particularly if the y variable is categorical)

What a pain, right? I'm sure there's even a few that I'm missing in this list, but these are all synonyms for the same thing. It's important that you realize this because different individuals will tend to use these different terms based on their background and the whim of the moment.

For example, most of my original statistical and mathematical training comes from the field of economics where we used exclusively linear-regression models to try and estimate the causal effect of some x variable on the y variable. So I tend to say "y variable" a lot. If you hear me say that you'll know that I'm talking about the "target" variable, or, (in other words) the "thing that we want to predict".

"Features" aka "Attributes", aka "Predictors"

The terms, "features," "attributes", or "predictors" are some of the most common machine learning terms used to refer to all of the other columns in our tabular dataset. And again, those are all synonyous with one another. They mean the same thing. In our case square_ft, lot_size_acres, and num_bedrooms are all "features" of our dataset, or, features of the unit of observation "houses".

Target variable and Attributes

As you may have guessed there are a lot of synonyms here too:

Synonyms for "feature" columns/variables

  • features
  • attributes
  • predictors (or predictor variables)
  • x variables
  • independent variables
  • dimensions
  • input variables
  • explanatory variables
  • manipulated variables
  • exposure variables
  • regressors
  • observed variables

Again, I'm sure there's some that I'm missing. If I've missed a big one that's popular in your field, let me know and I'll add it to the list in the video transcript below this video.

You'll notice that all of these terms are plural because, while we have just one y variable or "target," we will (almost always) have more than one x variable.

So, you'll hear me use the terms "features," "attributes," and "x variables" most often. I think features and attributes are good terms for this, because those words capture how each additional column of data gives us additional information, almost like a new lens through which we can view our unit of observation (in this case, houses). Each new column holds a different feature of a house, a different attribute of a house.
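This vocabulary shows up directly in code, too. A very common convention in Python data work is to name the target y and the feature columns X (capitalized). Here's a minimal sketch using the fake homes table from earlier, rebuilt here so the snippet stands alone (assuming pandas is installed):

```python
import pandas as pd

homes = pd.DataFrame({
    "sale_price":     [350_000, 410_000, 285_000, 525_000, 390_000],
    "square_ft":      [1_800, 2_200, 1_450, 3_100, 2_000],
    "lot_size_acres": [0.25, 0.40, 0.18, 0.75, 0.33],
    "num_bedrooms":   [3, 4, 2, 5, 3],
})

y = homes["sale_price"]                                     # the "target"
X = homes[["square_ft", "lot_size_acres", "num_bedrooms"]]  # the "features"

print(X.shape, y.shape)  # (5, 3) (5,) -- many features, one target
```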

Okay, so now that we understand that there are different names given to the thing we want to predict and to all of our other data, let's get a high-level overview of what kinds of machine learning algorithms exist, and of how our y variable (target variable) is extremely important in driving the type of machine learning algorithms we might try to use to capture the signal in the x variables in a way that enables us to make the best possible predictions of the y variable.

There are many ways to categorize all of the different machine learning algorithms, but one common way is to group them into reinforcement learning, supervised learning, unsupervised learning, and semi-supervised learning.

Reinforcement Learning

Let's start with reinforcement learning. Reinforcement learning is the most different of the four categories I've listed here. It deals with having an "agent" that makes decisions as it interacts with its surrounding world, and as it does so, it is rewarded or punished for its behaviors, with the goal of maximizing its reward. You may have seen examples of reinforcement learning if you've ever watched videos of those floppy balloon characters trying to learn how to walk and jump, or maybe you've seen somebody training an agent to play a video game on its own. Some implementations of autonomous cars, or robots that make their own decisions about how to move around and interact with their world, use reinforcement learning as well.

We're not going to deal with any reinforcement learning models or concepts in this course, or for quite some time after, as their applications are somewhat specific and specialized. But Temzee will definitely teach some reinforcement learning techniques at some point.

Supervised Learning

Supervised learning is by far the most important category of machine learning algorithms, and the most powerful. In order to do "supervised" learning, you have to have the thing that you want to predict included in your dataset. For example, with our housing data, we have a sale_price column, and that's the thing we want to predict, so we've got the green light to do some supervised learning. The majority of all machine learning done today is supervised learning.

Going back to our bread-making machine analogy, imagine if I didn't have that perfect loaf of bread that I could hold up each time and say, "No, machine, that's not right, please make it more like this." How would the machine ever arrive at the correct outcome if it couldn't compare its bad outputs to the desired output to help it improve?

This is where the term "supervised" comes from. The model is "supervised" in the sense that I can give it correction by comparing its predictions to some recorded "ground truth" taken from real-world observations. We use that "truth" to calculate how good or bad the model's predictions are relative to the ground truth, and we give that calculation back to the model as feedback, so that it knows whether its predictions are improving or getting worse over its many iterations of supervised correction. We call this process of giving the model corrective feedback "training" the model.
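Here's what that looks like in practice: a minimal supervised learning sketch using scikit-learn's LinearRegression on our fake homes data (assuming scikit-learn and pandas are installed). The single fit() call handles the learn-from-the-ground-truth step for us; for this particular algorithm it's a direct least-squares solve rather than a long trial-and-error loop, but the supervised idea is the same: the model only works because we hand it the true sale prices.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

homes = pd.DataFrame({
    "sale_price":     [350_000, 410_000, 285_000, 525_000, 390_000],
    "square_ft":      [1_800, 2_200, 1_450, 3_100, 2_000],
    "lot_size_acres": [0.25, 0.40, 0.18, 0.75, 0.33],
    "num_bedrooms":   [3, 4, 2, 5, 3],
})

X = homes[["square_ft", "lot_size_acres", "num_bedrooms"]]  # features
y = homes["sale_price"]                                     # ground truth

model = LinearRegression()
model.fit(X, y)  # "supervised": the model gets to see the true sale prices

# Predict the sale price of a house the model has never seen before.
new_house = pd.DataFrame(
    {"square_ft": [2_400], "lot_size_acres": [0.50], "num_bedrooms": [4]}
)
print(model.predict(new_house))
```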

Unsupervised Learning

So, I bet you can guess what unsupervised learning is now. Unsupervised learning represents the category of models where our target variable (the thing we want to predict) is not included within our dataset. Because we don't have that y variable, it's much more difficult to give corrective feedback to our model during the training process. Due to this, unsupervised learning tasks typically involve exploring the inherent structure or patterns found within the x variables and using that structure to draw conclusions about the data points. One common example of this is clustering algorithms, which try to segment the dataset into groups of similar data points. These kinds of predictions are good for things like segmenting customers into different groups based on their buying patterns, grouping books or other documents by their general category, etc.
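Here's a tiny sketch of the unsupervised case using scikit-learn's KMeans clustering (again assuming scikit-learn and pandas are installed). Notice that there is no y anywhere: the algorithm only ever sees the features, and the "prediction" is a group assignment rather than a sale price.

```python
import pandas as pd
from sklearn.cluster import KMeans

# The same fake homes, but with no sale_price column: no target, no ground truth.
homes = pd.DataFrame({
    "square_ft":      [1_800, 2_200, 1_450, 3_100, 2_000],
    "lot_size_acres": [0.25, 0.40, 0.18, 0.75, 0.33],
    "num_bedrooms":   [3, 4, 2, 5, 3],
})

# Ask for two groups of similar houses based only on the x variables.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(homes)
print(labels)  # e.g. [0 0 0 1 0]: which group each house was assigned to
```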

Semi-supervised Learning

Semi-supervised learning is kind of a mixture of supervised and unsupervised learning. You would use semi-supervised learning when a portion, but not all, of your dataset is labelled. This typically happens when a dataset is too large or too expensive to be labelled manually, a scenario that is more likely to arise with really big datasets. For example, imagine a dataset of images of all of the galaxies observable from Earth. It's estimated that there are over 1 trillion galaxies in the observable universe, so it's not practical to have a human look at all of them to classify them into categories according to their shape (spiral, elliptical, irregular, etc.). But having some labelled data is better than none, and that can lead to a scenario where semi-supervised learning techniques might be useful.
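And a small sketch of the semi-supervised case, using scikit-learn's LabelPropagation (assuming scikit-learn is installed). The two numeric "galaxy" features and the handful of labels here are invented for illustration; scikit-learn's convention is that -1 marks an unlabelled row.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Ten "galaxies" described by two made-up numeric features; only two have
# been labelled by a human (0 = spiral, 1 = elliptical). The rest are -1.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [0.9, 1.2], [1.1, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.1], [5.2, 5.0], [4.9, 4.8]])
y = np.array([0, -1, -1, -1, -1, 1, -1, -1, -1, -1])

model = LabelPropagation()
model.fit(X, y)             # learns from the labelled AND the unlabelled rows
print(model.transduction_)  # the labels it inferred for every row
```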

Modeling Terminology

The Importance of Numbers in Machine Learning

So what does this all have to do with numbers (you know, floats and integers) and what we've learned about Python so far?

Well, think back to the example we've been working with: you're a data scientist working with real estate data, and you're trying to develop a model that can predict (as accurately as possible) the sale price of properties, in this case, specifically single-family homes.

You have data like the table we built above, and notice what every cell in that table is: a number, an integer or a float. Before a model can learn anything, the data has to be represented numerically, which is exactly why we've spent so much time on Python's numeric types.

The goal is to make the best predictions possible. The more accurate and true to the real world our predictions are, the more valuable those predictions tend to be.

However, the goal is not just to make a machine that does something. The goal is to make a machine that can learn how to do that something without being explicitly programmed to do it, maybe even learn how to do anything. Can you imagine how valuable a machine would be if it could learn how to do anything without you having to program it explicitly?

Here's a recap of the big ideas from this lesson, plus a preview of a couple of topics we'll cover soon:

Data -> Data Preparation -> Modeling -> Evaluation -> Deployment

  • Models work like machines (inputs and outputs)
  • Tabular data
  • Targets and features
  • No missing data, all numbers
  • Algorithms
  • Supervised vs. unsupervised learning
  • Classification vs. regression

Remember: we're not here just to learn Python. We're here to learn data science, and Python is really just the world's most popular programming language for doing data science.


Machine learning is the heart of data science: it's all about using numbers to make predictions. And it makes those predictions using algorithms, which, as we've seen, are just computer programs that turn data inputs into prediction outputs.