Data is everywhere!

One of the coolest things about data science is that pretty much everything in the world throws off data. As long as there's someone who cares enough to actually collect and record the information that's inherent in the world around us, then data science can be applied to it.

What this means is that if you have any nerdy interest, obsession, hobby, or passing fancy, you can probably find data relating to it that can be analyzed and used to make predictions.

As you grow more competent in data science techniques and in the sourcing of datasets, I would encourage you to keep an eye out for particularly interesting datasets that relate to topics that you're passionate about. I often encourage students to build portfolio projects with data that's exciting to them personally, even if that data wouldn't necessarily be valuable in a business context. It's a lot more fun and interesting to develop and demonstrate your data science proficiency by delving into topics that you care about.

I've seen impressive student projects that use data about UFO sightings, Simpsons episodes, beekeeping, sports statistics, Magic: The Gathering, video games, and dozens of other cool topics. Whatever you love, you can apply data science to it.

Ok Ryan, why are you preaching from your soapbox about sources of data at this point in the course? Well, because in this lesson we're going to introduce Python lists, and lists are perhaps the most fundamental way that we can store data in Python. As we start to introduce lists, I'm going to try my best to help these fundamental Python topics come alive by using actual data from lots of different domains, some of which you may never have considered to be candidates for the application of "Data Science" techniques.

Let's do it. Even though we've still got to cross our Ts and dot our Is in terms of thoroughly covering the minutiae of the Python programming language, this is where things start to get fun.

Our first "Data Structure"

To demonstrate why lists are going to be so useful to us, let's first try and do some very simple analysis without using lists.

Something that you should know about me: I am kind of into birds. I'm not like a hardcore birdwatcher, but I think birds are pretty underrated. I am particularly fascinated with penguins because they're so weird and cute. The way they have evolutionarily adapted to their environment over time is just amazing to me. So... in the vein of using datasets that are interesting to us personally, let's analyze some penguin data!

The Aquarium of the Pacific in Long Beach, California (as of the time of writing this lesson) houses 15 Magellanic penguins in its June Keyes Penguin Habitat, and they even have a webcam! Their names are: Anderson, Floyd, Gatz, Heidi, Kate, Lily, Ludwig, Mattson, Paddles, Patsy, Robbie, Roxie, Shim, Whatever, and Wally. No really, one of the female penguins is named "Whatever".

Zookeepers have to take measurements of their animals pretty regularly, partly to learn more about the animals but above all as an indicator of their health. The London Zoo famously has an annual weigh-in event where they have to record a weight for each of their 14,000-odd animals. Can you imagine trying to get a monkey to hold still long enough to weigh it? How do you even coax it to step onto a scale in the first place? How do you weigh a leaf-cutter ant, or a polar bear for that matter? I don't know, but there are people in the world who get paid to do these things, which is just so dope. That data must at least be valuable enough to the zoos and other scientists for them to go to the trouble of collecting it.

Anyway, here are the heights of each of our fifteen penguin friends in centimeters all stored to variables. (Unfortunately for this example this isn't real data. I've had to make it up, but I'm sure the aquarium actually has this data in a database somewhere.)
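Here is a sketch of what those fifteen assignments could look like. The values are the made-up heights used throughout this lesson; pairing them with the penguins in the order their names were listed is an assumption, as are the lowercase variable names.

```python
# Made-up heights in centimeters, one variable per penguin.
# Assumption: values are paired with names in the order the names were listed.
anderson = 74
floyd = 73
gatz = 74
heidi = 68
kate = 65
lily = 63
ludwig = 72
mattson = 72
paddles = 65
patsy = 61
robbie = 72
roxie = 64
shim = 73
whatever = 68
wally = 75
```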

Alright, there are the heights of all 15 penguins. Females tend to be smaller than males.

Now, let's try and do something with this data. Something basic.

Question for you (and if you can answer this with code, even better): what is the average (or mean) height of these 15 penguins?

Well, to find the mean we have to add all of the values up and then divide by the number of values that we're dealing with (15). Let's do it. This is going to suck to type out.

By the way, the \ symbols (backslashes) that you see in the code below are not doing division. A backslash at the very end of a line tells Python that the statement continues on the next line, so I'm using them to wrap what would ordinarily be a single long line of code into multiple lines so that it fits on the screen better. Python's documentation calls this "explicit line joining", and it's a trick you can use with long lines of code. Hopefully you don't have lines of code this long very often, because this code... well let's just say... it isn't the best.

The division symbol in Python is the forward slash: /.
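A minimal sketch of that calculation follows. The variable names and the pairing of heights with penguins are assumptions (the values are the made-up heights from this lesson), and the assignments are repeated only so that the cell runs on its own.

```python
# The same fifteen made-up heights (cm), repeated so this cell is self-contained.
anderson = 74
floyd = 73
gatz = 74
heidi = 68
kate = 65
lily = 63
ludwig = 72
mattson = 72
paddles = 65
patsy = 61
robbie = 72
roxie = 64
shim = 73
whatever = 68
wally = 75

# Add up every height. Each trailing backslash joins the line to the next one,
# so Python reads these three lines as a single statement.
total_height = anderson + floyd + gatz + heidi + kate + \
               lily + ludwig + mattson + paddles + patsy + \
               robbie + roxie + shim + whatever + wally

# Divide by the number of penguins (15) with the forward slash.
mean_height = total_height / 15

print(mean_height)
```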

69.26666666666667

Alright! The average height of the penguins in this exhibit is ~69.27 centimeters.

But man, wasn't that a pain to type out? Wouldn't it be nice if we could hold all of those related measurement values in some kind of a container? Or some kind of an, I dunno... a (cough) "structure" of data that would make it easier to work with them, so that we wouldn't have to type out each variable name one by one?

Well, we can (and should) use the "data structure" of a list to do this.

What is a Data Structure?

Up to this point we've been storing individual integers, booleans, floats, and strings to variables. But what if we could store multiple values within a single variable at the same time? Well, that's what a data structure is. Data structures are containers that can hold multiple values at a time. We then store the entire container of multiple values to a single variable (rather than 15 variables, in this specific case). There are a few different "data structures" that exist within the Python programming language. In my opinion, lists are the most important data structure for data scientists to know, which is why we're covering them first.

Here's how we would store all of our penguin_heights to a list in Python:
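A sketch of that cell, using the variable name penguin_heights and the fifteen made-up values from this lesson:

```python
# One list holding all fifteen made-up penguin heights (in centimeters).
penguin_heights = [74, 73, 74, 68, 65, 63, 72, 72, 65, 61, 72, 64, 73, 68, 75]

print(penguin_heights)
```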

[74, 73, 74, 68, 65, 63, 72, 72, 65, 61, 72, 64, 73, 68, 75]

Our variable this time is called penguin_heights. I often find myself using plural variable names for lists, because now I'm storing multiple values inside of a single variable. Then I tell Python where the list starts and ends by using a pair of square brackets: [] (found to the right of the letter p on your keyboard). And then, inside of the square brackets, we list off the different values separated by commas , to tell Python where one value ends and the next one starts. And there you have it, a list of penguin_heights.

Calculating an average (mean) using lists

Now that we've got all of these values inside of the list we can perform operations on the entire list at once rather than on individual values.

For example we can sum up the entire list by using the built-in sum() function.
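A minimal sketch of that call; total_height is a name I'm introducing just for illustration.

```python
# The list of made-up penguin heights (cm), repeated so the cell runs on its own.
penguin_heights = [74, 73, 74, 68, 65, 63, 72, 72, 65, 61, 72, 64, 73, 68, 75]

# sum() adds up every element in the list.
total_height = sum(penguin_heights)

print(total_height)
```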

1039

We can also find out the number of items held within the list with the built-in len() function (which is short for "length").

The technical name for a list item is an "element" by the way.
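Here is a sketch of the len() call on the same list; number_of_penguins is a name I'm introducing just for illustration.

```python
# The list of made-up penguin heights (cm), repeated so the cell runs on its own.
penguin_heights = [74, 73, 74, 68, 65, 63, 72, 72, 65, 61, 72, 64, 73, 68, 75]

# len() counts how many elements the list holds.
number_of_penguins = len(penguin_heights)

print(number_of_penguins)
```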

15

So we have 15 "elements" (items) in our list of penguin_heights.

Put these two operations together with a division sign and we get our mean_penguin_height.
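A sketch of the combined calculation:

```python
# The list of made-up penguin heights (cm), repeated so the cell runs on its own.
penguin_heights = [74, 73, 74, 68, 65, 63, 72, 72, 65, 61, 72, 64, 73, 68, 75]

# Mean = total of all elements divided by the number of elements.
mean_penguin_height = sum(penguin_heights) / len(penguin_heights)

print(mean_penguin_height)
```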

69.26666666666667

So, by using lists we're able to hold not just individual values, but a whole collection of useful and related data points. And now we can do useful things to this data while making our code simpler and more readable, and with much less typing. The code above works the same (and requires the same amount of typing) whether we have 15 penguins or 1,000,000 penguins, all because we can use lists to store multiple pieces of information to a single variable.
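To illustrate that last point, here is a small sketch that applies the exact same sum() / len() expression to two lists of different lengths. The shorter list, first_five_heights, is a hypothetical subset I'm using only for illustration: it holds the first five values of penguin_heights.

```python
# All fifteen made-up heights (cm), plus a shorter illustrative list.
penguin_heights = [74, 73, 74, 68, 65, 63, 72, 72, 65, 61, 72, 64, 73, 68, 75]
first_five_heights = [74, 73, 74, 68, 65]  # hypothetical subset for illustration

# The expression has the same shape no matter how long the list is.
mean_of_all = sum(penguin_heights) / len(penguin_heights)
mean_of_first_five = sum(first_five_heights) / len(first_five_heights)

print(mean_of_all)
print(mean_of_first_five)
```

Only the contents of the list change; the line that computes the mean never has to grow.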