We’ll look at an example (movie budget and revenue) of **linear regression** to help you grasp it.

Please don’t be confused: our real data for the example won’t look like the figure above; that figure is only plotted to aid understanding.

**Our linear regression will get two kinds of data.**

**It will get our film production budgets and it will get our film revenues.**

**The budgets will be our feature,** also called **the independent variable,** and **the revenue is** what we are trying to estimate – **that will be our target.**

What the **linear regression** will do is try and represent the relationship between the budget and the revenue as a straight line.

But here’s the question. **What kind of line?**

Let’s think back to high school math class and let’s think about what describes a line.

**From our math classes,**

we know that we can plot y as a function of X and that’s a line.

And if our line cuts the y-axis at 10, then we say that it has **an intercept of 10**; and if every time x increases by 2, y increases by 1, then we say that the line has a slope equal to one half.

In that case our equation would look something like this: **y = 1/2 x + 10**

And that means that the generic equation for a line would be something like this.

It would be **y = mx + c**, where **m is the slope** and **c is the constant**.
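As a quick sketch, the example line above can be written as a tiny Python function (the name `line` is just an illustrative choice, not anything from a library):

```python
def line(x, m=0.5, c=10):
    """Evaluate y = mx + c for slope m and intercept c."""
    return m * x + c

# Slope 1/2, intercept 10: every time x increases by 2, y increases by 1.
print(line(0))  # 10.0
print(line(2))  # 11.0
```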

So let me ask you this.

What part of the equation for the line would tell you about how **strong the relationship is between x and y?**

**In this case the slope is the key.**

The slope tells us how much y will change for a given change in x – **the larger the value of the slope, the steeper the line** becomes.

Let’s take a look at an example where there is **no relationship between x and y.** If there is no relationship, the slope is zero and we would simply have a flat, horizontal line.

But if there is a relationship between the two, the line would slope upward or downward – and the stronger the relationship, the steeper the slope.

But here’s the thing.

There’s a **big difference between machine learning and pure mathematics;** in machine learning, we even use a different notation.

We will replace the c for the constant with **theta 0**, and the **slope coefficient** will be written as **theta 1**. We’ll also change the order of the equation, so we’ll have the constant first and then the slope. And instead of **writing y**, what you’ll often see is **h theta of x**, where h stands for hypothesis.
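In that notation, the same line could be sketched like this (the function name `hypothesis` and the theta values are just placeholders for illustration):

```python
def hypothesis(x, theta0, theta1):
    """h_theta(x) = theta_0 + theta_1 * x: constant first, then slope."""
    return theta0 + theta1 * x

# The same line as before, renamed: theta0 is the intercept, theta1 the slope.
print(hypothesis(2, theta0=10, theta1=0.5))  # 11.0
```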

But at this point we still haven’t talked about where the line ultimately comes from.

**How do we know which line to draw? Looking at the data, we just have data points.**

**There is actually no line, right?**

And as a matter of fact, you can draw a whole bunch of different lines through the same set of data points. **So, which line is best?**

Which line would you choose?

Which line has the best possible theta zero and best possible theta one?

*If our dataset looked just like this, our job would be easy.*

All we would have to do is connect all the data points with a straight line.

And this also seems like the best option because we would know that in this case our estimates for theta zero and theta one would be very accurate.

**However, real data looks more like this.** **If we were to draw a line through this data, there would always be a gap between the actual value and the line.**

In other words there would be a difference between the actual data point and the point on the line.

The point on the line here, that’s called the fitted value or the predicted value.

**But let’s talk more about these gaps because it’s these gaps that will help us choose the best possible intercept and the best possible slope for our line.**

**These white lines are actually called residuals.**

**Now, why will the residuals help us choose the best possible line for our data?**

Let me show you another line that we can draw to this data.

**With this line, the residuals are way bigger – so the residuals can tell us something about how good a line we’re drawing on this chart.**

So now we have a measure by which to compare the different lines we can draw through the data: all we have to do is look at the size of the residuals and choose the line with the smallest residuals. **And that’s great because now our algorithm has a very clear objective.**

The goal of our linear regression is going to be to calculate the line that minimizes these residuals. **But how exactly should that work?**

That first residual is going to be the **difference between the actual value**, the **y1**, and the predicted value, which is the one on the line. That second residual is likewise the difference between the actual value, in **white here**, and the **fitted value in green.** The same is true for that third data point. **In this case, what we can’t do is just add the residuals up and find the lowest sum, because that second data point is below the line.**

Now suppose we actually have calculated the values for these residuals and these residuals have the values 10, negative 6, and 4.

**We have a negative number here.**

So what we have to do instead is turn all of these numbers positive, and the way we can do that is by squaring the residuals.

**Now what we’ve got is a single number – the sum of the squared residuals.**
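Using the three residuals from the example, the squaring step and the final sum can be sketched like this:

```python
residuals = [10, -6, 4]

# Squaring makes every residual positive, so negatives cannot cancel positives.
squared = [r ** 2 for r in residuals]
print(squared)  # [100, 36, 16]

# The single number the linear regression will try to minimize.
rss = sum(squared)
print(rss)  # 152
```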

This is the number that the linear regression will try to minimize in order to choose the best parameters for the line.

In other words, to find the best possible fit for our regression, we need to choose an intercept – **theta zero** – and a slope – **theta one** – that minimize the sum of the squared residuals.
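For simple linear regression there is a closed-form least-squares solution for the theta values that minimize this sum. Here is a minimal sketch, assuming a small made-up budget/revenue dataset (the numbers are invented purely for illustration):

```python
# Made-up budget/revenue pairs (say, in millions), purely for illustration.
budgets = [10, 20, 30, 40, 50]
revenues = [25, 32, 48, 55, 60]

n = len(budgets)
mean_x = sum(budgets) / n
mean_y = sum(revenues) / n

# Closed-form least-squares estimates: slope (theta1) first, then intercept.
theta1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(budgets, revenues)) \
    / sum((x - mean_x) ** 2 for x in budgets)
theta0 = mean_y - theta1 * mean_x

# Residual sum of squares for the fitted line.
rss = sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(budgets, revenues))

print(round(theta0, 2), round(theta1, 2))  # theta0 ≈ 16.1, theta1 = 0.93
```

In practice a library routine such as Python’s `statistics.linear_regression` (3.10+) or scikit-learn’s `LinearRegression` computes these same estimates.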

And you’ll see this number also being referred to as the residual sum of squares, or **RSS**.