Imagine you wanted to predict whether someone had a risk of contracting diabetes. There would be a number of factors to consider. Some of those factors would be their age, their sugar levels, their blood pressure, their family history, their weight, their body mass index (BMI) and so on. Unlike the single linear regression model we built earlier, there are more factors/attributes to work with.
In this example of a simple linear model, the income for the year 2020 was predicted using year alone as the independent variable. The problem with using only a single variable to train models and make predictions is that there will most likely be a big margin of error between the prediction the model will make and the actual value. In machine learning, this is referred to as the cost function. Take an example of the diabetes prediction example i mentioned earlier. Imagine using only the age to make a diabetes prediction.
Using the linear regression formula we had explored earlier : y = mx+c
It would look like this:
diabetes = (age * coefficient/gradient) + Constant — This is a single variable linear regression.
Using the multiple linear regression, then the model will take in more features than just the age. Hence: It would appear as such:
diabetes = (age*age’s coefficient) + (weight*weight’s coefficient) + (SugarLevels * SugarLevel Coefficient) + Constant
Note: For the equation above, I have selected just a few features for illustration purposes. However, the formula is the same. The selection of features you might want to use for your model training will vary depending on how knowledgeable you are in the domain or even if you are not, this is knowledge that can be obtained through learning and mastering Feature Engineering. I will be covering this aspect of machine learning as I progress through my weekly blogs.
(Just in case you are scratching your face wondering what the constant in the equation means, Think of it this way. You just got a new job and you have 0 years experience. That doesn't mean that they will pay you nothing. You probably will have to start from a certain standard lowest salary level. In this case, that number is referred to as the constant.)
Now that we discussed the idea of what a cost function is, there has to be a way of calculating that. Besides, so far, there hasn't been a way of telling whether the models we are building are performing well or not. Yet, that’s the whole point of machine learning. Moving on, we are going to be doing some weight-lifting. Good news is, it’ll start with some baby steps as we analyze the models and improve performance and even learn how to select better features for model training.
For this weeks simple multiple linear regression model, i used a fish data set from kaggle to predict the weight of fish given different dimensions of the fish. The Jupyter notebook is well commented and I dont assume that you know anything. So dont worry.
Expect the next model and blog next week Friday. (3/09/2021).
Thanks for reading!