After the previous post on data processing pipelines, we are now ready to move on to analysis in data science! As I introduced in an older post, there are many interesting types of analysis. While we will eventually go through all of them, right now we need to pick a starting point. So, let us begin with regression analysis, which is on the easier side of predictive analysis.
Regression analysis
From my practical perspective, regression analysis means quantifying the relationship between a numeric target and other features in the data. Here, both the target and the features are columns in the data, and both must be present for the analysis. In general, the quantification comes in the form of a regression model: an algorithm that takes the feature values as input and outputs a prediction, a guess, for the target. Features and targets are linked on a row basis. Specifically, the model relates the feature values in each row to the target in that same row.
In general, we can divide a regression analysis into three phases: training, evaluating, and inferring. During training, the model observes the input data, including both the features and the true targets, to learn a mapping between them. When training finishes, we usually want to know whether the model learned well or not, so we perform an evaluation to verify that. Finally, if we are happy with the evaluation, we use the model to make inferences on new data using the learned mapping. This is also called making predictions.
Of course, regression analysis is not always for the purpose of prediction. It is also used to explain relationships between features and targets. But the predictive type seems to be more prevalent now, so let us focus on that.
Illustrative example
We will start very simple, with a small illustrative example on a data set that consists of two columns, studytime (the feature) and testscore (the target), for five students. As a regression analysis, we want to build a model that can predict any student's testscore using their studytime.
studytime | 4 | 3 | 5 | 6 | 8 |
testscore | 70 | 68 | 76 | 85 | 90 |
After some time investigating and analyzing the data, we come up with the equation below (we will discuss how to do this in the next post). This is the training phase.
testscore = 53 + 5*studytime
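How the equation is found is left for the next post, so the following is only a rough sketch of one possible training step, not necessarily the method used here. It assumes ordinary least squares, and the helper name fit_line is made up for illustration. On the five students above it gives an intercept of roughly 52.78 and a slope of roughly 4.81, which are close to the rounded 53 and 5 used in this example.

```python
# Data from the table above: hours studied and test scores of five students.
studytime = [4, 3, 5, 6, 8]
testscore = [70, 68, 76, 85, 90]


def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Least squares is an assumption made for this sketch; the post does not
    say how its equation was obtained.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (both unnormalized).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(studytime, testscore)
print(round(intercept, 2), round(slope, 2))  # 52.78 4.81
```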
Now, how do we evaluate this model? The easiest way is to check how correct its predictions are. So, we apply the equation back on the data to obtain the predicted testscore of the five students. Upon checking these predictions, we see that they are close enough to the true testscore, so we happily accept the model and finish the evaluation phase.
studytime | 4 | 3 | 5 | 6 | 8 |
equation | 53+5*4 | 53+5*3 | 53+5*5 | 53+5*6 | 53+5*8 |
predicted testscore | 73 | 68 | 78 | 83 | 93 |
true testscore | 70 | 68 | 76 | 85 | 90 |
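The same check can be sketched in a few lines of Python. The function name predict_testscore is made up for illustration, and summarizing "close enough" with the mean absolute error is my assumption here; the table itself only compares the numbers by eye.

```python
# Data from the table: hours studied and the true test scores.
studytime = [4, 3, 5, 6, 8]
true_testscore = [70, 68, 76, 85, 90]


def predict_testscore(hours):
    """Apply the example's equation: testscore = 53 + 5 * studytime."""
    return 53 + 5 * hours


# Evaluation: predict for the same five students and compare with the truth.
predicted = [predict_testscore(h) for h in studytime]
errors = [p - t for p, t in zip(predicted, true_testscore)]

# Mean absolute error: one assumed way to summarize how far off we are.
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

print(predicted)       # [73, 68, 78, 83, 93]
print(errors)          # [3, 0, 2, -2, 3]
print(mean_abs_error)  # 2.0
```

On average, the predictions are off by 2 points, which matches the impression from the table that they are close to the true scores.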
With that, we can now infer the testscore of any student if we know their studytime. For example, if a student studied for 2 hours, we predict their score to be 53 + 5*2 = 63; if one studied for 10 hours, the score could be 53 + 5*10 = 103, and so on.
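As a small sketch, the inference phase is just the equation wrapped in a function and called on new inputs. The name predict_testscore is again hypothetical.

```python
def predict_testscore(hours):
    """Apply the example's equation: testscore = 53 + 5 * studytime."""
    return 53 + 5 * hours


# Inference: students who were not part of the original five.
for hours in (2, 10):
    print(hours, predict_testscore(hours))
# 2 63
# 10 103
```

Note that both 2 and 10 hours fall outside the 3 to 8 hours seen in the data, so the model is extending its learned pattern beyond what it observed.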
Of course, an actual regression analysis is nowhere near this simple in any of its phases. Nevertheless, I hope this example is illustrative enough for you to understand the concept of a regression analysis.
Wrapping up
As you can see, this post was light on code. Do not worry, there will be a lot more coming later. This was just a quick introduction to, and illustration of, what a regression analysis looks like. Next, we will start discussing models for the regression task. So, see you again!