Predictive Analysis

an illustration of different predictive analysis tasks

As we discussed in the last post, predictive analysis is a family of analysis tasks that aim to learn useful knowledge from historically collected data. The learned knowledge is then applied to future or new data to make inferences that support decision making. In this post, I will discuss the subtasks of predictive analysis, including classification, regression, forecasting, clustering, anomaly detection, and association rules. But first, let us talk about the tools used in such tasks: analytical models.

Analytical Models

an illustration of a predictive analysis model that observes data to generate desirable outputs

Analytical models, or models for short, are algorithms that take data as input and generate outputs for a given task. For example, if the task is to predict students’ GPAs, you may have a model that observes students’ information and outputs their GPAs. In predictive analysis, models are the main (and probably the only) tool for deriving the results you want.

Most, if not all, predictive models must be trained, or fitted, before you can use them to make predictions or inferences on data. Training or fitting a model means giving it a set of sample data so that the model can learn how to solve the task at hand. A trained model can then predict targets for data similar to the data it learned from. The data given to a model to train it is called training data.
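To make the fit-then-predict workflow concrete, here is a minimal sketch in Python. The toy model, the GPA numbers, and the mean-prediction rule are all made up for illustration; real predictive models are far more sophisticated, but they follow the same two steps.

```python
# A toy "model" that learns the average GPA from training data
# and predicts that average for any new student.
class MeanGPAModel:
    def fit(self, training_gpas):
        # "Training" here is just memorizing the average of the samples.
        self.mean_gpa = sum(training_gpas) / len(training_gpas)
        return self

    def predict(self, new_students):
        # Predict the learned average for every new student.
        return [self.mean_gpa for _ in new_students]

# Hypothetical training data: GPAs of students we already know.
training_gpas = [3.2, 3.8, 2.9, 3.5]

model = MeanGPAModel().fit(training_gpas)   # train / fit
print(model.predict(["Alice", "Bob"]))      # predict for new data
```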

Classification

an illustration of classification in predictive analysis

Probably the most common task in predictive analysis is classification. Classification means assigning or predicting categories for instances in the data based on their given attributes. The category to be predicted is called the target, or the label. Examples of classification include predicting whether a student is in poor, average, or good academic standing; whether a credit user is in good or bad financial standing; whether a patient’s diagnosis is positive or negative; and so on.

To solve a classification task, we naturally need classification models. A critically important note on classification models is that their training data must include both the features and predefined targets. For instance, if you want to train a model to predict a patient’s diagnosis, the training data must come from patients who have already been diagnosed.
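As a small sketch of how training data with both features and predefined targets is used, the example below fits a classifier, assuming scikit-learn is available. The feature choices and numbers are illustrative assumptions, not real data.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: each row is a student described by
# [average daily study hours, attendance rate], with a known label.
X_train = [[0.5, 0.60], [1.0, 0.75], [2.0, 0.85], [3.5, 0.95], [4.0, 0.98]]
y_train = ["poor", "average", "average", "good", "good"]  # predefined targets

model = DecisionTreeClassifier()
model.fit(X_train, y_train)            # learn from labeled examples

# Predict academic standing for new, unlabeled students.
print(model.predict([[1.5, 0.80], [3.8, 0.97]]))
```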

Regression

an illustration of regression in predictive analysis

The regression task is fairly similar to classification in that you are trying to assign or predict labels for instances in the data. The difference is that, in regression, the target is a continuous number. Some examples of regression are predicting monthly product sales based on product types, categories, and advertising spend; predicting students’ first-year GPAs based on their majors, high school GPAs, and daily study time; predicting the severity of patients’ conditions from their medical test measurements; and so on.

Also similar to classification, training data in regression must include both the features and the labels. Together, regression and classification form supervised learning, a branch of data analytics in which model training requires predefined labels.
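The workflow mirrors classification, except the target is a number. A minimal sketch, again assuming scikit-learn and made-up student data:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [high school GPA, daily study hours]
# paired with the first-year GPA actually achieved (a continuous target).
X_train = [[3.0, 1.0], [3.4, 2.0], [3.8, 2.5], [2.8, 0.5], [3.9, 3.0]]
y_train = [2.9, 3.2, 3.6, 2.5, 3.8]

model = LinearRegression()
model.fit(X_train, y_train)            # same fit step as in classification

# Predict a continuous GPA for a new student.
print(model.predict([[3.5, 2.0]]))
```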

Forecasting

an illustration of forecasting future using historical data in predictive analysis

Forecasting is a special type of supervised learning in which your data has a time dimension. In other words, instances in the data are measurements of the same objects taken over a period of time. Forecasting models typically learn to predict future targets using historical and current feature data. Here, the target can be either categorical or numeric. Some examples of forecasting are predicting future stock prices, predicting whether a stock’s trend will be up or down, predicting future product sales, and so on.
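One simple way to cast forecasting as supervised learning is to use recent past values as features and the next value as the target. The sketch below does that with a linear model; the sales numbers and the 3-day window are assumptions for illustration only, and real forecasting methods are usually more elaborate.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical daily sales of one product over time.
sales = [100, 105, 98, 110, 120, 115, 130, 128, 140, 135]

# Build supervised examples: features = the last 3 days, target = the next day.
window = 3
X_train = [sales[i:i + window] for i in range(len(sales) - window)]
y_train = [sales[i + window] for i in range(len(sales) - window)]

model = LinearRegression().fit(X_train, y_train)

# Forecast the next value from the most recent 3 observations.
print(model.predict([sales[-window:]]))
```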

Clustering

an illustration of clustering in data

Clustering is the task of determining natural groupings of instances in your data. In other words, clustering models decide the group to which each instance in the data belongs. In general, instances in the same group should be more similar to one another than to instances in different groups.

Unlike supervised models, clustering models cannot use predefined labels to learn. Instead, they observe the given data and derive the labels (here, the groups) themselves. For this reason, clustering belongs to unsupervised learning, another branch of data analytics, in which model training is done without predefined targets.

An example of clustering is finding groupings of students based on their grades in math, physics, chemistry, literature, and history. In this case, a model may find clusters of students who are strong in the science subjects, students who are strong in the social subjects, and students who do well in both. These group assignments are not defined in the training data but are derived by the model.
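Here is a minimal sketch of that student example with k-means, assuming scikit-learn; the grades and the choice of three clusters are illustrative assumptions.

```python
from sklearn.cluster import KMeans

# Hypothetical grades: [math, physics, chemistry, literature, history].
X = [
    [92, 90, 88, 60, 65],   # strong in science subjects
    [95, 88, 91, 55, 58],
    [58, 62, 60, 90, 93],   # strong in social subjects
    [55, 60, 57, 95, 90],
    [88, 85, 90, 89, 92],   # strong in both
]

# No labels are given; the model derives the groupings itself.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
print(model.fit_predict(X))   # cluster index assigned to each student
```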

Anomaly detection

an illustration of anomaly detection in predictive analysis

Anomaly detection is the task of finding rare instances in data. These instances are usually very different from the others, or abnormal, and occur in very small numbers. An example of anomalies is shown in the image above. In this case, we have air pressure measurements recorded from a production line. Normally, the pressure fluctuates in a regular pattern. The two periods where the pressure becomes flat or much more amplified can be considered anomalies.

Similar to clustering, anomaly detection belongs to unsupervised learning, and therefore its training data does not come with predefined labels.
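The pressure example can be illustrated with a simple statistical sketch: flag windows whose local variability departs strongly from the typical level. The synthetic signal, window size, and threshold below are all made up for illustration; dedicated anomaly detection models are more general.

```python
import numpy as np

# Hypothetical pressure readings: a regular oscillation, with one flat
# stretch and one over-amplified stretch injected as anomalies.
t = np.arange(300)
pressure = np.sin(t / 5.0)
pressure[100:130] = 0.0                             # flat period
pressure[200:230] = 3 * np.sin(t[200:230] / 5.0)    # amplified period

# Measure local variability with a rolling standard deviation, then flag
# windows whose variability is far from the typical level.
window = 20
rolling_std = np.array([pressure[i:i + window].std()
                        for i in range(len(pressure) - window)])
typical = np.median(rolling_std)
anomalous = np.abs(rolling_std - typical) > 0.5 * typical

print(np.where(anomalous)[0])   # start indices of the unusual windows
```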

Association learning

an illustration of association learning in data

Association learning is the task of discovering correlations among entities in data. Note that, here, entities may be different from data instances. For example, in a patient data set, the entities could be diseases rather than patients, and in a student data set, the entities could be courses.

A very common application of association learning is the market basket problem, in which stores use their customers’ purchase history to determine shopping patterns. The final analysis result is a set of rules suggesting that certain combinations of products likely lead to the purchase of other products. For example, someone who has already bought bread and cheese is more likely to also buy ham. Association learning is also an unsupervised problem.
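Here is a minimal sketch of the market basket idea in plain Python: it counts how often item pairs co-occur and computes the confidence of one hand-picked rule. The baskets and the rule are made-up examples; dedicated association-rule algorithms such as Apriori do this systematically over all item sets.

```python
from itertools import combinations
from collections import Counter

# Hypothetical purchase history: each basket is the set of items one
# customer bought in a single visit.
baskets = [
    {"bread", "cheese", "ham"},
    {"bread", "cheese", "ham", "milk"},
    {"bread", "cheese"},
    {"milk", "eggs"},
    {"bread", "ham"},
    {"cheese", "ham", "wine"},
]

# Count how often each pair of items is bought together (its support).
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Confidence of the rule {bread, cheese} -> ham:
# among baskets containing bread and cheese, how many also contain ham?
antecedent = {"bread", "cheese"}
with_antecedent = [b for b in baskets if antecedent <= b]
confidence = sum("ham" in b for b in with_antecedent) / len(with_antecedent)

print(pair_counts.most_common(3))
print(f"confidence(bread, cheese -> ham) = {confidence:.2f}")
```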

To sum up

I discussed quite a few concepts in this post, which I hope were not overwhelming. I find all of these predictive analysis tasks very interesting and widely applicable, and I hope you do too. We will explore each of them one by one in future posts.
