
What is an ensemble technique?
In a nutshell, ensemble techniques combine multiple individual models into a more powerful one, which lets us make better predictions and improve results compared to any single model. In ML courses, we may have read about bagging and boosting, where many decision trees are combined to build powerful Random Forest and XGBoost models.
In the above examples, you are combining similar kinds of models: multiple decision trees are combined to build a stronger model. But what if we want to combine different kinds of models, such as UCM, GAM, neural networks, and Random Forest? One way would be to assign a fixed weight to each of them, but that offers little flexibility.
To overcome these gaps, we introduce the OPERA package, which can be useful in many such scenarios.
OPERA Theory
OPERA – How does it work?
Important terminology:
- Experts: the set of individual techniques whose predictions are combined
- Mixture: builds the aggregation-algorithm object
- Predict: makes predictions using the fitted algorithm
- Oracle: evaluates the performance of the experts and provides a benchmark against which the combining algorithm is compared
The mixture() function:

mixture(
  Y = NULL,
  experts = NULL,
  model = "MLpol",
  loss.type = "square",
  loss.gradient = TRUE,
  coefficients = "Uniform",
  awake = NULL,
  parameters = list()
)
Important arguments:
• Y: the data stream (time series) to predict
• experts: the matrix of expert predictions, one column per expert
• model: the aggregation method ("EWA", "FS", "Ridge", "MLpol", "OGD")
• loss.type: the loss function ("square", "absolute", "percentage", ...)
A minimal call on toy data is sketched below.
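To make the interface concrete, here is a minimal sketch on simulated data; the series Y and the two toy experts below are illustrative assumptions, not part of the article.

library(opera)

# Illustrative data: a noisy signal and two imperfect "experts" (assumed, not from the article)
set.seed(1)
n <- 50
Y <- sin(seq_len(n) / 5) + rnorm(n, sd = 0.1)
experts <- cbind(expert.A = Y + rnorm(n, sd = 0.2),
                 expert.B = Y + rnorm(n, sd = 0.3))

# Fit the online aggregation rule sequentially over the stream
agg <- mixture(Y = Y, experts = experts, model = "MLpol", loss.type = "square")
summary(agg)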
The oracle() function:

oracle(
  Y,
  experts,
  model = "convex",
  loss.type = "square",
  awake = NULL,
  lambda = NULL,
  niter = NULL,
  ...
)
Possible values of the model argument:
• 'expert': the best fixed (constant over time) expert.
• 'convex': the best fixed convex combination of the experts (a vector of non-negative weights that sum to 1).
• 'linear': the best fixed linear combination of the experts.
• 'shifting': for every number m of switches, the sequence of experts with at most m shifts that would have performed best at predicting the observations in Y.
A minimal benchmark call is sketched below.
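Reusing the illustrative Y and experts from the mixture sketch above, the oracle benchmark might look like this:

# Best fixed convex combination in hindsight, as a benchmark
or.convex <- oracle(Y = Y, experts = experts, model = "convex", loss.type = "square")
print(or.convex)  # per-expert losses and the loss of the best convex combination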
In the image below, you can see how the weight given to each expert varies over time.
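A similar weight-evolution plot can be produced directly from a fitted mixture object (reusing agg from the sketch above):

plot(agg)  # includes a panel showing how the experts' weights evolve over time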
OPERA Implementation using Real Data
You can refer to the attached R code for the full script; here I give a glance at the important steps. We are using tourist data, which you can download here. Download data

## Using tourist data
dt <- read.csv("ts_visitors.csv")
dt <- ts(dt$United.Kingdom, start = c(1998, 4), frequency = 4)
train <- window(dt, start = c(1999, 1), end = c(2009, 4))
test <- window(dt, start = c(2010, 1))
forecast.period <- length(test)
########### Example 2: COMPLETE MODEL - TRAINING WITH 3 EXPERTS
library(forecast)
expert.1 <- forecast(auto.arima(train), h = forecast.period)
expert.2 <- forecast(ets(train), h = forecast.period)
expert.3 <- forecast(tbats(train), h = forecast.period)
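The snippets below refer to train.experts and test.experts, which are built in the attached script. A plausible construction, assumed here rather than taken from the article, is to use each model's in-sample fitted values as the training experts and its forecasts as the test experts:

library(opera)

# Assumed construction of the expert matrices (the attached script may differ):
# in-sample fitted values serve as training experts,
# out-of-sample forecasts serve as test experts.
train.experts <- cbind(ARIMA = expert.1$fitted,
                       ETS   = expert.2$fitted,
                       TBATS = expert.3$fitted)
test.experts  <- cbind(ARIMA = expert.1$mean,
                       ETS   = expert.2$mean,
                       TBATS = expert.3$mean)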
########### Building the mixture and oracle objects
MLpol <- mixture(Y = train, experts = train.experts,
                 loss.type = "square", model = "MLpol")
oracle.convex <- oracle(Y = train, experts = train.experts,
                        loss.type = "square", model = "convex")
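To inspect the fitted aggregation and compare it against the hindsight benchmark:

summary(MLpol)        # final weights and cumulative loss of the aggregation
print(oracle.convex)  # loss of the best fixed convex combination in hindsight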
########### Predicting using the trained MLpol object
# online = FALSE freezes the weights learned on the training sequence
z <- ts(predict(MLpol, newexperts = test.experts, online = FALSE, type = "response"),
        start = c(2010, 1), frequency = 4)
########### Calculating MAPE
MAPE <- mean(abs((test - z) / test))
MAPE
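As a cross-check, the forecast package's accuracy() function reports MAPE alongside other error measures:

accuracy(as.numeric(z), as.numeric(test))  # includes ME, RMSE, MAE, MAPE, ...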
Limitations and drawbacks
1. As this is an ensemble technique, extracting feature importance is challenging
2. The model tends to overfit the training data and might not perform as well on test data
3. When predicting with fixed (offline) weights, the weights learned on the training sequence stay static, and the experts' relative performance may not follow the same pattern on test data
Git Link: Download code from here.
References:
1. Research Gate Paper
2. Package Details on CRAN
3. Package Details on rdrr.io
4. Forecasting combinations by Rob J Hyndman
Sorry if you find my comment a bit blunt, but you miss the whole story of the algorithm: indeed, this is an ensemble-learning algorithm, but its goal is to predict as well as the best convex or linear combination in hindsight, without knowing the data in advance (that would be cheating!). For example, you can perform online linear regression (with the "Ridge" method) and get results close to (and sometimes even better than) the linear regression you could have performed if you had known all the data in hindsight. On top of that, the theoretical roots of the algorithms are rock solid. These are some of the valuable properties that make this family of algorithms outstanding.
If you want an introduction to the subject, here is a presentation by Gilles Stoltz, Pierre Gaillard's PhD advisor: https://www.youtube.com/watch?v=JcEU8lLyUcA
By the way, Pierre Gaillard is now a researcher at INRIA and is the author of the opera package; giving him some credit is the least you could have done.
PS: I'm not Pierre Gaillard, but a huge fan of his work.
Thanks a lot for taking the time to read my blog and for sharing your thoughts. What a wonderful explanation. Thanks a lot!