Forecasting Profits Direction: Logistic Model
In the previous posts we developed a random forest model and a walk-forward validation approach to forecast the direction of trades in a CAC40 dataset. Here we investigate a logistic model instead of random forests. Since we keep the walk-forward validation, we actually end up with an ensemble of logistic models, one for each walk-forward split.
Logistic modeling is quite common in binary classification problems: it is based on simple math that extends the linear regression approach, it has a nice Bayesian interpretation, and it can be seen as a single-layer perceptron, hence a first step into the world of neural networks and deep learning. Overall, it is a model worth testing!
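As a reminder, createDataPartitionWT is the custom walk-forward splitting helper introduced in the previous posts; it is not repeated here, but a minimal sketch of what such a splitter may look like is shown below (the fields part$num, part$train and part$test are assumptions based on how they are used in the training loop):
# minimal sketch of a walk-forward splitter, not the exact implementation
# from the previous posts: it grows the training window and keeps the
# following block of observations for testing
createDataPartitionWT = function(y, p = .75, num = 7) {
  n = length(y);
  test.size = floor(n * (1 - p) / num);       # size of each walk-forward test block
  train = list(); test = list();
  for(k in 1:num) {
    cut = floor(n * p) + (k - 1) * test.size;
    train[[k]] = 1:cut;                        # expanding training window
    test[[k]]  = (cut + 1):min(cut + test.size, n);
  }
  list(train = train, test = test, num = num);
}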
# splitting training and testing datasets (walk-forward)
library(caret);
part = createDataPartitionWT(y = data$dstart, p = .75);
# training the ensemble of logistic models, one per walk-forward split
glmFit = list();
for(k in 1:part$num) {
  training = data[part$train[[k]], fset];
  set.seed(567);
  ctrl = trainControl(method = "cv", number = 5, classProbs = TRUE,
                      savePredictions = "all", allowParallel = TRUE);
  glmFit[[k]] = caret::train(class ~ ., data = training, method = "glm",
                             family = "binomial", preProc = c("center", "scale"),
                             trControl = ctrl);
}
After training, we can check the ROC plots for both the testing and the verif datasets. The results are not surprising: sensitivity and specificity are much worse on the verification datasets. What is somewhat surprising is that, comparing the ROC on the verif datasets with the previous random forest model, the logistic classification appears to be worse, and this is confirmed by both the confusion matrix and the expected cumulative pnl for the coming week.
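The aggregation code is not shown in the original snippets; one possible way to obtain the verif confusion matrix and ROC below is to average the class probabilities of the k models of the ensemble (a sketch, assuming a single verif data frame holding the coming week's trades with the same feature columns):
library(pROC);
# average the P-class probabilities of the k logistic models (one per split)
probs = sapply(glmFit, function(m) predict(m, newdata = verif, type = "prob")[, "P"]);
verif$prob = rowMeans(probs);
verif$pred = factor(ifelse(verif$prob > 0.5, "P", "N"), levels = c("N", "P"));
# confusion matrix and ROC on the verification dataset
caret::confusionMatrix(verif$pred, verif$class, positive = "P");
ROC = roc(response = verif$class, predictor = verif$prob, levels = c("N", "P"));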
# verif results: confusion matrix
Confusion Matrix and Statistics

          Reference
Prediction  N  P
         N  5  7
         P  7  7

               Accuracy : 0.4615
                 95% CI : (0.2659, 0.6663)
    No Information Rate : 0.5385
    P-Value [Acc > NIR] : 0.8373

                  Kappa : -0.0833
 Mcnemar's Test P-Value : 1.0000

            Sensitivity : 0.5000
            Specificity : 0.4167
         Pos Pred Value : 0.5000
         Neg Pred Value : 0.4167
             Prevalence : 0.5385
         Detection Rate : 0.2692
   Detection Prevalence : 0.5385
      Balanced Accuracy : 0.4583

       'Positive' Class : P
auc(ROC);
Area under the curve: 0.494
Overall, the logistic model seems to be much less selective than the random forest one: we get more trades, 14 instead of 6, but unfortunately the forecasting accuracy is not improved, resulting in a much worse expected pnl (around -40 instead of +7).
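For reference, the expected cumulative pnl quoted above is simply the sum of the realized outcomes of the trades the model would have taken; a sketch, assuming the verif data frame carries the realized result in a column such as pnl_realized (the actual column name in the original dataset may differ):
# trades taken = those predicted as profitable ("P"); the expected cumulative
# pnl is the sum of their realized outcomes (pnl_realized is an assumed name)
taken = verif[verif$pred == "P", ];
nrow(taken);
sum(taken$pnl_realized);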
A nice thing about logistic models is that they support an easy interpretation of the forecasting result. It is sufficient to inspect the fitted models to identify the features that contribute most to the classification. For example, checking the seven logistic models with the summary() function, we find that only three features are really important in the classification: ctr_tp, ctr_sl and ccl_area.
summary(glmFit[[7]]);
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.8650 -1.0848 -0.8134  1.1854  1.9316

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.09850    0.08912  -1.105  0.26905
dstart      -0.06235    0.09139  -0.682  0.49506
dow          0.10158    0.10002   1.016  0.30979
hour        -0.05234    0.09246  -0.566  0.57130
pnl         -0.38413    0.24534  -1.566  0.11741
pf          -0.12350    0.11852  -1.042  0.29742
ctr_tp       0.68536    0.17379   3.944 8.03e-05 ***
ctr_sl      -0.54886    0.18478  -2.970  0.00298 **
pnl_day      0.21208    0.17444   1.216  0.22405
pwrong_max   0.02765    0.10431   0.265  0.79092
ccl_area     0.47819    0.16730   2.858  0.00426 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 745.86  on 538  degrees of freedom
Residual deviance: 712.07  on 528  degrees of freedom
AIC: 734.07

Number of Fisher Scoring iterations: 4
car::Anova(glmFit[[7]]$finalModel);
Analysis of Deviance Table (Type II tests)

Response: .outcome
           LR Chisq Df Pr(>Chisq)
dstart       0.4660  1   0.494842
dow          1.0347  1   0.309051
hour         0.3208  1   0.571135
pnl          2.4643  1   0.116458
pf           1.1863  1   0.276073
ctr_tp      16.8027  1  4.147e-05 ***
ctr_sl       9.3435  1   0.002238 **
pnl_day      1.4855  1   0.222909
pwrong_max   0.0702  1   0.791025
ccl_area     8.6183  1   0.003328 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
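Rather than reading the seven summaries one by one, the significant coefficients can be collected in a single loop; a small sketch (the 1% significance level is an arbitrary cut-off):
# collect the coefficient p-values of all the ensemble models and keep,
# for each model, the features significant at the 1% level
pvals = sapply(glmFit, function(m) coef(summary(m$finalModel))[-1, "Pr(>|z|)"]);
apply(pvals, 2, function(p) names(p)[p < 0.01]);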
We can push the analysis a step further by retraining a logistic model on these "important" features only and then analyzing how they behave together.
# fit a reduced set of features
training = data[part$train[[7]], fset];
fit = glm(class ~ ., data = training[c("ctr_tp", "ctr_sl", "ccl_area", "class")], family = "binomial");
# training results
summary(fit)
Call:
glm(formula = class ~ ., family = "binomial", data = training[c("ctr_tp", "ctr_sl", "ccl_area", "class")])
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.8616 -1.0799 -0.8389  1.2237  1.7239

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6088330  0.1720067  -3.540 0.000401 ***
ctr_tp       0.0623430  0.0146327   4.261 2.04e-05 ***
ctr_sl      -0.0923482  0.0232163  -3.978 6.96e-05 ***
ccl_area     0.0017915  0.0006155   2.911 0.003606 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 745.86  on 538  degrees of freedom
Residual deviance: 716.97  on 535  degrees of freedom
AIC: 724.97

Number of Fisher Scoring iterations: 4
# plot the features' behavior
library(visreg);
visreg(fit, xvar = "ctr_tp", by = "ccl_area", scale = "response", gg = TRUE, rug = 2);
visreg(fit, xvar = "ctr_sl", by = "ccl_area", scale = "response", gg = TRUE, rug = 2);
# plot ROC
library(ggplot2);
library(plotROC);
df = data.frame(class = training$class, prob = fit$fitted.values);
ggplot(df, aes(d = class, m = prob)) + geom_roc() + style_roc() +
  labs(title = "Ensemble ROC Plot (testing dataset)");
The probability of having a profit increases by increasing ctr_tp and decreasing ctr_sl, as expected. More interestingly, higher values of ccl_area contribute to a positive pnl but not in a linear way: it looks like only values greater than 100 are relevant.
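One quick way to check this visual impression is to replace ccl_area with a simple threshold indicator and compare the two fits; the 100 cut-off below is just the value suggested by the plots, not an optimized one:
# same reduced model, but with ccl_area replaced by a step at 100
fit.step = glm(class ~ ctr_tp + ctr_sl + I(ccl_area > 100),
               data = training, family = "binomial");
summary(fit.step);
AIC(fit, fit.step);   # a lower AIC would support the threshold-like behaviour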
The ROC plots of the seven models of the ensemble have similar shapes, which is good news: it means that the overall behavior of the logistic model with the reduced set of features is quite stable across the walk-forward splits. A threshold around 0.6 seems to be a common optimal point.
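One standard way to locate such an optimal point is Youden's J statistic on the training ROC; a sketch using pROC:
# "best" threshold according to Youden's J on the reduced-model fit
library(pROC);
ROC.tr = roc(response = training$class, predictor = fit$fitted.values, levels = c("N", "P"));
coords(ROC.tr, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"));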
Adopting such an optimal threshold should result in an improved estimated pnl.
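For completeness, applying the 0.6 cut-off only requires changing the classification rule; a sketch reusing the ensemble probabilities computed earlier on the verif dataset (pnl_realized is again an assumed column name):
# re-classify the verif trades with the stricter 0.6 cut-off and recompute
# the confusion matrix and the expected cumulative pnl
verif$pred06 = factor(ifelse(verif$prob > 0.6, "P", "N"), levels = c("N", "P"));
caret::confusionMatrix(verif$pred06, verif$class, positive = "P");
sum(verif$pnl_realized[verif$pred06 == "P"]);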