5 Comments
Vasco Gameiro

Hey, thank you for the great post! One question I had while reading this:

In the two-step process where you first extract an embedding from the decision tree’s geometry and then fit a logistic regression on top of it, aren’t we increasing the risk of overfitting by effectively using information about the target in the feature construction?

From how I see it, the embedding is derived after training the tree on labels, so the features passed to the logistic regression already contain target-informed structure. Doesn’t this leak label information into the features and make overfitting more likely?

Or do you think that the randomization/ensemble nature of the forests plus regularization on the logistic regression (e.g., ℓ₂) largely mitigates that risk?

Agus Sudjianto

Always use training/testing splits so that there is no data leakage. The purpose here is to convert the tree into a global linear representation and make the model interpretable. Thus, you can fit a regularized regression (ridge or logistic). You can always fit a logistic regression on one-hot-encoded (OHE) terminal leaves. If you apply the right L2 regularization, the result should be very close to XGBoost. But that model will not be interpretable because of the thousands of OHE columns. That is why I created a compact global linear representation space: you get better prediction plus an inherently interpretable model.
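For readers who want to try the baseline mentioned here (logistic regression on one-hot-encoded terminal leaves, with L2 regularization and a held-out test set), a minimal sketch follows. It is not the author's compact representation. It assumes scikit-learn, swaps in `RandomForestClassifier` for XGBoost, and uses synthetic data; the sample sizes, tree depth, and `C=0.1` are illustrative choices only. Fitting the trees and the linear model on disjoint halves of the training data is an optional extra guard against the target-informed-feature concern raised in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic data stands in for a real problem (assumption for illustration).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=0
)

# Hold out a test set that neither model sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Optional guard: fit the trees and the linear model on disjoint halves,
# so the linear model is not fit on rows the trees already memorized.
X_tree, X_lin, y_tree, y_lin = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0
)

# Shallow forest (stand-in for a boosted ensemble such as XGBoost).
forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
forest.fit(X_tree, y_tree)

# apply() returns, per sample, the index of the leaf reached in every tree.
leaves_lin = forest.apply(X_lin)    # shape (n_samples, n_estimators)
leaves_test = forest.apply(X_test)

# One-hot encode leaf indices; leaves unseen at fit time become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
Z_lin = encoder.fit_transform(leaves_lin)
Z_test = encoder.transform(leaves_test)

# LogisticRegression uses an L2 penalty by default; smaller C = stronger penalty.
clf = LogisticRegression(C=0.1, max_iter=1000)
clf.fit(Z_lin, y_lin)

n_ohe_columns = Z_lin.shape[1]
test_accuracy = clf.score(Z_test, y_test)
print(f"OHE columns: {n_ohe_columns}, test accuracy: {test_accuracy:.3f}")
```

Even with only 50 shallow trees the encoded design has hundreds of columns, which illustrates the interpretability problem described above: each coefficient refers to one leaf of one tree rather than to an input feature.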

raghuram kowdeed

I have a question. So the steps are: create the stretch matrix, then apply the transform by multiplying the features by the stretch matrix. Then fit a model per leaf, right? (Expanding with zeros on the rest of the leaves is the same as fitting a model per leaf, i.e., interacting beta with a discrete variable.) My question: since it is fitting each leaf with a linear transformation (the stretch matrix), would it be the same as just fitting a model per leaf on the features without any transform?
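The parenthetical claim in this question (zero-expanding the features across leaves is the same as fitting one model per leaf) can be checked numerically for unregularized least squares. The sketch below covers only that claim. It does not implement the stretch matrix, which is not defined in this thread. The random data and the `leaf` assignment are hypothetical stand-ins for a fitted tree's routing.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_leaves = 300, 3, 4

X = rng.normal(size=(n, p))
y = rng.normal(size=n)
# Hypothetical routing: which leaf each row falls into.
leaf = rng.integers(0, n_leaves, size=n)

# Add an intercept column so each leaf gets its own intercept and slopes.
X1 = np.hstack([np.ones((n, 1)), X])
width = p + 1

# (a) One separate least-squares fit per leaf.
beta_per_leaf = np.zeros((n_leaves, width))
for k in range(n_leaves):
    mask = leaf == k
    beta_per_leaf[k], *_ = np.linalg.lstsq(X1[mask], y[mask], rcond=None)

# (b) One joint fit on the zero-expanded design: block k carries the
# features of rows in leaf k and zeros for every other row.
X_expanded = np.zeros((n, n_leaves * width))
for k in range(n_leaves):
    mask = leaf == k
    X_expanded[mask, k * width:(k + 1) * width] = X1[mask]
beta_joint, *_ = np.linalg.lstsq(X_expanded, y, rcond=None)

# The joint coefficients, reshaped by leaf, match the per-leaf fits.
same = np.allclose(beta_joint.reshape(n_leaves, width), beta_per_leaf)
print("joint fit equals per-leaf fits:", same)
```

The match holds because the expanded design is block-structured: no row contributes to more than one block, so the least-squares problem separates by leaf.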

Agus Sudjianto

The leaf is for routing only. The model is the stretch matrix.

Agus Sudjianto

Not the same. This converts the tree split coordinates. The goal here is to get a global linear representation where you can fit linear models on a much smaller number of linear terms, so that you get better and more interpretable models.