Hey, thank you for the great post! One question I had while reading this:
In the two-step process where you first extract an embedding from the decision tree’s geometry and then fit a logistic regression on top of it, aren’t we increasing the risk of overfitting by effectively using information about the target in the feature construction?
From how I see it, the embedding is derived after training the tree on labels, so the features passed to the logistic regression already contain target-informed structure. Doesn’t this leak label information into the features and make overfitting more likely?
Or do you think that the randomization/ensemble nature of the forests plus regularization on the logistic regression (e.g., ℓ₂) largely mitigates that risk?
Always use train/test splits so that there is no data leakage. The purpose here is to convert the tree into a global linear representation and make the model interpretable. You can then fit a regularized regression (ridge or logistic). You can always fit a logistic regression on one-hot-encoded terminal leaves; with the right L2 regularization, the result should be very close to XGBoost. But that model will not be interpretable because of the thousands of one-hot-encoded columns. That is why I created a compact global linear representation space: you get better prediction plus an inherently interpretable model.
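To make the one-hot-encoded-leaves baseline concrete, here is a minimal sketch, assuming scikit-learn is available. It is not the compact representation from the post: a `RandomForestClassifier` stands in for the tree ensemble, the data is synthetic, and the value `C=0.1` is an arbitrary choice for the L2 strength. The forest and the encoder are fit on the training split only, so the test score is not contaminated by the labels used to build the features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic data stands in for a real dataset (assumption).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 1: fit the tree ensemble on the training split only.
forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
forest.fit(X_train, y_train)

# apply() returns, for each sample, the index of the leaf reached in each tree.
leaves_train = forest.apply(X_train)  # shape (n_train, n_estimators)
leaves_test = forest.apply(X_test)

# Step 2: one-hot encode leaf indices; the encoder sees training leaves only.
encoder = OneHotEncoder(handle_unknown="ignore")
Z_train = encoder.fit_transform(leaves_train)
Z_test = encoder.transform(leaves_test)

# Step 3: logistic regression on the leaf indicators.
# scikit-learn's default penalty is L2; smaller C means stronger shrinkage.
clf = LogisticRegression(C=0.1, max_iter=1000)
clf.fit(Z_train, y_train)

train_acc = clf.score(Z_train, y_train)
test_acc = clf.score(Z_test, y_test)
n_leaf_features = Z_train.shape[1]
print(f"leaf indicator columns: {n_leaf_features}")
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Even with 50 shallow trees this produces hundreds of indicator columns, which illustrates the interpretability problem mentioned above. The gap between train and test accuracy is the quantity to watch when tuning `C`.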
I have a question. So the steps are: create the stretch matrix, then apply the transform by multiplying the features by the stretch matrix. Then you fit a model per leaf, right? (Expanding with zeros on the rest of the leaves is the same as fitting a model per leaf, i.e., interacting beta with a discrete variable.) My question: since this fits each leaf with a linear transformation (the stretch matrix), would it be the same as just fitting a model per leaf on the features without any transform?
Not the same. This converts the tree's split coordinates. The goal is to get a global linear representation on which you can fit linear models with far fewer terms, so that you get better and more interpretable models.
The leaf is for routing only. The model is the stretch matrix.
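For reference, the per-leaf baseline raised in the question can be checked numerically. The sketch below, assuming NumPy and using random data with a hypothetical leaf assignment, shows that an unregularized least-squares fit on the zero-padded design gives the same coefficients as separate fits in each leaf. It does not implement the stretch matrix; it only illustrates what the approach is being contrasted with.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_leaves = 300, 3, 4

X = rng.normal(size=(n, p))
leaf = rng.integers(0, n_leaves, size=n)  # hypothetical leaf index per sample
y = rng.normal(size=n)

# (a) Separate least-squares fit inside each leaf (no intercept, for brevity).
betas_per_leaf = np.vstack([
    np.linalg.lstsq(X[leaf == k], y[leaf == k], rcond=None)[0]
    for k in range(n_leaves)
])

# (b) One joint fit on the expanded design: each sample's features are copied
# into the column block of its own leaf and every other block stays zero.
X_expanded = np.zeros((n, n_leaves * p))
for k in range(n_leaves):
    rows = leaf == k
    X_expanded[rows, k * p:(k + 1) * p] = X[rows]

beta_joint = np.linalg.lstsq(X_expanded, y, rcond=None)[0].reshape(n_leaves, p)

# The block structure makes the joint problem separable across leaves.
same = np.allclose(betas_per_leaf, beta_joint)
print("joint fit equals per-leaf fits:", same)
```

This baseline needs one coefficient block per leaf, so the number of terms grows with the number of leaves, whereas the replies above describe a representation with far fewer terms.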