Unbiased MDI-like feature importance measure for random forests #31279

Open
wants to merge 81 commits into main
Conversation

GaetandeCast
Contributor

@GaetandeCast GaetandeCast commented Apr 30, 2025

Reference Issues/PRs

Fixes #20059

What does this implement/fix? Explain your changes.

This implements two methods that correct the cardinality bias of the feature_importances_ attribute of random forest estimators by leveraging out-of-bag (oob) samples.
The first method is derived from Unbiased Measurement of Feature Importance in Tree-Based Methods, Zhengze Zhou & Giles Hooker. The corresponding attribute is named ufi_feature_importances_.
The second method is derived from A Debiased MDI Feature Importance Measure for Random Forests, Xiao Li et al. The corresponding attribute is named mdi_oob_feature_importances_.
The names are temporary; we are still looking for a way to favor one method over the other (currently investigating whether one of the two reaches its asymptotic behavior faster than the other).

These attributes are set by the fit method after training, if the parameter oob_score is set to True. In this case we send the oob samples to a Cython method at tree level that propagates them through the tree and returns the corresponding oob prediction function and feature importance measure.
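As a sketch of the intended usage (the attribute name `ufi_feature_importances_` follows this PR's description and is not part of released scikit-learn, so only the classical attribute is exercised below):

```python
# Hypothetical usage sketch: ufi_feature_importances_ is this PR's proposed
# attribute and is not in released scikit-learn, so it is only shown commented.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
forest.fit(X, y)

# Classical (cardinality-biased) MDI, available today:
mdi = forest.feature_importances_

# With this PR, fitting with oob_score=True would additionally set, e.g.:
# ufi = forest.ufi_feature_importances_
```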

This new feature importance measure behaves similarly to the regular Mean Decrease in Impurity but mixes the in-bag and out-of-bag values of each node instead of using the in-bag impurity alone. The two proposed methods differ in how they mix in-bag and oob samples.

This PR also adds these two new feature importance measures to the test suite, specifically in test_forest.py. Existing tests are extended to cover the two measures, and new tests are added to check that they behave correctly (e.g. they coincide with the values given by the code accompanying the cited papers, and they recover traditional MDI when computed on in-bag samples).

Any other comments?

The papers only suggest fixes for trees built with the Gini (classification) and Mean Squared Error (regression) criteria, but we would like the new methods to support the other available criteria in scikit-learn. log_loss support was added for classification with the ufi method by generalizing the idea of mixing in-bag and oob samples.

Some CPU and memory profiling was done to ensure that the computational overhead was controlled enough compared to the cost of model fitting for large enough datasets.

Support for sparse matrix input should be added soon.

This work is done in close collaboration with @ogrisel.

TODO:

  • Fix the tests related to oob_score_
    Done in d198f20
  • Add support for sparse input data (scipy sparse matrix and scipy sparse array containers).
    support: 8329b3b
    test: 0b48af4
  • Add support and tests for sample_weight
    Support added in f10721e. Test in 241de66
  • Expose the feature for GradientBoostingClassifier and GradientBoostingRegressor when row-wise (sub)sampling is enabled at training time.
    Done in ce52159
  • Shall we expose some public method to allow the user to pass held-out data instead of just computing the importance using OOB samples identified at training time?
  • Separate gradient boosting from this pr
    8a09b39
  • Update doc example on permutation vs mdi to include ufi & mdi_oob
    229cc4d
  • Think about an API to expose feature importance confidence intervals based on tree-level bootstrapping

Edit: We noticed a discrepancy between the formula defined by the authors of mdi_oob and what their code does. This is detailed here, in part 5. Therefore we only implement UFI for now. Furthermore, we could not find an equivalent of ufi for the entropy impurity criterion, so in classification we compute ufi with Gini whatever the impurity criterion, and in regression with MSE.

GaetandeCast and others added 30 commits April 14, 2025 17:43
Member

@ogrisel ogrisel left a comment


Here is a pass of feedback. I think many paragraphs of the example should be reworked, the intro in particular.

I will try to find the time to do a full review of this PR in the coming week.

@GaetandeCast GaetandeCast marked this pull request as ready for review June 18, 2025 14:00
Contributor

@antoinebaker antoinebaker left a comment


Here is a first pass! I didn't look at the Cython code, examples or tests yet, I'll try at another time :)

@GaetandeCast GaetandeCast force-pushed the unbiased-feature-importance branch from cd131bc to 83c0a1a Compare June 19, 2025 09:50
@GaetandeCast
Contributor Author

GaetandeCast commented Jun 25, 2025

I added return_as="generator" for memory performance which will break the tests on the CI machines that use joblib version 1.2, until the minimum version is bumped to 1.3

@antoinebaker
Contributor

I added return_as="generator" for memory performance which will break the tests on the CI machines that use joblib version 1.2, until the minimum version is bumped to 1.3

You could maybe condition on the joblib version?

@GaetandeCast
Contributor Author

GaetandeCast commented Jun 25, 2025

You could maybe condition on the joblib version?

We opened a PR to bump to the next version anyway as it will be old enough by next release 🤷‍♂️

@lorentzenchr
Member

Can someone give me a short summary and in particular the formulae of the implemented features importances?

@GaetandeCast
Contributor Author

GaetandeCast commented Jul 2, 2025

Can someone give me a short summary and in particular the formulae of the implemented features importances?

The classical MDI uses the impurity of the nodes computed during training, which in binary classification is:
$H(t) = 1 - p_{t,0}^2 - (1 - p_{t,0})^2$. The authors of UFI suggest using a modified impurity measure: $H'(t) = 1 - p_{t,0} p^\prime_{t,0} - (1 - p_{t,0})(1 - p^\prime_{t,0})$ where $p_{t,0}$ and $p^\prime_{t,0}$ are the proportions of in-bag (training) and out-of-bag samples of class 0, respectively, that go through the evaluated node $t$. Since Random Forests use bagging, these test samples are readily available as the out-of-bag (oob) samples.

Once these modified impurity measures are computed with the oob samples, we compute the same decrease in impurity as in traditional MDI: $\Delta^\prime (t) = \omega_t H^\prime (t) - \omega_l H^\prime (l) - \omega_r H^\prime (r)$, where $l$ and $r$ designate the left and right children of $t$ and $\omega_t$ is the proportion of training samples going through $t$.
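These two formulas can be sketched in NumPy as follows (illustrative only, not the PR's Cython implementation; the function names are ours):

```python
# Illustrative NumPy sketch of the UFI formulas above (not the PR's Cython code).
import numpy as np

def ufi_impurity(p_inbag, p_oob):
    # H'(t) = 1 - p * p' - (1 - p)(1 - p'), with p, p' the class-0 proportions
    # of in-bag and oob samples reaching node t.
    return 1.0 - p_inbag * p_oob - (1.0 - p_inbag) * (1.0 - p_oob)

def impurity_decrease(w_t, h_t, w_l, h_l, w_r, h_r):
    # Delta'(t) = w_t H'(t) - w_l H'(l) - w_r H'(r), with weights w the
    # in-bag sample fractions of the node and its children.
    return w_t * h_t - w_l * h_l - w_r * h_r

# Sanity check: when in-bag and oob proportions agree, H' reduces to the
# classical Gini impurity 1 - p^2 - (1 - p)^2.
p = 0.3
assert np.isclose(ufi_impurity(p, p), 1 - p**2 - (1 - p) ** 2)
```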

This extends to multi-class classification, and in regression we use a similar idea to compute a node impurity that mixes in-bag and out-of-bag information: $H^\prime(t) = \frac{1}{n^\prime_t} \sum_{x^\prime_i \in R_t}(y^\prime_i - \bar{y}_t)^2$, where $\bar{y}_t$ is the node value learned on in-bag samples, $(x^\prime_i, y^\prime_i)_i$ are the oob samples and $n^\prime_t$ is the number of oob samples reaching node $t$.
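The regression variant admits a short sketch as well (our own names; the decomposition in the comment follows from expanding the square):

```python
# Illustrative sketch of the UFI regression node impurity: mean squared
# deviation of the oob targets in a node around the in-bag node value.
import numpy as np

def ufi_regression_impurity(y_oob, node_value):
    return np.mean((y_oob - node_value) ** 2)

# Decomposition check: H'(t) = Var(y') + (mean(y') - node_value)^2, i.e. the
# measure adds a bias term to the oob variance when in-bag and oob disagree.
y_oob = np.array([1.0, 2.0, 3.0, 4.0])
node_value = 2.0  # hypothetical mean of the in-bag targets in the node
expected = y_oob.var() + (y_oob.mean() - node_value) ** 2
assert np.isclose(ufi_regression_impurity(y_oob, node_value), expected)
```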

I went into more details in this document, where I also compare this method to the alternative proposed in this paper.

@lorentzenchr
Member

lorentzenchr commented Jul 2, 2025

@GaetandeCast Thanks a lot for your summary.

TLDR (summary)

What is the difference from the out-of-bag (oob) score of a random forest?

Details

The Brier score in binary classification ($y \in \{0, 1\}$) is a different name for the squared error: $BS(y_{obs}, p_{pred}) = (y_{obs} - p_{pred})^2$.
Each tree node minimizes this score $\sum_{i \in node} BS(y_{obs, i}, p_{pred, i})$, meaning it predicts $p_{pred} = \bar{y} = average(y \in node) = \frac{1}{n_{node}}\sum_{i \in node} y_i$ (so we predict probabilities like good statisticians).
Plugging it back into the Brier score gives the Brier score entropy, aka Gini score, $Gini = \sum_{i \in node} \bar{y}(1 - \bar{y})$.

If one wants to use the oob sample instead of the training sample then one simply can use the Brier score. The same logic applies to all (strictly consistent) loss functions (and to regression).
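The identity in this argument (plugging the in-node mean back into the Brier score yields the Gini score) can be checked numerically; a self-contained sketch:

```python
# Numerical check of the argument above: with the node prediction set to the
# in-node mean, the summed Brier score equals n * ybar * (1 - ybar), i.e. the
# Gini score of the node.
import numpy as np

y = np.array([0, 0, 1, 1, 1], dtype=float)
p_pred = y.mean()  # node prediction minimizing the in-node Brier score
brier_sum = np.sum((y - p_pred) ** 2)
gini = len(y) * p_pred * (1 - p_pred)
assert np.isclose(brier_sum, gini)
```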

From the formula of the UFI suggestion, I am unable to recover the Brier score on the oob sample. Neither can I find any derivation in their paper. I also think it is plain wrong because they mix the empirical means of 2 independent data samples, i.e., $p$ and $p^\prime$. Unless I have not understood something fundamental, that, honestly, seems like nonsense to me.

Edit: If one restricts to the oob sample that flows through that fixed node, meaning $p_{pred}$ is constant, then one has $\bar{BS}=\bar{y}(1-2p_{pred})+p_{pred}^2$.
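This last identity also checks out numerically for binary labels (sketch; it follows from $y^2 = y$ when $y \in \{0, 1\}$):

```python
# Check of the formula: for y in {0, 1} and a fixed node prediction p,
# mean((y - p)^2) = ybar * (1 - 2p) + p^2, since y^2 = y for binary labels.
import numpy as np

y = np.array([0, 1, 1, 0, 1], dtype=float)
p = 0.42  # fixed node prediction, constant over the oob sample in the node
assert np.isclose(np.mean((y - p) ** 2), y.mean() * (1 - 2 * p) + p**2)
```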

Successfully merging this pull request may close these issues.

Unbiased mean decrease in impurity if tree-based methods