FIX Draw indices using sample_weight in Forest #31529
Conversation
```python
if sample_weight is None:
    sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)
```
There are two options for the random draw of indices when `sample_weight=None`:

1. Convert to all ones:

```python
if sample_weight is None:
    sample_weight = np.ones(n_samples)
normalized_sample_weight = sample_weight / np.sum(sample_weight)
sample_indices = random_instance.choice(
    n_samples, n_samples_bootstrap, replace=True, p=normalized_sample_weight
)
```

2. Use the old code path when `sample_weight=None`:

```python
if sample_weight is None:
    sample_indices = random_instance.randint(0, n_samples, n_samples_bootstrap)
else:
    normalized_sample_weight = sample_weight / np.sum(sample_weight)
    sample_indices = random_instance.choice(
        n_samples,
        n_samples_bootstrap,
        replace=True,
        p=normalized_sample_weight,
    )
```

The two options use different rng functions: `choice` with uniform `p` for 1. and `randint` for 2. They are statistically equivalent, but they don't give the same deterministic output for a given random state.

The benefit of 2. is that the code is backward compatible when `sample_weight=None`: a fit without `sample_weight` reproduces the same fit as `main` for a given `random_state`.

The benefit of 1. is that `sample_weight=None` and `sample_weight=np.ones(n_samples)` give the same fit for a given `random_state`.

Here we chose 2.
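To make the determinism caveat concrete, here is a small standalone sketch (plain NumPy, outside scikit-learn) comparing the two draws under the same seed. Both are uniform over the sample indices, but they consume the generator's stream differently, so the realizations generally differ:

```python
import numpy as np

n_samples, n_samples_bootstrap = 8, 8

# Option 1: uniform probabilities through choice(..., p=...)
rng1 = np.random.RandomState(0)
p = np.ones(n_samples) / n_samples
idx_choice = rng1.choice(n_samples, n_samples_bootstrap, replace=True, p=p)

# Option 2: the legacy randint code path
rng2 = np.random.RandomState(0)
idx_randint = rng2.randint(0, n_samples, n_samples_bootstrap)

# Both draws are uniform over [0, n_samples), but for a fixed seed the
# two generators typically produce different index sequences.
print(idx_choice)
print(idx_randint)
```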
I haven't had the time to finish my review today, but this looks great: I tried running the notebook of github.com/snath-xoc/sample-weight-audit-nondet/ against this branch and I confirm the statistical tests pass for `RandomForestClassifier`/`Regressor` and `ExtraTreesClassifier`/`Regressor`.
```diff
@@ -324,13 +325,13 @@ def test_parallel_fit(global_random_seed):
 def test_sample_weight(global_random_seed):
     """Tests sample_weight parameter of VotingClassifier"""
     clf1 = LogisticRegression(random_state=global_random_seed)
-    clf2 = RandomForestClassifier(n_estimators=10, random_state=global_random_seed)
+    clf2 = GradientBoostingClassifier(n_estimators=10, random_state=global_random_seed)
```
Why this change?
```diff
@@ -1167,7 +1167,7 @@ def test_class_weights(name):
     # Iris is balanced, so no effect expected for using 'balanced' weights
     clf1 = ForestClassifier(random_state=0)
-    clf1.fit(iris.data, iris.target)
+    clf1.fit(iris.data, iris.target, sample_weight=np.ones_like(iris.target))
```
Please add an inline comment to explain why leaving `sample_weight=None` does not work for this test.
Part of #16298. Similar to #31414 (Bagging estimators) but for Forest estimators.
What does this implement/fix? Explain your changes.
When subsampling is activated (`bootstrap=True`), `sample_weight` is now used as probabilities to draw the indices. Forest estimators then pass the statistical repeated/weighted equivalence test.

Comments

This PR does not fix Forest estimators when `bootstrap=False` (no subsampling): `sample_weight` is still passed to the decision trees. Forest estimators then fail the statistical repeated/weighted equivalence test because the individual trees also fail this test (probably because of tied splits in decision trees #23728).
TODO
- `sample_weight=None` case
- `max_samples` as done in FIX Draw indices using sample_weight in Bagging #31414
- `class_weight = "balanced"` as done in Fix linear svc handling sample weights under class_weight="balanced" #30057
- `class_weight = "balanced_subsample"`