
Fix sample weight handling in SAG(A) #31675


Open
wants to merge 19 commits into main

Conversation

snath-xoc
Contributor

Reference Issues/PRs

Fixes the issue with sample weight handling within SAG(A): #31536.

What does this implement/fix? Explain your changes.

SAG(A) now accounts for sample weights by:

  • Applying a weighted sampling of random indices when sample weights are not None. This should be equivalent to sampling uniformly from a dataset where each row is repeated according to its weight (i.e., frequency-based weighting); see the sketch after this list.
  • Calculating the step size by accounting for sample weights.
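A minimal sketch of the sampling idea (illustrative names, not the actual Cython loop): drawing an index with probability proportional to its sample weight should match, in distribution, drawing uniformly from the dataset with each row repeated sample_weight[i] times.

import numpy as np

rng = np.random.RandomState(0)
sample_weight = np.array([1, 3, 0, 2])  # integer weights, so the repetition analogy is exact
n_samples = len(sample_weight)

# Weighted sampling: pick index i with probability proportional to its weight.
p = sample_weight / sample_weight.sum()
idx_weighted = rng.choice(n_samples, size=100_000, p=p)

# Uniform sampling from the repeated dataset, mapped back to original indices.
repeated = np.repeat(np.arange(n_samples), sample_weight)
idx_repeated = repeated[rng.randint(0, len(repeated), size=100_000)]

# Both empirical index frequencies should approach the normalized weights.
print(np.bincount(idx_weighted, minlength=n_samples) / 100_000)
print(np.bincount(idx_repeated, minlength=n_samples) / 100_000)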

TO DO:

  • Apply the same sample weight corrections in get_auto_step_size


github-actions bot commented Jun 28, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 8615425.

@snath-xoc
Contributor Author

Running the test as follows

import numpy as np
from scipy.stats import kstest
from sklearn.datasets import make_classification
from sklearn.linear_model.tests.test_sag import sag, squared_dloss, get_step_size

alpha = 1
n_features = 6

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=1000, random_state=77, n_features=n_features)
weights = rng.randint(0, 5, size=X.shape[0])

# Repeated dataset: each sample is duplicated according to its integer weight.
X_repeated = np.repeat(X, weights, axis=0)
y_repeated = np.repeat(y, weights, axis=0)

weights_w_all = np.zeros([n_features, 50])
weights_r_all = np.zeros([n_features, 50])

step_size_w = get_step_size(X, alpha, True, sample_weight=weights)
step_size_r = get_step_size(X_repeated, alpha, True)

for random_state in np.arange(50):
    weights_w, int_w = sag(
        X, y, step_size=step_size_w, sample_weight=weights,
        alpha=alpha, dloss=squared_dloss, random_state=random_state,
    )
    weights_w_all[:, random_state] = weights_w

    weights_r, int_r = sag(
        X_repeated, y_repeated, step_size=step_size_r,
        alpha=alpha, dloss=squared_dloss, random_state=random_state,
    )
    weights_r_all[:, random_state] = weights_r

# Compare the distributions of the first coefficient across the 50 seeds.
print(kstest(weights_r_all[0], weights_w_all[0]))

Now gives the result

KstestResult(statistic=np.float64(0.2), pvalue=np.float64(0.2719135601522248), statistic_location=np.float64(-0.004336382594871251), statistic_sign=np.int8(1))


@ogrisel
Member

ogrisel commented Jul 1, 2025

@snath-xoc at the meeting, you mentioned that the statistical test would not pass for some datasets. Could you please post an example and add a TODO item to the PR to investigate this problem?

Also, whenever the penalty is non-zero, the problem is strictly convex and the solution should be unique. So it should be possible to write deterministic tests (with various random seed values) instead of statistical tests to:

  • craft a minimal reproducer that does not involve running a KS-test;
  • check whether the proposed fix resolves the bug for all possible seeds in the [0, 99] range, for instance.

This might require setting tol to a low enough value, max_iter to a large enough value, and checking that no ConvergenceWarning is raised.
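For example, a deterministic estimator-level check along those lines could look like the following sketch (the solver settings, tolerances and seed range are illustrative, not part of the PR):

import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=6, random_state=77)
sw = np.random.RandomState(0).randint(0, 5, size=X.shape[0])
X_rep, y_rep = np.repeat(X, sw, axis=0), np.repeat(y, sw, axis=0)

for seed in range(100):
    with warnings.catch_warnings():
        # Turn non-convergence into a hard failure instead of a silent warning.
        warnings.simplefilter("error", ConvergenceWarning)
        clf_w = LogisticRegression(
            solver="saga", tol=1e-12, max_iter=100_000, random_state=seed
        ).fit(X, y, sample_weight=sw)
        clf_r = LogisticRegression(
            solver="saga", tol=1e-12, max_iter=100_000, random_state=seed
        ).fit(X_rep, y_rep)
    # With a non-zero penalty the solution is unique, so the coefficients
    # should agree for every seed once both runs have converged.
    np.testing.assert_allclose(clf_w.coef_, clf_r.coef_, rtol=1e-6)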

Comment on lines -99 to 106
if sample_weight is not None:
    gradient *= sample_weight[idx]

update = entry * gradient + alpha * weights
Contributor


You should keep the gradient multiplied by the sample_weight even if you change the sampling of idx, otherwise you are not computing the correct loss gradient.

Contributor

antoinebaker commented Jul 2, 2025


The loss with $\ell_2$ regularizer to minimize w.r.t. $w$ = weights, the coefficients of the linear model (skipping the intercept for now), is:
$loss(w) = \frac{1}{S} \sum_i s_i l(y_i, w^Tx_i) + \frac{\alpha}{2} \Vert w \Vert^2$
can be put in several ways as the finite sum minimized by sag(a):
$g(w) = \frac{1}{n} \sum_i f_i(w) \propto loss(w)$

Let $l_i(w) = l(y_i, w^Tx_i)$ be the data $i$ loss term. We can choose:

  1. $f_i(w) = s_i l_i(w) + s_i \frac{\alpha}{2} \Vert w \Vert^2 $. The gradient updated and stored in memory is then update = sample_weight[idx] * (entry * gradient + alpha * weights)
  2. $f_i(w) = s_i l_i(w) + \frac{S}{n} \frac{\alpha}{2} \Vert w \Vert^2$. This gives the code in main where only the data $i$ loss term is weighted.
  3. $f_i(w) = s_i l_i(w)$. The gradient updated and stored in memory is then update = sample_weight[idx] * entry * gradient, ie only the data $i$ loss term. The regularizer is then taken into account when doing the gradient descent step on weights, for example weights -= alpha*step_size*weights.

I feel 2. (the current code) is a bit weird; for instance, there is maybe a weird scaling of alpha as $\frac{S}{n}\alpha$. However, all three options lead to the same updated weights at the end, so it shouldn't matter too much.
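To make the three decompositions concrete, here is a sketch of the per-sample gradient that would be stored in memory under each option (names are illustrative and follow the notation above):

import numpy as np

def stored_gradient(option, x_i, dloss_i, w, alpha, s_i, S, n):
    """Gradient of f_i(w) memorized by SAG(A), for the three options above."""
    if option == 1:
        # both the data loss and the regularizer are scaled by s_i
        return s_i * (dloss_i * x_i + alpha * w)
    if option == 2:
        # only the data loss term is weighted; the regularizer is split evenly
        return s_i * dloss_i * x_i + (S / n) * alpha * w
    if option == 3:
        # only the weighted data loss is stored; the regularizer is applied
        # separately in the descent step, e.g. w -= step_size * alpha * w
        return s_i * dloss_i * x_i

# e.g. one sample with weight s_i = 3 out of n = 4 samples, total weight S = 10
x_i, w = np.array([1.0, 2.0]), np.zeros(2)
print(stored_gradient(1, x_i, dloss_i=0.5, w=w, alpha=1.0, s_i=3, S=10, n=4))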

Contributor


I think I have a slight preference for option 1, with the meaning of adding a term proportional to $s_i$. In that case it would make sense to redefine n_seen as the weighted sum of seen indices, which will converge to $S$ instead of $n$.

Comment on lines +247 to +252
def get_step_size(X, alpha, fit_intercept, classification=True, sample_weight=None):
    if sample_weight is None:
        X_prod = np.sum(X * X, axis=1)
    else:
        X = X[sample_weight != 0]
        X_prod = np.sum(X * X, axis=1)
Contributor


I don't think removing the zero sample_weight is enough.

If I understood the SAG(A) paper correctly, the recommended step size is step_size = 1 / L where $L = \max_i L_i$ and $L_i$ is the Lipschitz smoothness constant of $f_i(w)$.

It will depend on how we decompose the weighted sum into individual $f_i(w)$ terms. With the options above:

  1. $L_i = s_i \kappa \Vert x_i \Vert^2 + s_i \alpha$.
  2. $L_i = s_i \kappa \Vert x_i \Vert^2 + \frac{S}{n} \alpha$.

where $\kappa = 1$ for regression and $\kappa = 1/4$ for classification.
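For illustration, the step size $1 / \max_i L_i$ for the two options could be computed along these lines (a hypothetical helper; the + int(fit_intercept) term mimics the intercept handling in the existing test helper):

import numpy as np

def step_size_candidates(X, sample_weight, alpha, classification=True, fit_intercept=True):
    """Step sizes 1 / max_i L_i for options 1 and 2 above (illustrative sketch)."""
    kappa = 0.25 if classification else 1.0
    S, n = sample_weight.sum(), X.shape[0]
    squared_norms = np.sum(X * X, axis=1) + int(fit_intercept)
    L1 = sample_weight * (kappa * squared_norms + alpha)          # option 1: s_i-weighted regularizer
    L2 = sample_weight * kappa * squared_norms + (S / n) * alpha  # option 2: evenly split regularizer
    return 1.0 / L1.max(), 1.0 / L2.max()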

Contributor


In the end, it depends on how we allocate the regularizer:

  • evenly wrt the indices, each $i$ has the same regularizer $\frac{S}{n} \frac{\alpha}{2}\Vert w \Vert^2$
  • evenly wrt the sample_weight, each $i$ has a regularizer $s_i \frac{\alpha}{2}\Vert w \Vert^2$ proportional to its sample_weight.

Maybe we should try both strategies and see which one leads to better convergence?

Contributor


I think we first need to figure out the step size computation before trying non-uniform sampling, for example by following section 5.5, "Effect of non-uniform sampling", of the SAG paper.
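If I remember that section correctly, the rough idea is to sample an index more often when its Lipschitz constant $L_i$ is larger, possibly mixed with uniform sampling, and to adapt the step size accordingly. A very rough sketch of such sampling probabilities (illustrative only, not the paper's exact scheme), using the option 1 constants from above:

import numpy as np

def lipschitz_sampling_probs(X, sample_weight, alpha, classification=True, mix=0.5):
    """Mix uniform sampling with sampling proportional to the per-sample
    Lipschitz constants L_i (option 1 decomposition). Illustrative only."""
    kappa = 0.25 if classification else 1.0
    L = sample_weight * (kappa * np.sum(X * X, axis=1) + alpha)
    n = X.shape[0]
    return mix * np.full(n, 1.0 / n) + (1.0 - mix) * L / L.sum()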
