DOC improved plot_semi_supervised_newsgroups.py example #31104

Open · wants to merge 3 commits into base: main
Conversation

elhambbi (Contributor)

Reference Issues/PRs

Towards #30621 and with reference to PR #30882, the code for plot_semi_supervised_newsgroups.py is improved.

What does this implement/fix? Explain your changes.

  • The example and the methods used are described in more detail
  • Old code is modernized, e.g. f-strings and print statements
  • A bar plot is added to compare the F1 scores of the different approaches

Any other comments?

LabelSpreading, as a semi-supervised method, does not outperform the fully supervised models in terms of F1 score. Should we keep it, or is SelfTraining enough to demonstrate the advantage of semi-supervised methods when labeled data is limited?


github-actions bot commented Mar 29, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 59cb0eb. Link to the linter CI: here

@elhambbi (Contributor Author)

Hi @StefanieSenger. Could you please review the modifications I've made to this example and let me know if anything needs to change?
Thank you.

@StefanieSenger (Contributor) left a comment

Thanks a lot for your work, @elhambbi. It's going to upgrade the example a lot!

I especially like the part at the beginning that explains the intent of the example and summarises the approaches being compared.

I have a few little comments, and it would also be nice to make the code more notebook-like in style and to introduce some sub-sections.

Would you mind doing that?

Comment on lines 10 to 13
1. Supervised learning using 100% of labeled data (baseline)
- Uses SGDClassifier with TF-IDF features
- Represents the best possible performance with full supervision
Contributor

The bullet points don't get rendered as intended. Can you please fix this?
You can see the rendering by building the documentation locally or by clicking "check the rendered docs" in the CI.

ls_pipeline, X_train, y_train_semi, X_test, y_test
)

# Create the plot
Contributor

I think that instead of having comments like # Create the plot, # Define colors for each bar, # Customize plot, or # Add value labels on bars, it would be nice to add 1-2 sentences on top of this part explaining what will happen in the next section, which would adhere more to the notebook style that we strive for with the examples. Here is a good example to look at if you need some inspiration, @elhambbi: #26956
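As a sketch of what the suggestion could look like in practice: the scores, labels, and colors below are hypothetical placeholders, and the Agg backend is only used here to keep the snippet headless (the real example would not need it).

```python
# %%
# Compare the approaches
# ----------------------
# Finally, we summarise the micro-averaged F1 score of each approach in a bar
# chart, so that the effect of the amount of labeled data is visible at a glance.
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, for headless environments only
import matplotlib.pyplot as plt

# Hypothetical scores for illustration only.
scores = {
    "Supervised (100%)": 0.90,
    "Supervised (20%)": 0.77,
    "SelfTraining (20%)": 0.84,
}

fig, ax = plt.subplots()
bars = ax.bar(
    list(scores), list(scores.values()), color=["tab:blue", "tab:orange", "tab:green"]
)
ax.bar_label(bars, fmt="%.2f")  # value labels on top of the bars
ax.set_ylabel("Micro-averaged F1 score")
ax.set_ylim(0, 1)
fig.savefig("f1_comparison.png")
```

The prose comment at the top of the `# %%` cell replaces the scattered one-line comments, which is the notebook style being asked for.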

@StefanieSenger (Contributor)

Hi @elhambbi, did you have some time to look into this?

@elhambbi (Contributor Author)

elhambbi commented May 6, 2025

Hi @StefanieSenger. I apologize, I've been too busy lately. I'll work on it over the weekend

@StefanieSenger (Contributor)

Hi @elhambbi, no rush. I just wanted to know whether this PR is still active. Take your time.

@elhambbi (Contributor Author)

Hi @StefanieSenger. Sorry, I've been quite busy lately. I have modified the code. Please let me know if it needs further improvement. Thank you

@StefanieSenger StefanieSenger self-requested a review May 27, 2025 08:38
@StefanieSenger (Contributor) left a comment

Thank you @elhambbi!

I have only one little suggestion on the notebook style and some nitpicks. Overall it looks pretty good to me.

Feel free to make more notebook-style changes, but from my side, with these two blocks in place, I'm pretty happy and would then forward this to a maintainer to have a look.

The example uses the 20 newsgroups dataset, focusing on five categories.
The results demonstrate how semi-supervised methods can achieve better
performance than supervised learning with limited labeled data by
effectively utilizing unlabeled samples.

Contributor

Suggested change

(Sorry, this is really a nit.) We don't need this line.

@elhambbi elhambbi reopened this Jun 1, 2025
@elhambbi (Contributor Author)

elhambbi commented Jun 1, 2025

Hi @StefanieSenger . I made the changes, and the file is updated now. I closed the PR by mistake and then reopened it. The commit history is gone but the code is correct with all the changes we discussed. Thank you

@ogrisel (Member)

ogrisel commented Jun 2, 2025

@elhambbi thanks for the PR. Could you please push a fix to add the missing newline, as reported by the linter?

https://app.circleci.com/pipelines/github/scikit-learn/scikit-learn/67920/workflows/b45c787a-c741-4361-9085-15c44eb03ecd/jobs/306979

This would help the Continuous Integration proceed with building the HTML preview of the example edited by this PR, making it easier to review.


You can adjust the number of categories by giving their names to the dataset
loader or setting them to `None` to get all 20 of them.
1. Supervised learning using 100% of labeled data (baseline)
Member

I wouldn't call this a baseline: if we do not have access to 100% labeled data, it will probably not be possible to reach the performance of this model. I would rather call it a "best-case scenario".

- Uses SGDClassifier with TF-IDF features
- Represents the best possible performance with full supervision

2. Supervised learning using only 20% of labeled data
Member

This one, on the other hand, could be called a "baseline" to compare semi-supervised learning methods to.


# select a mask of 20% of the train dataset
y_mask = np.random.rand(len(y_train)) < 0.2
# Evaluate supervised model with 100% of training data
Member

We should probably split each call to eval_and_get_f1 in its own cell by using # %% delimiters. Then the comments such as # Evaluate supervised model with 100% of training data should be slightly rephrased as grammatically correct sentences or paragraphs instead of just short inline code comments as @StefanieSenger suggested above.
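To illustrate the suggestion, a `# %%`-delimited cell with the rephrased comment could look like the sketch below. The stand-in labels are made up for illustration; the real example would use the newsgroups training split.

```python
# %%
# Supervised learning on 20% of the labeled data
# ----------------------------------------------
# To simulate a setting where labels are scarce, we randomly select about 20%
# of the training samples and train the supervised pipeline only on those.
import numpy as np

rng = np.random.RandomState(42)
y_train = rng.randint(0, 5, size=1000)  # stand-in labels for illustration

# Select a boolean mask covering ~20% of the train dataset.
y_mask = rng.rand(len(y_train)) < 0.2
print(f"{y_mask.mean():.1%} of the training samples keep their label")
```

sphinx-gallery renders each `# %%` cell as its own notebook section, with the comment text becoming prose between the code blocks.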

@StefanieSenger (Contributor)

Thanks for your work, @elhambbi! I see this really taking shape and that a core developer is now reviewing this (not just me as a team member) shows the progress of this PR. (Please: for expectation management, what I wrote in the What comes next? section of #30621 applies here: a PR like that is expected to be thoroughly reviewed and the result will be something a new contributor can be very proud of once it is finished.)

I will review more later. Just a little comment on the commit history: it has disappeared because of force-pushing, which we try to avoid at all costs. PRs are easier to review when we can see the commit history, and as someone coming back to this PR regularly, I can rely on looking only at the newest changes instead of going through everything from the beginning. (I only have so much mental capacity, and I jump a lot between very different topics and PRs.)

@elhambbi (Contributor Author)

Hi @StefanieSenger. I improved the code based on what @ogrisel suggested.
I also added the link to this example in doc/modules/semi_supervised.rst as @adrinjalali mentioned in #30882 and closed it.
Please let me know your feedback. I'll improve it further, if needed. Thank you.

@StefanieSenger (Contributor) left a comment

Hi @elhambbi,

thanks for your further work! 😃

In the file, I have left a few minor comments. It would be a nice service to users to turn some of the mentions of classes or functions into cross-references. I left a few examples in the file; please add some more where you find them suitable.

And a few would-like-to-haves that I think would be great, but are not required:

  • Add a short explanation why we use the micro-averaged f1_score here.
  • Link this example also in the docstring of f1_score, right in the average param, where the micro-argument is explained.
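On the first point, a one-line illustration of why `average="micro"` is a natural choice here: for single-label multiclass problems, micro-averaging counts true/false positives and negatives globally across classes, so the micro-averaged F1 score reduces to plain accuracy.

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 0]

# Globally: TP = 3, FP = 2, FN = 2, so precision = recall = F1 = 3/5,
# which is exactly the accuracy (3 of 5 predictions correct).
print(f1_score(y_true, y_pred, average="micro"))  # 0.6
```

This makes the score easy to interpret in the example, at the cost of not reflecting per-class imbalance the way `average="macro"` would.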


You can adjust the number of categories by giving their names to the dataset
loader or setting them to `None` to get all 20 of them.
1. Supervised learning using 100% of labeled data (best-case scenario)
@StefanieSenger (Contributor) commented Jun 24, 2025

Suggested change
1. Supervised learning using 100% of labeled data (best-case scenario)
1. Supervised learning using 100% of data (best-case scenario)

Hm, now that I think about it, this could be misunderstood as if the data were filtered into labeled and unlabeled parts, and then we apply supervised learning only on 100% of the labeled part. It can be determined by reading the code, but I think it is better to be very clear here. Do you see the same ambiguity, or is it actually all clear? I am not so sure here.

print("Supervised SGDClassifier on 100% of the data:")
eval_and_print_metrics(pipeline, X_train, y_train, X_test, y_test)
# %%
# 1. Evaluate a supervised SGDClassifier trained on 100% of the labeled data.
Contributor

Suggested change
# 1. Evaluate a supervised SGDClassifier trained on 100% of the labeled data.
# 1. Evaluate a supervised SGDClassifier trained on 100% of the data.

Same comment as above.

- Uses SGDClassifier with TF-IDF features
- Represents the best possible performance with full supervision

2. Supervised learning using only 20% of labeled data (baseline)
Contributor

Suggested change
2. Supervised learning using only 20% of labeled data (baseline)
2. Supervised learning using only 20% of data (baseline)

Same comment as above.

Comment on lines +125 to +127
# 2. Evaluate a supervised SGDClassifier trained on only 20% of the labeled data.
# This serves as a baseline to illustrate the performance drop caused by limited
# labeled data.
Contributor

Suggested change
# 2. Evaluate a supervised SGDClassifier trained on only 20% of the labeled data.
# This serves as a baseline to illustrate the performance drop caused by limited
# labeled data.
# 2. Evaluate a supervised SGDClassifier trained on only 20% of the data.
# This serves as a baseline to illustrate the performance drop caused by limiting
# the training samples.

Same comment as above.


- Uses SGDClassifier with TF-IDF features
Contributor

Suggested change
- Uses SGDClassifier with TF-IDF features
- Uses :class:`~sklearn.linear_model.SGDClassifier` with :class:`TF-IDF <sklearn.feature_extraction.text.TfidfTransformer>` features

This adds Sphinx cross-references linking to other parts of our API docs. I'm actually not sure whether the second one renders correctly (you would have to check).

For a function, use the :func: role at the beginning instead.

Here is a link to the Sphinx docs on using cross-references if you want to know more, but in general it is enough to search for similar examples in scikit-learn and follow the same pattern.

3 participants