Skip to content

Java: Add query to detect special characters in string literals #19875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

tamasvajk
Copy link
Contributor

@tamasvajk tamasvajk commented Jun 25, 2025

This pull request introduces a new query to detect non-explicit control and whitespace characters in Java literals, improving code readability and reducing potential bugs caused by invisible or hard-to-recognize characters.

Manually testing autofix works, the special characters are replaced with \u0....

@@ -0,0 +1 @@
Violations of Best Practice/SpecialCharactersInLiterals/NonExplicitControlAndWhitespaceCharsInLiterals.ql

Check warning

Code scanning / CodeQL

Query test without inline test expectations Warning test

Query test does not use inline test expectations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider an inline expectations test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I quickly turned it into an inline expectations test, but it failed, I think due to the fact that the strings contain multiple special characters. So then I reverted it, and kept it as it is.

@tamasvajk tamasvajk force-pushed the quality/spec_chars branch 3 times, most recently from e1b966e to 9db9200 Compare June 25, 2025 12:15
@tamasvajk tamasvajk added the no-change-note-required This PR does not need a change note label Jun 25, 2025
@tamasvajk tamasvajk force-pushed the quality/spec_chars branch from 9db9200 to eeaaa87 Compare June 27, 2025 08:32
@tamasvajk tamasvajk force-pushed the quality/spec_chars branch from eeaaa87 to 834cb1b Compare July 1, 2025 09:38
@tamasvajk tamasvajk marked this pull request as ready for review July 1, 2025 12:07
@tamasvajk tamasvajk requested a review from a team as a code owner July 1, 2025 12:07
Comment on lines 10 to 12
* correctness
* maintainability
* readability
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking: you believe this matches both the correctness and readability categories?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think readability is definitely okay. correctness might/could be challenged. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can belong to both. But between maintainability and reliability, you think it's better to choose maintainability, because readability is definitely an issue whereas correctness is only occasionally an issue?

Copy link
Contributor Author

@tamasvajk tamasvajk Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't given it this much thought. I simply thought maintainability goes together with readability. I think we could remove the correctness tag, and then it fully matches the guidelines, of having one top level tag and one subcategory from that tag.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, I think it's fine as it is actually.

Copy link
Contributor

@owen-mc owen-mc Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I just read the description of this PR and it says that we shouldn't use subcategories from different top-level categories. I will discuss the issue on that PR. Maybe wait until that discussion is complete to see if we should remove correctness.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to remove correctness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already removed it. I think it's okay if we keep the tags to a minimum for the time being.

@@ -0,0 +1 @@
Violations of Best Practice/SpecialCharactersInLiterals/NonExplicitControlAndWhitespaceCharsInLiterals.ql
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider an inline expectations test?

@tamasvajk tamasvajk added the ready-for-doc-review This PR requires and is ready for review from the GitHub docs team. label Jul 2, 2025
Comment on lines 107 to 111
- [Unicode Control Characters](https://www.unicode.org/charts/PDF/U0000.pdf).
- [Unicode C0 control codes](https://en.wikipedia.org/wiki/C0_and_C1_control_codes).
- [Unicode characters with property "WSpace=yes" or "White_Space=yes"](https://en.wikipedia.org/wiki/Unicode_character_property#Whitespace).
- [Java String Literals](https://docs.oracle.com/javase/tutorial/java/data/characters.html).
- [Java Class Charset](https://docs.oracle.com/javase/8/docs/api///?java/nio/charset/Charset.html).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit pedantic, but these don't follow the style guide for references. Examples:

- Java API Specification: [Pattern.MULTILINE](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html#MULTILINE)
- Wikipedia: [Template method pattern](https://en.wikipedia.org/wiki/Template_method_pattern).

Copy link
Contributor

@subatoi subatoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good—only trivial comments 👍

String foo3Spacebar = "foo bar"; // COMPLIANT
```

## Implementation Notes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## Implementation Notes
## Implementation notes


## Recommendation

To avoid issues, use the encoded versions of control characters (e.g., ASCII `\n`, `\t`, or Unicode `U+000D`, `U+0009`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
To avoid issues, use the encoded versions of control characters (e.g., ASCII `\n`, `\t`, or Unicode `U+000D`, `U+0009`).
To avoid issues, use the encoded versions of control characters (e.g. ASCII `\n`, `\t`, or Unicode `U+000D`, `U+0009`).

@@ -0,0 +1,111 @@
## Overview

This query detects non-explicit control and whitespace characters in java literals.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This query detects non-explicit control and whitespace characters in java literals.
This query detects non-explicit control and whitespace characters in Java literals.

## Recommendation

To avoid issues, use the encoded versions of control characters (e.g., ASCII `\n`, `\t`, or Unicode `U+000D`, `U+0009`).
This makes the literals (e.g., string literals) more readable, and also helps to make the surrounding code less error-prone and more maintainable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This makes the literals (e.g., string literals) more readable, and also helps to make the surrounding code less error-prone and more maintainable.
This makes the literals (e.g. string literals) more readable, and also helps to make the surrounding code less error-prone and more maintainable.


## Example

The following examples illustrate `NON_COMPLIANT` and `COMPLIANT` code:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are "NON_COMPLIANT" and "COMPLIANT" styled in monospace here (and in the headers) for a specific reason? Otherwise I'd just style them normally, i.e. "Non compliant" and "Compliant". Or use the standard query language format of "GOOD" and "BAD". Otherwise, just ignore this 👍


## Implementation Notes

This query detects java literals that contain reserved control characters and/or non-printable whitespace characters, such as:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This query detects java literals that contain reserved control characters and/or non-printable whitespace characters, such as:
This query detects Java literals that contain reserved control characters and/or non-printable whitespace characters, such as:

| `\u001F` | 31 | unit separator |
| `\u007F` | 127 | delete |

- Zero-width Unicode characters (e.g., zero-width space, zero-width joiner), such as:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Zero-width Unicode characters (e.g., zero-width space, zero-width joiner), such as:
- Zero-width Unicode characters (e.g. zero-width space, zero-width joiner), such as:

This query detects java literals that contain reserved control characters and/or non-printable whitespace characters, such as:

- Decimal and hexidecimal representations of ASCII control characters (code points 0-8, 11, 14-31, and 127).
- Invisible characters (e.g., zero-width space, zero-width joiner).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Invisible characters (e.g., zero-width space, zero-width joiner).
- Invisible characters (e.g. zero-width space, zero-width joiner).

The following list outlines the _**explicit exclusions from query scope**_:

- any number of simple space characters (`U+0020`, ASCII 32).
- an escape character sequence (e.g., `\t`), or the Unicode equivalent (e.g., `\u0009`), for printable whitespace characters:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- an escape character sequence (e.g., `\t`), or the Unicode equivalent (e.g., `\u0009`), for printable whitespace characters:
- an escape character sequence (e.g. `\t`), or the Unicode equivalent (e.g. `\u0009`), for printable whitespace characters:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Java no-change-note-required This PR does not need a change note ready-for-doc-review This PR requires and is ready for review from the GitHub docs team.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants