Comparing changes

Since we usually look for a comment char while parsing line-oriented files, setting core.commentChar to a single newline can confuse our code quite a bit. For example, using it with "git commit" causes us to fail to recognize any of the template as comments, including it in the config message. Which kind of makes sense, since the template content is on its own line (so no line can "start" with a newline). In other spots I would not be surprised if you can create more mischief (e.g., violating loop assumptions) but I didn't dig into it. Since comment characters are a local preference, to some degree this is a case of "if it hurts, don't do it". But given that this would be a silly and pointless thing to do, and that it makes it harder to reason about code parsing comment lines, let's just forbid it. There are other cases that are perhaps questionable (e.g., setting the comment char to a single space), but they seem to behave reasonably (at least a simple "git commit" will correctly identify and strip the template lines). So I haven't worried about going on a hunt for every stupid thing a user might do to themselves, and just focused on the most confusing case. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

In strbuf_add_commented_lines(), we prepare two strings with potential prefixes: one with just the comment char, and one with an additional space. In the add_lines() helper, we use the one without the extra space for blank lines or lines starting with a tab. While passing in two separate prefixes to the helper is very flexible, it's more flexibility than we actually use (or are likely to use, since the rules inside add_lines() only make sense if "prefix2" is a variant of "prefix1" without the extra space). And setting up the two strings makes refactoring in strbuf_add_commented_lines() awkward. Instead, let's pass in a single string, and just let add_lines() add the extra space to the result as appropriate. We do still need to pass in a flag to trigger this behavior. The helper is shared by strbuf_add_lines(), which passes in a NULL "prefix2" to inhibit this extra handling. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

In strbuf_add_commented_lines(), we have to convert the single-byte comment_line_char into a string to pass to add_lines(). We cache the created string using a static-local variable. But this makes the function non-reentrant, and it's doubtful that this provides any real performance benefit given that we know the string always contains a single character. So let's just create it from scratch each time, and to give the compiler the maximal opportunity to make it fast we'll ditch the over-complicated xsnprintf() and just assign directly into the array. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

When core.commentChar is set to "auto", we check a set of candidate characters against the proposed buffer to see which if any can be used without ambiguity. But before we do that, we optimize for the common case that the default "#" is fine by just seeing if it is present in the buffer at all. The way we do this is a bit subtle, though: we assign the candidate character to comment_line_char preemptively, then check if it works, and return if it does. The subtle part is that sometimes setting comment_line_char is important (after we return, the important outcome is the fact that we have set the variable) and sometimes it is useless (if our optimization fails, we go on to do the more careful checks and eventually assign something else instead). To make it more clear what is happening (and to make further refactoring of comment_line_char easier), let's check our candidate character directly, and then assign as part of returning if it worked out. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Several comment-related strbuf functions take a comment_line_char parameter. There's also a global comment_line_char variable, which is closely related (most callers pass it in as this parameter). Let's avoid shadowing the global name. This makes it more obvious that we're not using the global value, and it will be especially helpful as we refactor the global in future patches (in particular, any macro trickery wouldn't work because the preprocessor doesn't respect scope). We'll use "comment_prefix". That should be descriptive enough, and as a bonus is more neutral with respect to the "char" type (since we'll eventually swap it out for a string). Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

We'd like to eventually support multi-byte comment prefixes, but the comment_line_char variable is referenced in many spots, making the transition difficult. Let's start by storing the character in a NUL-terminated string. That will let us switch code over incrementally to the string format, and we can easily support the existing code with a macro wrapper (since we'll continue to allow only a single-byte prefix, this will behave identically). Once all references to the "char" variable have been converted, we can drop it and enable longer strings. We'll still have to touch all of the spots that create or set the variable in this patch, but there are only a few (reading the config, and the "auto" character selector). Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

As part of our transition to multi-byte comment characters, let's take a NUL-terminated string pointer for strbuf_stripspace(), rather than a single character. We can continue to support its feature of ignoring comments by accepting a NULL pointer (as opposed to the current behavior of a NUL byte). All of the callers have to be adjusted, but they can all just pass comment_line_str (or NULL). Inside the function we detect comments by comparing the first byte of a line to the comment character. We'll adjust that to use starts_with(), which will match multiple bytes (though for now, of course, we still only allow a single byte, so it's academic). Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

As part of our transition to multi-byte comment characters, let's take a NUL-terminated string pointer for strbuf_commented_addf() rather than a single character. All of the callers have to be adjusted, but they can just pass comment_line_str rather than comment_line_char. Note that we rely on strbuf_add_commented_lines() under the hood, so we'll cheat a bit to squeeze our string into a single character (for now the two are equivalent, and we'll address this TODO in the next patch). Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

As part of our transition to multi-byte comment characters, let's take a NUL-terminated string pointer for strbuf_add_commented_lines() rather than a single character. All of the callers have to be adjusted; most can just pass comment_line_str rather than comment_line_char. And now our "cheat" in strbuf_commented_addf() can go away, as we can take the full string from it. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

As part of our transition to multi-byte comment characters, we should use the string variable rather than the historical character variable. All of the sites adjusted here are just swapping out "%c" for "%s" in format strings, or strbuf_addch() for strbuf_addstr(). The type system and printf-attribute give the compiler enough information to make sure our formats and variable changes all match (especially important for cases where the format string is defined far away from its use, like prepare_to_commit() in commit.c). Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Several parts of the code need to identify lines that begin with the comment character, and do so with a simple byte equality check. As part of the transition to handling multi-byte characters, we need to match all of the bytes. For cases where we are looking in a NUL-terminated string, we can just use starts_with(), which checks all of the characters in comment_line_str. Note that we can drop the "line.len" check in wt-status.c's read_rebase_todolist(). The starts_with() function handles the case of an empty haystack buffer (it will always return false for a non-empty prefix). Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

As with the previous patch, we need to swap out single-byte matching for something like starts_with() to match all bytes of a multi-byte comment character. But for cases where the buffer is not NUL-terminated (and we instead have an explicit size or end pointer), it's not safe to use starts_with(), as it might walk off the end of the buffer. Let's introduce a new starts_with_mem() that does the same thing but also accepts the length of the "haystack" str and makes sure not to walk past it. Note that in most cases the existing code did not need a length check at all, since it was written in a way that knew we had at least one byte available (and that was all we checked). So I had to read each one to find the appropriate bounds. The one exception is sequencer.c's add_commented_lines(), where we can actually get rid of the length check. Just like starts_with(), our starts_with_mem() handles an empty haystack variable by not matching (assuming a non-empty prefix). A few notes on the implementation of starts_with_mem(): - it would be equally correct to take an "end" pointer (and indeed, many of the callers have this and have to subtract to come up with the length). I think taking a ptr/size combo is a more usual interface for our codebase, though, and has the added benefit that the function signature makes it harder to mix up the three parameters. - we could obviously build starts_with() on top of this by passing strlen(str) as the length. But it's possible that starts_with() is a relatively hot code path, and it should not pay that penalty (it can generally return an answer proportional to the size of the prefix, not the whole string). - it naively feels like xstrncmpz() should be able to do the same thing, but that's not quite true. If you pass the length of the haystack buffer, then strncmp() finds that a shorter prefix string is "less than" than the haystack, even if the haystack starts with the prefix. If you pass the length of the prefix, then you risk reading past the end of the haystack if it is shorter than the prefix. So I think we really do need a new function. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

We already match multi-byte comment characters in parse_insn_line(), thanks to the previous commit, yielding a TODO_COMMENT entry. But in todo_list_to_strbuf(), we may call command_to_char() to convert that back into something we can output. We can't just return comment_line_char anymore, since it may require multiple bytes. Instead, we'll return "0" for this case, which is the same thing we'd return for a command which does not have a single-letter abbreviation (e.g., "revert" or "noop"). There is only a single caller of command_to_char(), and upon seeing "0" it falls back to outputting the full name via command_to_string(). So we can handle TODO_COMMENT there, returning the full string. Note that there are many other callers of command_to_string(), which will now behave differently if they pass TODO_COMMENT. But we would not expect that to happen; prior to this commit, the function just calls die() in this case. And looking at those callers, that makes sense; e.g., do_pick_commit() will only be called when servicing a pick command, and should never be called for a comment in the first place. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

In wt_longstatus_print_tracking() we may conditionally show a comment prefix based on the wt_status->display_comment_prefix flag. We handle that by creating a local "comment_line_string" that is either the empty string or the comment character followed by a space. For a single-byte comment, the maximum length of this string is 2 (plus a NUL byte). But to handle multi-byte comment characters, it can be arbitrarily large. One way to handle this is to just call xstrfmt("%s ", comment_line_str), and then free it when we're done. But we can simplify things further by just conditionally switching between our prefix string and an empty string when formatting. We couldn't just do that with the previous code, because the comment character was a single byte. There's no way to have a "%c" format switch between some character and "no character at all". Whereas with "%s" you can switch between some string and the empty string. So now that we have a comment string and not a comment char, we can just use it directly when formatting. Do note that we have to also conditionally add the trailing space at the same time. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

There is no longer any code which references the single-byte comment_line_char. Let's drop it, clearing the way for true multi-byte entries in comment_line_str. It's possible there are topics in flight that have added new references to comment_line_char. But we would prefer to fail compilation (and then fix it) upon merging with this, rather than have them quietly ignore the bytes after the first. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Now that all of the code handles multi-byte comment characters, it's safe to allow users to set them. There is one special case I kept: we still will not allow an empty string for the commentChar. While it might make sense in some contexts (e.g., output where you don't want any comment prefix), there are plenty where it will behave badly (e.g., all of our starts_with() checks will indicate that every line is a comment!). It might be reasonable to assign some meaningful semantics, but it would probably involve checking how each site behaves. In the interim let's forbid it and we can loosen things later. Likewise, the "commentChar cannot be a newline" rule is now extended to "it cannot contain a newline" (for the same reason: it can confuse our parsing loops). Since comment_line_str is used in many parts of the code, it's hard to cover all possibilities with tests. We can convert the existing double-semicolon prefix test to show that "git status" works. And we'll give it a more challenging case in t7507, where we confirm that git-commit strips out the commit template along with any --verbose text when reading the edited commit message back in. That covers the basics, though it's possible there could be issues in more exotic spots (e.g., the sequencer todo list uses its own code). Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

The core.commentChar code recently learned to accept more than a single ASCII character. But using it is annoying with multiple versions of Git, since older ones will reject it outright: $ git.v2.44.0 -c core.commentchar=foo stripspace -s error: core.commentChar should only be one ASCII character fatal: unable to parse 'core.commentchar' from command-line config Let's add an alias core.commentString. That's arguably a better name anyway, since we now can handle strings, and it makes it possible to have a config that works reasonably with both old and new versions of Git (see the example in the documentation). This is strictly an alias, so there's not much point in adding duplicate tests; I added a single one to t0030 that exercises the alias code. Note also that the error messages for invalid values will now show the variable the config parser handed us, and thus will be normalized to lowercase (rather than camelcase). A few tests in t0030 are adjusted to match. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Comparing changes

Open a pull request

Commits on Mar 12, 2024

Commits on Mar 27, 2024

This comparison is taking too long to generate.

Uh oh!