Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implode: Better invalid input validation and handling #2646

Merged
merged 1 commit into from
Jul 25, 2023

Conversation

wader
Copy link
Member

@wader wader commented Jul 2, 2023

Error on non-number codepoint, asserted before
Replace negative codepoint and surrogate range with unicode replacement character, asserted before

Fixes #1160

@wader
Copy link
Member Author

wader commented Jul 2, 2023

TODO:

  • How to handle surrogate range, error? it's a utf16 thing i think? have to read up on the details => replacement
  • Error instead of replacement character for negative? current we replace for > 0x10ffff so felt logical to do the same for < 0 and surrogates => replacement
  • Better/different error messages for type errors?

src/builtin.c Outdated Show resolved Hide resolved
src/builtin.c Outdated Show resolved Hide resolved
@pkoppstein
Copy link
Contributor

@wader - My (perhaps flawed) understanding has been that one of the reasons why jq's implode has relied on assertion checking is that it's easy to turn assertion checking off, so anyone for whom a need for speed trumps other concerns can achieve it.

In any case, I think it would helpful to track the impact on speed of updates to implode.

One reasonable approach would be to take a standard JSON test file (e.g. jeopardy.json - see https://github.com/jqlang/jq/wiki/X---Experimental-Benchmarks) and treat it as raw text.

jqMaster:

< jeopardy.json time jq -Rs 'explode | implode | length'
55554625
        2.77 real         2.35 user         0.39 sys

jqMaster with NDEBUG

< jeopardy.json time jq.NDEBUG -Rs 'explode | implode | length'
55554625
        2.65 real         2.23 user         0.38 sys

@wader
Copy link
Member Author

wader commented Jul 3, 2023

@wader - My (perhaps flawed) understanding has been that one of the reasons why jq's implode has relied on assertion checking is that it's easy to turn assertion checking off, so anyone for whom a need for speed trumps other concerns can achieve it.

In this case i think it's a bug, at least the cli should never assert like this. By reading the code i looks like the idea is that the builtins wrapper functions f_* in builtins.c like f_string_implode are the ones that take care of most of the "user facing" error handling and then call more generic and reusable functions like jv_string_implode that assumes (but asserts) that they get valid input. But it seems to be a bit of a mix, so maybe for performance reasons it would be ok to return jv errors from jv_string_implode to skip iterating twice.

(also i said util.c which is wrong, jv_string_implode is in jv.c)

In any case, I think it would helpful to track the impact on speed of updates to implode.

One reasonable approach would be to take a standard JSON test file (e.g. jeopardy.json - see https://github.com/jqlang/jq/wiki/X---Experimental-Benchmarks) and treat it as raw text.

jqMaster:

< jeopardy.json time jq -Rs 'explode | implode | length'
55554625
        2.77 real         2.35 user         0.39 sys

jqMaster with NDEBUG

< jeopardy.json time jq.NDEBUG -Rs 'explode | implode | length'
55554625
        2.65 real         2.23 user         0.38 sys

Yes track that is a good idea

tests/jq.test Outdated

map(try implode catch .)
[123,["a"]]
["implode input must be an array","string (\"a\") codepoint must be a number"]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe should be formulated as string ("a") is not a number codepoint etc?

@wader wader changed the title implode: Better implode input validation implode: Better invalid input validation and handling Jul 3, 2023
@wader
Copy link
Member Author

wader commented Jul 3, 2023

Rebased and now also replaces surrogate pair range with replacement character, same as gojq. But i noticed now that gojq throws error on negative and > 1114111

@itchyny itchyny added this to the 1.7 release milestone Jul 5, 2023
@itchyny itchyny added the bug label Jul 5, 2023
src/builtin.c Outdated Show resolved Hide resolved
@wader
Copy link
Member Author

wader commented Jul 15, 2023

What do ppl think about invalid and surrogate pair numbers? error or replacement?

@nicowilliams
Copy link
Contributor

nicowilliams commented Jul 16, 2023

What do ppl think about invalid and surrogate pair numbers? error or replacement?

I think that asking questions like that is sure to cause strenuous debate. Somehow it didn't -- maybe no one noticed you asking. 😆

Anyways, my preference would be replacement (think of the Postel principle), but wait! no! error! (because the Postel principle causes more troubles than it saves, and because we want to be able to validate JSON with just jq -e ., no?). A strenuous debate in my own mind. Ugh.

EDIT: Well, if we don't document this then we're free to change the behavior later. I say go for replacement for now.

@wader wader force-pushed the implode-codepoint-check branch 3 times, most recently from 4fdab77 to 77dba74 Compare July 24, 2023 11:41
@wader
Copy link
Member Author

wader commented Jul 24, 2023

What do ppl think about invalid and surrogate pair numbers? error or replacement?

I think that asking questions like that is sure to cause strenuous debate. Somehow it didn't -- maybe no one noticed you asking. 😆

Anyways, my preference would be replacement (think of the Postel principle), but wait! no! error! (because the Postel principle causes more troubles than it saves, and because we want to be able to validate JSON with just jq -e ., no?). A strenuous debate in my own mind. Ugh.

EDIT: Well, if we don't document this then we're free to change the behavior later. I say go for replacement for now.

😄 Ok! that is the current behaviour in the PR.

@itchyny not sure if you want to adjust gojq to behave the same, current it throws error for > 1114111 and < 0 on implode.

@wader
Copy link
Member Author

wader commented Jul 24, 2023

Not sure what is going on with github, pushed new commit but still says conflict and no actions have run yet... i guess we wait a bit

@emanuele6
Copy link
Member

emanuele6 commented Jul 24, 2023

$ gh pr checkout 2646
From https://github.com/jqlang/jq
 * branch            refs/pull/2646/head -> FETCH_HEAD
Already up to date.
$ git rebase master
Auto-merging tests/jq.test
CONFLICT (content): Merge conflict in tests/jq.test
error: could not apply 77dba74... implode: Better invalid input validation and handling
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
Could not apply 77dba74... implode: Better invalid input validation and handling

@wader github is reporting a legitimate conflict: this PR is based on 98f709d, and after that commit, 1cf6515 on master adds lines at the end of tests/jq.test. That commit conflicts with yours since your commit also adds lines at the end of tests/jq.test.

Basically, git can't tell if you want to add the lines after the previous end of the file, and before the lines added by 1cf6515, or if you want to add the new lines after the new end of the file.

@wader
Copy link
Member Author

wader commented Jul 24, 2023

Basically, git can't tell if you want to add the lines after the previous end of the file, and before the lines added by 1cf6515, or if you want to add the new lines after the new end of the file.

Ah thanks, that makes sense. Think got confused that i fixed the conflict in jq.test and pushed a while later and there was more commits to master cause a conflict in jq.test again :)

@wader wader force-pushed the implode-codepoint-check branch 2 times, most recently from c1ac128 to 6fbf341 Compare July 24, 2023 12:54
src/builtin.c Show resolved Hide resolved
@nicowilliams
Copy link
Contributor

Assertions in src/execute.c, yes. Assertions in src/builtin.c, not so much, yeah.

@wader wader force-pushed the implode-codepoint-check branch 2 times, most recently from 100c3f7 to 80d8e2f Compare July 24, 2023 21:38
src/builtin.c Outdated Show resolved Hide resolved
@wader
Copy link
Member Author

wader commented Jul 25, 2023

Think this is ready now, only thing i don't like is the error message "string ("a") codepoint must be a number sounds a bit weird or?

@nicowilliams
Copy link
Contributor

Think this is ready now, only thing i don't like is the error message "string ("a") codepoint must be a number sounds a bit weird or?

How about "only numeric Unicode codepoints may be imploded"? Or something to that effect. The point being to give the user a clue that the error has to do with implode.

@nicowilliams
Copy link
Contributor

And yes, modulo any improvement you want to make to that error message, this is ready!

Error on non-number and nan codepoint, would asserd before
Replace negative codepoint and surrogate range with unicode replacement character, would assert before

Fixes jqlang#1160
@wader
Copy link
Member Author

wader commented Jul 25, 2023

How about "only numeric Unicode codepoints may be imploded"? Or something to that effect. The point being to give the user a clue that the error has to do with implode.

Changed to string ("a") can't be imploded, unicode codepoint needs to be numeric". Errors seem to be lowercase usually?

On the topic of errors: it would be great if more error messages had some kind of "source" prefix etc like function name. Now a days i think i've learned to recognize most commcon errors to know where they usually come from :) how hard would a stack trace be with source location etc? need to track more things? sorry for derailing, maybe should open an issue about it to collect stuff?

@nicowilliams
Copy link
Contributor

how hard would a stack trace be with source location etc? need to track more things? sorry for derailing, maybe should open an issue about it to collect stuff?

Oh, I've wanted a stack trace, yeah. It would have to have column numbers too, not just line numbers.

@nicowilliams nicowilliams merged commit a949745 into jqlang:master Jul 25, 2023
28 checks passed
@nicowilliams
Copy link
Contributor

Thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Core dumped on implode
5 participants