Dedicated crash handler thread. #21826

neunhoef · 2025-06-27T13:00:05Z

Scope & Purpose

This PR implements a dedicated crash handler thread.
The real signal handler triggers the dedicated thread to
write out useful information before it continues with the crash.
This is useful since a dedicated thread has none of the restrictions
of a signal handler and can, for example, use system calls.

This allows us to dump a lot more information out at crash time.

[*] 💩 Bugfix
[*] 🍕 New feature

Checklist


goedderz · 2025-06-30T09:55:45Z

lib/CrashHandler/CrashHandler.cpp

+/// @brief dedicated crash handler thread
+std::unique_ptr<std::thread> crashHandlerThread = nullptr;


Should we use a std::jthread instead?

Done, but it does not really make a difference, as far as I see.

I think it's a better default, and there's generally very little reason to use std::thread.

Mostly it will be automatically joined on destruction instead of terminating the process, if it hasn't been joined or detached, and it also allows the use of stop tokens/sources.

goedderz · 2025-06-30T09:58:18Z

lib/CrashHandler/CrashHandler.cpp

+/// @brief atomic variable for coordinating between signal handler and crash
+/// handler thread
+std::atomic<arangodb::CrashHandlerState> crashHandlerState(
+    arangodb::CrashHandlerState::IDLE);
+
+/// @brief dedicated crash handler thread
+std::unique_ptr<std::thread> crashHandlerThread = nullptr;
+
+/// @brief mutex to protect thread creation/destruction
+std::mutex crashHandlerThreadMutex;


I think these variables should be tied together, with crashHandlerThreadFunction(), either into a common struct or a namespace.

They are collected in a common anonymous namespace.

goedderz · 2025-06-30T10:10:09Z

lib/CrashHandler/CrashHandler.cpp

+  while (true) {
+    arangodb::CrashHandlerState state =
+        crashHandlerState.load(std::memory_order_acquire);


Note that, since C++20, a std::atomic has methods wait(T value), notify_one() and notify_all(). And if I read https://en.cppreference.com/w/cpp/utility/program/signal correctly, calling these from the signal handler should be OK, as long as the atomic is lock free (which it has to be anyway). I think it's covered by "f is a non-static member function invoked on an object obj, such that obj.is_lock_free() yields true". But please double-check if you agree.

That would simplify this code and the wait function in the crash handler, and make it more polite/efficient.

In addition, if we use std::jthread (see my other comment), we can use stop token handling to address the thread shutdown instead of the atomic. This might be nicer to use (also see my comment on shutdown) and separate the concerns a little.

goedderz · 2025-06-30T10:15:16Z

lib/CrashHandler/CrashHandler.cpp

+    // Signal the dedicated crash handler thread (force trigger even if not
+    // idle)
+    ::crashHandlerState.store(arangodb::CrashHandlerState::CRASH_DETECTED,
+                              std::memory_order_release);


I would vote for both commenting and asserting for everything we do here that we're allowed to do it in the signal handler (or that it's not allowed and we're doing it anyway). So here I'd add

static_assert(decltype(::crashHandlerState)::is_always_lock_free);

or so, and a comment that the store is allowed under this condition.
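As a sketch of what such an assertion plus comment could look like in isolation (the names `crashState`, `onSignal` and `raiseAndReadState` are hypothetical, and `SIGTERM` is used only because it is harmless to raise in a test):

```cpp
#include <atomic>
#include <csignal>

// Hypothetical stand-in for ::crashHandlerState.
std::atomic<int> crashState{0};

// Operations on a lock-free atomic are among the few things a signal
// handler is allowed to do, so we assert that property at compile time.
static_assert(decltype(crashState)::is_always_lock_free,
              "store in the signal handler requires a lock-free atomic");

extern "C" void onSignal(int) {
  // Allowed in a signal handler because crashState is lock free (see above).
  crashState.store(1, std::memory_order_release);
}

// Installs the handler, raises the signal synchronously and reads the state.
int raiseAndReadState() {
  std::signal(SIGTERM, onSignal);
  std::raise(SIGTERM);  // the handler runs before raise() returns
  std::signal(SIGTERM, SIG_DFL);
  return crashState.load(std::memory_order_acquire);
}
```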

goedderz · 2025-06-30T10:21:34Z

lib/CrashHandler/CrashHandler.cpp

+    // Log signal-specific information in the signal handler
    logCrashInfo("signal handler invoked", signal, info, ucontext);
+    // Morally, we should do this after the work of the crash handler thread,
+    // however, for backwards compatibility, we keep it here as first thing.


I think it is actually preferable to do it in the crash handler thread as much as possible, to reduce or eliminate the amount of UB we have in the crash handler, at least for the actual logging.

goedderz · 2025-06-30T10:23:18Z

lib/CrashHandler/CrashHandler.cpp

+    // Now log the backtrace from the crashed thread (only the crashed thread
+    // can do this)
    logBacktrace();
-    logProcessInfo();
+
+    // Final cleanup
    arangodb::Logger::flush();
    arangodb::Logger::shutdown();


Same here, I think it would be preferable to do this in the crash handler thread as much as possible (would need some refactoring/splitting of logBacktrace()).

goedderz · 2025-06-30T10:43:47Z

lib/CrashHandler/CrashHandler.h

+  /// @brief shuts down the crash handler thread
+  static void shutdownCrashHandler();


I don't think it's workable to have a function that must be called before each call to exit or similar. As of now, you've addressed an exit call in arangod.cpp (there is even a second one there) - and how does this affect the client-tools? And you've addressed two exit()s in the VersionFeature, but it seems we have at least a dozen other places calling exit.

I haven't thought it through, so there may be problems, but some ideas how we might address this:

1. If we move the crashHandlerThread variable into a struct, we might be able to stop the thread in that struct's destructor.

2. We could install handlers with std::atexit and std::at_quick_exit (I don't think std::set_terminate is needed?).

3. Don't install the thread globally, but with a scoped variable that we put in runServer, probably where we call CrashHandler::installCrashHandler(). It might then make sense to move installCrashHandler() into the constructor of the object, and maybe even remove the handlers in the destructor - that would clarify exactly in which scope the CrashHandler, including its handlers and thread, is active.

Intuitively I like the third option best, and the second the least.

I will try to do this as much as possible. Something with static lifetime will be necessary, or else the signal handler (which can be executed any time from any thread) and the std::terminate handler (which is executed on assertion failure or other catastrophic events, again essentially any time in any thread) do not have access to it. So making it into a totally scoped variable will not be possible.
I will try to avoid the explicit shutdown by additionally creating a scoped CrashHandler object (which can do the normal initialization and shutdown). But an atexit handler will be needed anyway, since we have these places which answer to --version, which use exit to stop the process right away. In this case the scoped variable is no longer cleaned up properly. Let's see what I can come up with.

Done as discussed.

goedderz · 2025-06-30T10:44:33Z

lib/CrashHandler/CrashHandler.h

+/// @brief States for the crash handler thread coordination
+enum class CrashHandlerState : int {
+  IDLE = 0,               ///< idle state, waiting for crashes
+  CRASH_DETECTED = 1,     ///< crash detected, handling in progress
+  HANDLING_COMPLETE = 2,  ///< crash handling complete
+  SHUTDOWN = 3            ///< shutdown requested
+};


If this follows (only) specific state transitions, these should be noted in a comment.


goedderz · 2025-07-02T14:07:42Z

lib/CrashHandler/CrashHandler.cpp

+std::unique_ptr<char[]> backtraceBuffer;
+constexpr size_t backtraceBufferSize = 1024 * 1024;  // 1MB


You could use a std::array instead of a char[]. That would encode the length in the type, and thus make it slightly easier to access. Not a necessity though.

And not sure if there's a noteworthy tradeoff, but we could also omit the heap allocation and use static memory instead, e.g.

Suggested change

-std::unique_ptr<char[]> backtraceBuffer;
-constexpr size_t backtraceBufferSize = 1024 * 1024;  // 1MB
+constexpr size_t backtraceBufferSize = 1024 * 1024;  // 1MB
+std::array<char, backtraceBufferSize> backtraceBuffer;

Do you think it matters? Don't change it unless you prefer it, it's just a thought, I have no preference.
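For comparison, a minimal sketch of the static-memory variant. The helper `appendToBuffer` is hypothetical and not part of the PR. The main tradeoff visible here is that the buffer lives in static storage for the whole process lifetime, so there is no allocation that can fail and no null check before use.

```cpp
#include <array>
#include <cstddef>
#include <cstring>
#include <string_view>

constexpr std::size_t backtraceBufferSize = 1024 * 1024;  // 1MB
// Static storage duration: zero-initialized before main(), never allocated
// on the heap, and its size is part of the type.
std::array<char, backtraceBufferSize> backtraceBuffer;

// Hypothetical helper: copies `text` into the buffer at `offset` and returns
// the new offset, or the unchanged offset if the text does not fit.
std::size_t appendToBuffer(std::string_view text, std::size_t offset) {
  if (text.empty() || offset > backtraceBuffer.size() ||
      text.size() > backtraceBuffer.size() - offset) {
    return offset;
  }
  std::memcpy(backtraceBuffer.data() + offset, text.data(), text.size());
  return offset + text.size();
}
```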

goedderz · 2025-07-02T14:13:10Z

lib/CrashHandler/CrashHandler.cpp


 /// @brief mutex to protect thread creation/destruction
 std::mutex crashHandlerThreadMutex;

+// Static variables to store crash data for the dedicated crash handler thread
+/// @brief stores the crash context/reason
+static char const* crashContext = nullptr;


Maybe use a std::string_view instead of a char*? The callers have one available anyway, as far as I can see.

goedderz · 2025-07-02T15:31:31Z

lib/CrashHandler/CrashHandler.h

+  std::atomic<bool> _threadRunning{false};
+  static std::atomic<CrashHandler*> _theCrashHandler;


Can we omit _threadRunning, and do a CAS on _theCrashHandler instead?
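A sketch of how a compare-and-swap on the instance pointer could replace a separate boolean flag. The class `Handler` and its `install`/`uninstall` methods are hypothetical; only the idea of doing a CAS on the `_theCrashHandler`-style pointer comes from the comment above.

```cpp
#include <atomic>

// Hypothetical, reduced stand-in for the CrashHandler class.
class Handler {
 public:
  // Returns true only for the one object that becomes the active instance.
  bool install() {
    Handler* expected = nullptr;
    return active.compare_exchange_strong(expected, this,
                                          std::memory_order_acq_rel);
  }

  // Returns true only if this object was the active instance.
  bool uninstall() {
    Handler* expected = this;
    return active.compare_exchange_strong(expected, nullptr,
                                          std::memory_order_acq_rel);
  }

 private:
  // A non-null pointer doubles as the "thread is running" flag.
  static inline std::atomic<Handler*> active{nullptr};
};
```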

goedderz · 2025-07-02T15:35:06Z

lib/CrashHandler/CrashHandler.cpp

+  // We intentionally do not protect against two signal handlers running
+  // concurrently. We could do this using atomics, but choose not to
+  // for simplicity.


I think it would be simple enough to protect against this:
E.g., change crashSignal to be an atomic, set it first but with a CAS, and set the rest only if that succeeded.

Actually, I've just seen that this is handled by ::crashHandlerInvoked.exchange(true) in crashHandlerSignalHandler - so maybe just clarify that in the comment?
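The guard referred to above can be sketched in isolation as follows. The names `handlerInvoked` and `tryEnterCrashHandler` are assumptions for illustration; the PR's own flag is `::crashHandlerInvoked`.

```cpp
#include <atomic>

// Hypothetical stand-in for ::crashHandlerInvoked.
std::atomic<bool> handlerInvoked{false};

// Returns true for exactly one caller: exchange() returns the previous
// value, so only the first caller sees `false` and may proceed.
bool tryEnterCrashHandler() {
  return !handlerInvoked.exchange(true, std::memory_order_acq_rel);
}
```

Any later signal handler invocation gets `false` back and can skip writing the shared crash data.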

goedderz · 2025-07-02T15:51:03Z

lib/CrashHandler/CrashHandler.cpp

+  SmallString headerBuffer;
+  headerBuffer.append("Backtrace of thread ");
+  headerBuffer.appendUInt64(arangodb::Thread::currentThreadNumber());
+  headerBuffer.append(" [").append(currentThreadName).append("]\n");

-  LOG_TOPIC("c962b", INFO, arangodb::Logger::CRASH) << buffer.view();
+  if (!safeAppend(headerBuffer.view())) {
+    return totalWritten;
+  }


Maybe wrap it in a {} block, to make it clear that we don't add up too many of those 4kb buffers on the stack. Though the optimizer might catch that anyway...

goedderz · 2025-07-02T15:52:06Z

lib/CrashHandler/CrashHandler.cpp

+/// in a signal handler. Use `acquireBacktrace` above to get the backtrace
+/// in the signal handler and then this function in the dedicated
+/// CrashHandler thread to do the actual logging!
+void logAcquiredBacktrace(char const* buffer, size_t bufferSize) try {


Maybe pass a std::string_view instead of char* + size_t?

goedderz · 2025-07-02T16:04:26Z

lib/CrashHandler/CrashHandler.cpp

+  size_t pos = 0;
+  size_t lineStart = 0;
+
+  while (pos <= bufferSize) {
+    if (pos == bufferSize || buffer[pos] == '\n') {
+      if (pos > lineStart) {
+        std::string_view line(buffer + lineStart, pos - lineStart);


If you change buffer into a std::string_view, you can just do

Suggested change

-  size_t pos = 0;
-  size_t lineStart = 0;
-
-  while (pos <= bufferSize) {
-    if (pos == bufferSize || buffer[pos] == '\n') {
-      if (pos > lineStart) {
-        std::string_view line(buffer + lineStart, pos - lineStart);
+  for (auto const line : std::views::split(buffer, "\n")) {

instead, if you include <ranges>. :)

goedderz · 2025-07-02T16:29:12Z

lib/CrashHandler/CrashHandler.cpp

+  } catch (...) {
+    // could not allocate memory for backtrace buffer.
+    // this is not critical - we can still work without it
+    ::backtraceBuffer.release();
+  }


Maybe add a log message? Seems unlikely, but may confuse us very much without one if it does happen.

goedderz · 2025-07-02T16:32:22Z

lib/CrashHandler/CrashHandler.cpp

@@ -685,6 +840,20 @@ void CrashHandler::installCrashHandler() {

    CrashHandler::crash(buffer.view());
  });
+
+  std::atexit([]() {
+    CrashHandler* ch = CrashHandler::_theCrashHandler;


Maybe do a

Suggested change

-    CrashHandler* ch = CrashHandler::_theCrashHandler;
+    auto* ch = CrashHandler::_theCrashHandler.exchange(nullptr);

instead? And similarly in std::at_quick_exit and ~CrashHandler(). That could avoid races between ~CrashHandler() and the exit handlers.
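To make the race argument concrete, here is a sketch under stated assumptions: `ScopedCrashHandler`, `shutdownOnce`, `stopCalls` and `registerExitHook` are hypothetical names, and `stopThread()` is a placeholder for stopping and joining the real thread. Because every path swaps the pointer out with `exchange(nullptr)`, only one of the destructor and the exit handler ever gets a non-null pointer and performs the shutdown.

```cpp
#include <atomic>
#include <cstdlib>

// Hypothetical, reduced stand-in for the scoped CrashHandler object.
class ScopedCrashHandler {
 public:
  ScopedCrashHandler() { instance.store(this, std::memory_order_release); }
  ~ScopedCrashHandler() { shutdownOnce(); }
  ScopedCrashHandler(ScopedCrashHandler const&) = delete;
  ScopedCrashHandler& operator=(ScopedCrashHandler const&) = delete;

  // Callable from the destructor and from exit handlers alike: only the
  // caller that swaps out a non-null pointer performs the shutdown.
  static void shutdownOnce() {
    if (ScopedCrashHandler* h =
            instance.exchange(nullptr, std::memory_order_acq_rel)) {
      h->stopThread();
    }
  }

  static inline int stopCalls = 0;  // only for observing the sketch

 private:
  void stopThread() { ++stopCalls; }  // placeholder for the real shutdown
  static inline std::atomic<ScopedCrashHandler*> instance{nullptr};
};

// Covers early exit() calls, where the scoped object is never destroyed.
inline void registerExitHook() {
  std::atexit([] { ScopedCrashHandler::shutdownOnce(); });
}
```

The same `shutdownOnce()` call could be registered with `std::at_quick_exit` where that function is available.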

First draft and experiment.

c1d34c2

neunhoef added this to the devel milestone Jun 27, 2025

neunhoef self-assigned this Jun 27, 2025

cla-bot bot added the cla-signed label Jun 27, 2025

neunhoef added 7 commits June 27, 2025 15:24

Clean up code.

a0e27a6

CHANGELOG.

abc4216

Take out test assertion again.

4a0e661

Increase backwards compatibility of the output.

27fb623

Shut down crash handler thread when --version is used.

1cc8b35

Merge remote-tracking branch 'origin/devel' into feature/crash-handle…

4d39c2c

…r-thread

Merge remote-tracking branch 'origin/devel' into feature/crash-handle…

5b45310

…r-thread

goedderz requested changes Jun 30, 2025


neunhoef added 12 commits June 30, 2025 14:59

Use atomic wait/notify_all for crash handler thread.

3704b8e

Merge remote-tracking branch 'origin/devel' into feature/crash-handle…

dded8d4

…r-thread

Explain state transitions.

f308917

Cleaner shutdown.

b5950e6

Take out commented out code.

62e81ff

clang-format.

1f33096

First phase of moving stuff to dedicated crash handler thread.

70b04f0

Rework suggested by reviewer.

ebf3c43

Take assertion out.

fb9813c

Remove additional string for backwards compatibilty.

4b8523d

Add CrashHandler as thread name.

c09a7c6

Thread exceptions.

c4d6bea

goedderz reviewed Jul 2, 2025

