Introduction

Core dump generation on Linux and other non-Windows platforms has several challenges. Dumps can be very large, and the default name and location of a dump are not consistent across all our supported platforms. The size of a full core dump can be controlled somewhat with the /proc/$pid/coredump_filter flags, but even with the smallest settings the dump may still be too large and may not contain all the managed state needed for debugging. By default, some platforms use "core" as the name and place the core dump in the current directory from which the program was launched; others add the pid to the name. Configuring the core dump name and location requires superuser permission, and requiring superuser access just to make this consistent is not a satisfactory option.

Our goal is to generate core dumps that are on par with WER (Windows Error Reporting) crash dumps on any supported Linux platform. At the very least we want to enable the following:

  • automatic generation of minimal-size minidumps. The quality and quantity of the information contained in the dump should be on par with the information contained in a traditional Windows mini-dump.
  • simple configurability by the user (not superuser!).

Our solution at this time is to intercept any unhandled exception in the PAL layer of the runtime and have coreclr itself trigger the generation of a "mini" core dump.
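
As an illustration of that interception point, here is a minimal sketch of registering handlers for the fatal signals and routing them to a dump-trigger routine. The handler and helper names are hypothetical, not the actual PAL code.

```cpp
#include <signal.h>

// Hypothetical helper (not the real PAL function name): forks/execs the dump
// utility and waits for it, as sketched later in this document.
extern "C" void TriggerCrashDump(int sig);

static void FatalSignalHandler(int sig, siginfo_t* info, void* context)
{
    (void)info;
    (void)context;
    TriggerCrashDump(sig);

    // Restore the default disposition and re-raise so the OS still performs
    // its normal fatal-signal handling after the dump has been written.
    signal(sig, SIG_DFL);
    raise(sig);
}

static void InstallFatalSignalHandlers()
{
    const int fatalSignals[] = { SIGSEGV, SIGILL, SIGFPE, SIGBUS, SIGABRT };

    struct sigaction sa = {};
    sa.sa_flags = SA_SIGINFO;              // use the three-argument handler form
    sa.sa_sigaction = FatalSignalHandler;
    sigemptyset(&sa.sa_mask);

    for (int sig : fatalSignals)
    {
        sigaction(sig, &sa, nullptr);
    }
}
```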

Design

We looked at existing technologies like Breakpad and its derivatives (e.g. an internal Microsoft version called msbreakpad from the SQL team). Breakpad generates Windows-style minidumps, but they are not compatible with existing tools like WinDbg; msbreakpad even more so. There is a minidump-to-Linux-core conversion utility, but it seems like a wasted extra step. Breakpad does allow the minidump to be generated in-process inside the signal handlers. It restricts the APIs to what is allowed in an "async" signal handler (like SIGSEGV) and has a small subset of the C++ runtime that is similarly constrained. We also need to add the set of memory regions for the "managed" state, which requires loading and using the DAC's (*) enumerate-memory interfaces. Loading modules is not allowed in an async signal handler, but fork/execve is allowed, so launching a utility that loads the DAC, enumerates the list of memory regions, and writes the dump is the only reasonable option. It would also allow uploading the dump to a server.

* The DAC is a special build of parts of the coreclr runtime that allows inspection of the runtime's managed state (stacks, variables, GC heap state) out of context. One of the many interfaces it provides is ICLRDataEnumMemoryRegions, which enumerates all the managed state a minidump requires to enable a fruitful debugging experience.
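
For illustration, the following rough sketch shows how a dump writer might collect regions through ICLRDataEnumMemoryRegions: the DAC calls the supplied callback once per address/size pair it wants in the dump. It assumes the CoreCLR debug headers (clrdata.h) and an already-created DAC interface instance; it is not the actual createdump code.

```cpp
// Sketch only: assumes clrdata.h from the CoreCLR source tree and an
// ICLRDataEnumMemoryRegions instance obtained from the loaded DAC.
#include <vector>
#include "clrdata.h"

struct MemoryRegion { CLRDATA_ADDRESS address; ULONG32 size; };

class RegionCollector : public ICLRDataEnumMemoryRegionsCallback
{
    std::vector<MemoryRegion> m_regions;
    LONG m_ref = 1;

public:
    // The DAC calls this once for every region the minidump needs.
    STDMETHODIMP EnumMemoryRegion(CLRDATA_ADDRESS address, ULONG32 size) override
    {
        m_regions.push_back({ address, size });
        return S_OK;
    }

    const std::vector<MemoryRegion>& Regions() const { return m_regions; }

    // Minimal IUnknown plumbing for the sketch.
    STDMETHODIMP QueryInterface(REFIID riid, void** ppv) override
    {
        // A real implementation would compare riid against IID_IUnknown and
        // the callback interface's IID; abbreviated here.
        (void)riid;
        *ppv = static_cast<ICLRDataEnumMemoryRegionsCallback*>(this);
        AddRef();
        return S_OK;
    }
    STDMETHODIMP_(ULONG) AddRef() override { return (ULONG)(++m_ref); }
    STDMETHODIMP_(ULONG) Release() override
    {
        LONG ref = --m_ref;
        if (ref == 0) delete this;
        return (ULONG)ref;
    }
};

// Usage, given enumMemory as an ICLRDataEnumMemoryRegions* from the DAC
// (flag names as defined in the CoreCLR headers):
//   RegionCollector* collector = new RegionCollector();
//   enumMemory->EnumMemoryRegions(collector, MiniDumpNormal, CLRDATA_ENUM_MEM_DEFAULT);
```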

Breakpad could still have been used out of context in the generation utility, but there seemed to be no value in its Windows-like minidump format when it would have to be converted to the native Linux core format anyway, because in most scenarios using platform tools like lldb is necessary. It would also add a coreclr build dependency on Google's Breakpad or SQL's msbreakpad source repo. The only advantage is that the Breakpad minidumps may be a little smaller, because minidump memory regions have byte granularity while Linux core memory regions must be page-granular.

Implementation Details

Linux

Core dump generation is triggered any time coreclr is about to abort the process (via PROCAbort()) because of an unhandled managed exception or an async signal like SIGSEGV, SIGILL, SIGFPE, etc. The createdump utility is located in the same directory as libcoreclr.so and is launched with fork/execve. The child createdump process is given permission to ptrace the crashing process and to access its various special /proc files; the crashing process waits until createdump finishes.
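
A simplified sketch of that launch sequence follows. It assumes prctl(PR_SET_PTRACER, ...) is the mechanism used to grant the child tracing rights under the Yama LSM; the helper name and argument handling are illustrative, not the actual PAL code.

```cpp
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical launcher: fork/execve createdump from the crashing process and
// block until it finishes. As noted in the design discussion, fork/execve can
// be used from the async signal handler, unlike loading modules.
static bool LaunchCreateDump(const char* createdumpPath, char* const argv[])
{
    pid_t child = fork();
    if (child == -1)
    {
        return false;
    }
    if (child == 0)
    {
        // Child: become createdump. The environment is empty for simplicity.
        char* const envp[] = { nullptr };
        execve(createdumpPath, argv, envp);
        _exit(127);                        // only reached if execve failed
    }

    // Parent (the crashing process): allow the child to ptrace us even when
    // the Yama LSM restricts PTRACE_ATTACH (assumption: this is the permission
    // mechanism referred to above). Real code would also synchronize so the
    // child does not attach before this call completes.
#ifdef PR_SET_PTRACER
    prctl(PR_SET_PTRACER, child, 0, 0, 0);
#endif

    // Block until the dump has been written.
    int status = 0;
    waitpid(child, &status, 0);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```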

The createdump utility starts by using ptrace to enumerate and suspend all the threads in the target process. The process and thread info (status, registers, etc.) is gathered. The auxv entries and DSO info are enumerated. The DSO info is the in-memory data structure that describes the shared modules loaded by the target. This memory is needed in the dump so that gdb and lldb can enumerate the loaded shared modules and access their symbols. The module memory mappings are gathered from /proc/$pid/maps. None of the program or shared module memory regions are explicitly added to the dump's memory regions. The DAC is loaded and the enumerate-memory-region interfaces are used to build the memory region list, just like on Windows. The thread stacks and one page of code around each thread's IP are added. The byte-sized regions are rounded up to pages and then combined into contiguous regions.
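
The thread suspension and register capture can be sketched roughly as follows (x86/x64 only, minimal error handling; not the actual createdump code):

```cpp
#include <dirent.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Attach to (and thereby suspend) every thread of the target process by
// walking /proc/<pid>/task, returning the thread ids that were attached.
static std::vector<pid_t> AttachToAllThreads(pid_t pid)
{
    std::vector<pid_t> tids;
    char taskDir[64];
    snprintf(taskDir, sizeof(taskDir), "/proc/%d/task", pid);

    DIR* dir = opendir(taskDir);
    if (dir == nullptr)
        return tids;

    while (dirent* entry = readdir(dir))
    {
        pid_t tid = (pid_t)atoi(entry->d_name);
        if (tid <= 0)
            continue;                          // skip "." and ".."
        if (ptrace(PTRACE_ATTACH, tid, nullptr, nullptr) != -1)
        {
            waitpid(tid, nullptr, __WALL);     // wait for the stop to take effect
            tids.push_back(tid);
        }
    }
    closedir(dir);
    return tids;
}

// Read one suspended thread's registers. PTRACE_GETREGS is x86/x64-specific;
// other architectures would use PTRACE_GETREGSET.
static bool GetThreadRegisters(pid_t tid, user_regs_struct* regs)
{
    return ptrace(PTRACE_GETREGS, tid, nullptr, regs) != -1;
}
```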

All the memory mappings from /proc/$pid/maps are included as PT_LOAD entries even though their memory is not actually in the dump; those entries have a file offset/size of 0.

After all the process crash information has been gathered, the ELF core dump is written. The main ELF header is created and written. The PT_LOAD program headers are written, one entry for each memory region in the dump. The process info, auxv data, and NT_FILE entries are written as ELF notes. The NT_FILE entries are built from the module memory mappings from /proc/$pid/maps. The thread states and registers are then written. Lastly, all the memory regions gathered above (by the DAC, etc.) are read from the target process and written to the core dump. All the threads in the target process are then resumed and createdump terminates.
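
A skeleton of the ELF header and PT_LOAD program header writing described above might look like the following (64-bit little-endian x64 assumed; the PT_NOTE segment and its note payloads are omitted for brevity, and the region type is illustrative):

```cpp
#include <elf.h>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

struct DumpRegion { uint64_t startAddr; uint64_t size; uint64_t fileOffset; bool inDump; };

// Emit the ELF header and one PT_LOAD program header per memory region.
// Regions that are listed but not written (e.g. module mappings) get
// p_filesz == 0, matching the behavior described above. A real writer also
// emits a PT_NOTE segment (NT_PRSTATUS, NT_AUXV, NT_FILE) before the memory.
static bool WriteCoreHeaders(FILE* core, const std::vector<DumpRegion>& regions)
{
    Elf64_Ehdr ehdr;
    memset(&ehdr, 0, sizeof(ehdr));
    memcpy(ehdr.e_ident, ELFMAG, SELFMAG);
    ehdr.e_ident[EI_CLASS]   = ELFCLASS64;
    ehdr.e_ident[EI_DATA]    = ELFDATA2LSB;
    ehdr.e_ident[EI_VERSION] = EV_CURRENT;
    ehdr.e_type      = ET_CORE;
    ehdr.e_machine   = EM_X86_64;                       // architecture-specific
    ehdr.e_version   = EV_CURRENT;
    ehdr.e_phoff     = sizeof(Elf64_Ehdr);
    ehdr.e_ehsize    = sizeof(Elf64_Ehdr);
    ehdr.e_phentsize = sizeof(Elf64_Phdr);
    ehdr.e_phnum     = (Elf64_Half)regions.size();      // + 1 for PT_NOTE in real code
    if (fwrite(&ehdr, sizeof(ehdr), 1, core) != 1)
        return false;

    for (const DumpRegion& region : regions)
    {
        Elf64_Phdr phdr;
        memset(&phdr, 0, sizeof(phdr));
        phdr.p_type   = PT_LOAD;
        phdr.p_vaddr  = region.startAddr;
        phdr.p_memsz  = region.size;
        phdr.p_offset = region.inDump ? region.fileOffset : 0;
        phdr.p_filesz = region.inDump ? region.size : 0;  // 0 => not backed in the dump
        phdr.p_flags  = PF_R | PF_W;                      // real code uses the mapping's permissions
        phdr.p_align  = 0x1000;                           // page granularity
        if (fwrite(&phdr, sizeof(phdr), 1, core) != 1)
            return false;
    }
    return true;
}
```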

Severe memory corruption

As long as control can make it to the signal/abort handler and the fork/execve of the utility succeeds, the DAC memory enumeration interfaces can handle corruption up to a point; the resulting dump just may not have enough managed state to be useful. We could investigate detecting this case and writing a full core dump.

Stack overflow exception

As in the severe memory corruption case, if the signal handler (SIGSEGV) gets control, it can detect most stack overflow cases and does trigger a core dump. There are still many cases where this doesn't happen and the OS just terminates the process.
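
For the SIGSEGV handler to get control on a stack overflow at all, it has to run on an alternate signal stack. A minimal sketch of that setup (the size is illustrative, and sigaltstack must be called on each thread):

```cpp
#include <signal.h>
#include <cstdlib>

// Register an alternate signal stack so the SIGSEGV handler can run even when
// the faulting thread has exhausted its own stack. Without sigaltstack plus
// SA_ONSTACK on the handler, a stack overflow usually kills the process
// before any handler runs. Note: the alternate stack is per-thread.
static bool InstallAlternateSignalStack()
{
    stack_t ss = {};
    ss.ss_size  = 8 * SIGSTKSZ;            // illustrative size
    ss.ss_sp    = malloc(ss.ss_size);
    ss.ss_flags = 0;
    if (ss.ss_sp == nullptr)
        return false;
    return sigaltstack(&ss, nullptr) == 0;
}

// When registering the fault handler, SA_ONSTACK must be added to sa_flags:
//   sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
```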

FreeBSD/OpenBSD/NetBSD

There will be some differences in gathering the crash information, but these platforms still use ELF format core dumps, so that part of the utility shouldn't be much different. The mechanism used on Linux to give createdump permission to use ptrace and access /proc doesn't exist on these platforms.

OS X

Gathering the crash information on OS X will be quite a bit different from Linux, and the core dump will be written in the Mach-O format instead of ELF. OS X support has not currently been implemented.

Configuration/Policy

Any configuration or policy is set with environment variables which are passed as options to the createdump utility.

Environment variables supported:

  • COMPlus_DbgEnableMiniDump: if set to "1", enables this core dump generation. The default is NOT to generate a dump.
  • COMPlus_DbgMiniDumpType: if set to "1" generates MiniDumpNormal, "2" MiniDumpWithPrivateReadWriteMemory, "3" MiniDumpFilterTriage, "4" MiniDumpWithFullMemory. Default is MiniDumpNormal.
  • COMPlus_DbgMiniDumpName: if set, use as the template to create the dump path and file name. The pid can be placed in the name with %d. The default is /tmp/coredump.%d.
  • COMPlus_CreateDumpDiagnostics: if set to "1", enables the createdump utility's diagnostic messages (TRACE macro).

(Please refer to the MSDN documentation on the MINIDUMP_TYPE enumeration for the meaning of the minidump values listed above.)
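
To illustrate how these variables could map onto the utility's command line options below, here is a hypothetical sketch; the actual runtime code may differ.

```cpp
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical mapping of the COMPlus_* environment variables onto createdump
// arguments; helper name and structure are illustrative only.
static std::vector<std::string> BuildCreateDumpArgs(int crashingPid)
{
    std::vector<std::string> args = { "createdump" };

    const char* name = getenv("COMPlus_DbgMiniDumpName");
    args.push_back("--name");
    args.push_back(name != nullptr ? name : "/tmp/coredump.%d");

    const char* type = getenv("COMPlus_DbgMiniDumpType");
    if (type != nullptr)
    {
        if (strcmp(type, "2") == 0)      args.push_back("--withheap");
        else if (strcmp(type, "3") == 0) args.push_back("--triage");
        else if (strcmp(type, "4") == 0) args.push_back("--full");
        else                             args.push_back("--normal");
    }

    const char* diag = getenv("COMPlus_CreateDumpDiagnostics");
    if (diag != nullptr && strcmp(diag, "1") == 0)
        args.push_back("--diag");

    args.push_back(std::to_string(crashingPid));
    return args;
}
```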

Utility command line options:

createdump [options] pid
-f, --name - dump path and file name. The pid can be placed in the name with %d. The default is "/tmp/coredump.%d"
-n, --normal - create minidump (default).
-h, --withheap - create minidump with heap.
-t, --triage - create triage minidump.
-u, --full - create full core dump.
-d, --diag - enable diagnostic messages.

Testing

The test plan is to modify the SOS tests in the (still) private debuggertests repo to trigger and use the core minidumps generated. Debugging managed core dumps on Linux is not supported by mdbg at this time (until we have an ELF core dump reader), so only the SOS tests (which use lldb on Linux) will be modified.

Open Issues

  • We may need more than just the pid for decorating dump names in docker containers, because the pid there is probably always 1.
  • Do we need all the memory mappings from /proc/$pid/maps in the PT_LOAD sections even though the memory is not actually in the dump? They have a file offset/size of 0. Full dumps generated by the system or gdb do have these un-backed regions.
  • There is no way for the createdump utility to get the signal number, etc. that caused the abort using ptrace or a /proc file; it would have to be passed from CoreCLR on the command line.
  • Do we need the "dynamic" sections of each shared module in the core dump? It is part of the "linkmap" entry enumerated when gathering the _DSO information.
  • There may be more versioning and/or build id information needed to be added to the dump.
  • It is unclear exactly in which stack overflow cases the signal handler does not get control and the OS just aborts the process.
