Intro

I came across an interesting tweet and analyzed it.
It describes how to check whether a particular function has finished Maglev compilation without using any debug functions.
In short, the maximum number of recursive calls a function can make may differ between its interpreted version and its Maglev-compiled version, and this difference can be used to tell the two apart.

Stack Frame Overflow

./jit_test.js:3: RangeError: Maximum call stack size exceeded
    return 1 + getStackDepth();

This method relies on the RangeError thrown on call stack overflow, as shown above.

function getStackDepth() {
  try {
    return 1 + getStackDepth();
  } catch (e) {                         // [1]
    return 1;
  }
}

function optimizeFunction(fn) {
  for (let i=0; i<10000; i++) {
    fn();
  }
}

function target() {
  x = 1.1 + 2.2;
  return getStackDepth();
}

optimizeFunction(getStackDepth);

const initialDepth = target();          // [2]
console.log(`Initial Stack Depth: ${initialDepth}`);

for(let i=0; i<412; i++) {
  // 408 => marking MAGLEV
  // 409 => compiling MAGLEV
  // 410 => completed compiling
  // 411 => execute compiled code
  if (i % 100 == 0 || i > 405) {        // log around the interesting range
    console.log("===> "+i, target());   // [3]
  }
  else target();
}

const optimizedDepth = target();        // [4]
console.log(`Optimized Stack Depth: ${optimizedDepth}`);

At [1], an exception is thrown once the maximum call stack size is reached during the recursion; the deepest successful call catches it and returns 1, so the total depth is accumulated as the recursion unwinds.
[2] records the recursion depth while the target function still runs as bytecode.
At [3], the target function is called repeatedly until, at some point, it gets compiled with Maglev. In my environment, optimization preparation starts when i is 408 and the compiled code is executed when i is 411; the exact timing may vary depending on the environment.
[4] records the recursion depth of the Maglev-compiled target function.

execution result:

Initial Stack Depth: 17991
===> 0 17991
===> 100 17991
===> 200 17991
===> 300 17991
===> 400 17991
===> 406 17991
===> 407 17991
===> 408 17991
===> 409 17991
===> 410 17991
===> 411 17992
Optimized Stack Depth: 17992

Add --trace-opt:

Initial Stack Depth: 17991
===> 0 17991
===> 100 17991
===> 200 17991
===> 300 17991
===> 400 17991
===> 406 17991
===> 407 17991
[marking 0x15ed0019ab0d <JSFunction target (sfi = 0x15ed0019a99d)> for optimization to MAGLEV, ConcurrencyMode::kConcurrent, reason: hot and stable]
===> 408 17991
[compiling method 0x15ed0019ab0d <JSFunction target (sfi = 0x15ed0019a99d)> (target MAGLEV), mode: ConcurrencyMode::kConcurrent]
===> 409 17991
[completed compiling 0x15ed0019ab0d <JSFunction target (sfi = 0x15ed0019a99d)> (target MAGLEV) - took 0.000, 0.593, 0.008 ms]
===> 410 17991
===> 411 17992
Optimized Stack Depth: 17992

The depth changes exactly when the compiled code starts executing, right after Maglev compilation completes.
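
This behavior can be wrapped into a simple detection primitive. Below is a minimal sketch (the helper name and the iteration count are made up for illustration, the threshold depends on the environment, and further tiering such as TurboFan could change the depth again):

// Hypothetical helper: guesses whether `fn` now runs as Maglev code,
// based purely on a change in the maximum recursion depth.
function isLikelyMaglevCompiled(fn, baselineDepth) {
  return fn() !== baselineDepth;
}

const baseline = target();   // depth while target still runs as bytecode
for (let i = 0; i < 1000; i++) {
  target();                  // keep the function hot
  if (isLikelyMaglevCompiled(target, baseline)) {
    console.log("Maglev detected after " + i + " extra calls");
    break;
  }
}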

Analysis

Stack frame overflow

Below is the interpreter's processing flow for a function call.
Refer to it if you want to analyze this yourself; otherwise, feel free to skip the code.

src/execution/execution.cc

V8_WARN_UNUSED_RESULT MaybeHandle<Object> Invoke(Isolate* isolate,
                                                 const InvokeParams& params) {
  RCS_SCOPE(isolate, RuntimeCallCounterId::kInvoke);
  DCHECK(!IsJSGlobalObject(*params.receiver));
  DCHECK_LE(params.argc, FixedArray::kMaxLength);
  DCHECK(!isolate->has_exception());
  ...
  // Placeholder for return value.
  Tagged<Object> value;
  Handle<Code> code =
      JSEntry(isolate, params.execution_target, params.is_construct);
  {
    // Save and restore context around invocation and block the
    // allocation of handles without explicit handle scopes.
    SaveContext save(isolate);
    SealHandleScope shs(isolate);

    if (v8_flags.clear_exceptions_on_js_entry) isolate->clear_exception();

    if (params.execution_target == Execution::Target::kCallable) {
      // clang-format off
      // {new_target}, {target}, {receiver}, return value: tagged pointers
      // {argv}: pointer to array of tagged pointers
      using JSEntryFunction = GeneratedCode<Address(
          Address root_register_value, Address new_target, Address target,
          Address receiver, intptr_t argc, Address** argv)>;
      // clang-format on
      JSEntryFunction stub_entry =
          JSEntryFunction::FromAddress(isolate, code->instruction_start());

      Address orig_func = (*params.new_target).ptr();
      Address func = (*params.target).ptr();
      Address recv = (*params.receiver).ptr();
      Address** argv = reinterpret_cast<Address**>(params.argv);
      RCS_SCOPE(isolate, RuntimeCallCounterId::kJS_Execution);
      value = Tagged<Object>(
          stub_entry.Call(isolate->isolate_data()->isolate_root(), orig_func,
                          func, recv, JSParameterCount(params.argc), argv));
	...

JSEntry looks up the builtin code used as the entry point for the call:

Handle<Code> JSEntry(Isolate* isolate, Execution::Target execution_target,
                     bool is_construct) {
  if (is_construct) {
    DCHECK_EQ(Execution::Target::kCallable, execution_target);
    return BUILTIN_CODE(isolate, JSConstructEntry);
  } else if (execution_target == Execution::Target::kCallable) {
    DCHECK(!is_construct);
    return BUILTIN_CODE(isolate, JSEntry);
  } else if (execution_target == Execution::Target::kRunMicrotasks) {
    DCHECK(!is_construct);
    return BUILTIN_CODE(isolate, JSRunMicrotasksEntry);
  }
  UNREACHABLE();
}

src/builtins/x64/builtins-x64.cc

void Builtins::Generate_JSEntry(MacroAssembler* masm) {
  Generate_JSEntryVariant(masm, StackFrame::ENTRY, Builtin::kJSEntryTrampoline);
}
void Builtins::Generate_JSEntryTrampoline(MacroAssembler* masm) {
  Generate_JSEntryTrampolineHelper(masm, false);
}
static void Generate_JSEntryTrampolineHelper(MacroAssembler* masm,
                                             bool is_construct) {
  // Expects six C++ function parameters.
  // - Address root_register_value
  // - Address new_target (tagged Object pointer)
  // - Address function (tagged JSFunction pointer)
  // - Address receiver (tagged Object pointer)
  // - intptr_t argc
  // - Address** argv (pointer to array of tagged Object pointers)
  // (see Handle::Invoke in execution.cc).

  // Open a C++ scope for the FrameScope.
  {
    // Platform specific argument handling. After this, the stack contains
    // an internal frame and the pushed function and receiver, and
    // register rax and rbx holds the argument count and argument array,
    // while rdi holds the function pointer, rsi the context, and rdx the
    // new.target.

    // MSVC parameters in:
    // rcx        : root_register_value
    // rdx        : new_target
    // r8         : function
    // r9         : receiver
    // [rsp+0x20] : argc
    // [rsp+0x28] : argv
    //
    // GCC parameters in:
    // rdi : root_register_value
    // rsi : new_target
    // rdx : function
    // rcx : receiver
    // r8  : argc
    // r9  : argv

    __ movq(rdi, kCArgRegs[2]);
    __ Move(rdx, kCArgRegs[1]);
    // rdi : function
    // rdx : new_target
  ...
    // Invoke the builtin code.
    Builtin builtin = is_construct ? Builtin::kConstruct : Builtins::Call();
    __ CallBuiltin(builtin);
    ...
void Builtins::Generate_Call_ReceiverIsAny(MacroAssembler* masm) {
  Generate_Call(masm, ConvertReceiverMode::kAny);
}
void Builtins::Generate_CallFunction_ReceiverIsAny(MacroAssembler* masm) {
  Generate_CallFunction(masm, ConvertReceiverMode::kAny);
}
// static
void Builtins::Generate_CallFunction(MacroAssembler* masm,
                                     ConvertReceiverMode mode) {
  // ----------- S t a t e -------------
  //  -- rax : the number of arguments
  //  -- rdi : the function to call (checked to be a JSFunction)
  // -----------------------------------
  ...
  __ movzxwq(
      rbx, FieldOperand(rdx, SharedFunctionInfo::kFormalParameterCountOffset));
  __ InvokeFunctionCode(rdi, no_reg, rbx, rax, InvokeType::kJump);
}

src/codegen/x64/macro-assembler-x64.cc

void MacroAssembler::InvokeFunctionCode(Register function, Register new_target,
                                        Register expected_parameter_count,
                                        Register actual_parameter_count,
                                        InvokeType type) {
  ...
  switch (type) {
    case InvokeType::kCall:
      CallJSFunction(function);
      break;
    case InvokeType::kJump:
      JumpJSFunction(function);
      break;
  }
  ...  
void MacroAssembler::CallJSFunction(Register function_object) {
  static_assert(kJavaScriptCallCodeStartRegister == rcx, "ABI mismatch");
#ifdef V8_ENABLE_SANDBOX
  // When the sandbox is enabled, we can directly fetch the entrypoint pointer
  // from the code pointer table instead of going through the Code object. In
  // this way, we avoid one memory load on this code path.
  LoadCodeEntrypointViaCodePointer(
      rcx, FieldOperand(function_object, JSFunction::kCodeOffset));
  call(rcx);
#else
  LoadTaggedField(rcx, FieldOperand(function_object, JSFunction::kCodeOffset));
  CallCodeObject(rcx);
#endif
}

For a not-yet-compiled function, the code loaded from function_object is the InterpreterEntryTrampoline, which is what gets executed.

src/builtins/x64/builtins-x64.cc

void Builtins::Generate_InterpreterEntryTrampoline(
    MacroAssembler* masm, InterpreterEntryTrampolineMode mode) {
  Register closure = rdi;
  ...
  // Allocate the local and temporary register file on the stack.
  Label stack_overflow;
  {
    // Load frame size from the BytecodeArray object.
    __ movl(rcx, FieldOperand(kInterpreterBytecodeArrayRegister,
                              BytecodeArray::kFrameSizeOffset));

    // Do a stack check to ensure we don't go over the limit.
    __ movq(rax, rsp);
    __ subq(rax, rcx);
    __ cmpq(rax, __ StackLimitAsOperand(StackLimitKind::kRealStackLimit));
    __ j(below, &stack_overflow);

    // If ok, push undefined as the initial value for all register file entries.
    Label loop_header;
    Label loop_check;
    __ LoadRoot(kInterpreterAccumulatorRegister, RootIndex::kUndefinedValue);
    __ jmp(&loop_check, Label::kNear);
    __ bind(&loop_header);
    // TODO(rmcilroy): Consider doing more than one push per loop iteration.
    __ Push(kInterpreterAccumulatorRegister);
    // Continue loop if not done.
    __ bind(&loop_check);
    __ subq(rcx, Immediate(kSystemPointerSize));
    __ j(greater_equal, &loop_header, Label::kNear);
  }
  ...
  __ bind(&stack_overflow);
  __ CallRuntime(Runtime::kThrowStackOverflow);
  __ int3();  // Should not return.
}

These are the instructions in the code above that perform the stack size check:

  • movl(rcx, FieldOperand(kInterpreterBytecodeArrayRegister, BytecodeArray::kFrameSizeOffset));
  • movq(rax, rsp);
  • subq(rax, rcx);
  • cmpq(rax, __ StackLimitAsOperand(StackLimitKind::kRealStackLimit));
  • j(below, &stack_overflow);

This computes what the stack pointer would be after reserving rcx bytes for the frame (subq), and if that value falls below the real stack limit, execution branches to stack_overflow.
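
For illustration, the check can be modeled roughly like this (a sketch, not V8 code; the names are made up):

// Rough model of the InterpreterEntryTrampoline stack check (not actual V8 code).
function interpreterStackCheck(rsp, frameSize, realJsLimit) {
  const prospectiveSp = rsp - frameSize;    // movq rax, rsp; subq rax, rcx
  if (prospectiveSp < realJsLimit) {        // cmpq rax, [real_jslimit]; j(below, ...)
    throw new RangeError("Maximum call stack size exceeded");  // Runtime::kThrowStackOverflow
  }
  // otherwise the register file is filled with undefined and execution continues
}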

Stack Limit

Let's look at how the stack limit value is determined.

src/codegen/x64/macro-assembler-x64.cc

Operand MacroAssembler::StackLimitAsOperand(StackLimitKind kind) {
  DCHECK(root_array_available());
  intptr_t offset = kind == StackLimitKind::kRealStackLimit
                        ? IsolateData::real_jslimit_offset()
                        : IsolateData::jslimit_offset();

  CHECK(is_int32(offset));
  return Operand(kRootRegister, static_cast<int32_t>(offset));
}

src/execution/isolate-data.h

  static constexpr int real_jslimit_offset() {
    return stack_guard_offset() + StackGuard::real_jslimit_offset();
  }
#define V(Offset, Size, Name) \
  static constexpr int Name##_offset() { return Offset - kIsolateRootBias; }
  ISOLATE_DATA_FIELDS(V)
#undef V

src/execution/stack-guard.h

  static constexpr int real_jslimit_offset() {
    return offsetof(StackGuard, thread_local_) +
           offsetof(ThreadLocal, real_jslimit_);
  }

In other words, it returns an operand for ThreadLocal::real_jslimit_ inside the StackGuard embedded in IsolateData, addressed as an offset from kRootRegister.

void StackGuard::ThreadLocal::Initialize(Isolate* isolate,
                                         const ExecutionAccess& lock) {
  const uintptr_t kLimitSize = v8_flags.stack_size * KB;
  DCHECK_GT(GetCurrentStackPosition(), kLimitSize);
  uintptr_t limit = GetCurrentStackPosition() - kLimitSize;
  real_jslimit_ = SimulatorStack::JsLimitFromCLimit(isolate, limit);
  set_jslimit(SimulatorStack::JsLimitFromCLimit(isolate, limit));
  real_climit_ = limit;
  set_climit(limit);
  interrupt_scopes_ = nullptr;
  interrupt_flags_ = 0;
}

That is, the limit is placed kLimitSize bytes below the current stack position, so that much stack is available for JS execution.

src/common/globals.h

#define V8_DEFAULT_STACK_SIZE_KB 984

kLimitSize = V8_DEFAULT_STACK_SIZE_KB * KB = 0x000f6000

Each getStackDepth call uses a stack frame of 0x38 bytes. A rough calculation that also accounts for the extra stack consumed by the frames beneath the recursion (such as the trampoline that calls getStackDepth) gives (0x000f6000 - (0x41~0x78)) / 0x38 = 17991.
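
The numbers line up (a quick check; the overhead of the surrounding frames is only an estimate):

// kLimitSize and the rough depth estimate from above.
const kLimitSize = 984 * 1024;   // V8_DEFAULT_STACK_SIZE_KB * KB = 0xf6000
const perCall    = 0x38;         // stack bytes per getStackDepth call
const overhead   = 0x78;         // estimated space used below the recursion
console.log(Math.floor((kLimitSize - overhead) / perCall));   // 17991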

Maglev

stack limit check before Maglev:

(lldb) c
Process 15726 resuming
Process 15726 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001611c43df
->  0x1611c43df: cmpq   -0x60(%r13), %rsp
    0x1611c43e3: jbe    0x1611c447f
    0x1611c43e9: movabsq $0x51d0019aad1, %rdi      ; imm = 0x51D0019AAD1
    0x1611c43f3: movl   0x13(%rdi), %esi
Target 0: (d8) stopped.
(lldb) x/xg $r13-0x60
0x7fad30058020: 0x0000000306e25290  ; Stack Limit
(lldb) p/x $rsp
(unsigned long) 0x0000000306f1b218  ; Stack Pointer
(lldb) p/x 0x0000000306f1b218-0x0000000306e25290
(long) 0x00000000000f5f88

stack limit check after Maglev:

(lldb) c
Process 15726 resuming
Process 15726 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
    frame #0: 0x00000001611c43df
->  0x1611c43df: cmpq   -0x60(%r13), %rsp
    0x1611c43e3: jbe    0x1611c447f
    0x1611c43e9: movabsq $0x51d0019aad1, %rdi      ; imm = 0x51D0019AAD1
    0x1611c43f3: movl   0x13(%rdi), %esi
Target 0: (d8) stopped.
(lldb) x/xg $r13-0x60
0x7fad30058020: 0x0000000306e25290  ; Stack Limit
(lldb) p/x $rsp
(unsigned long) 0x0000000306f1b238  ; Stack Pointer

(lldb) p/x 0x0000000306f1b238-0x0000000306e25290
(long) 0x00000000000f5fa8

I set a breakpoint on the stack-overflow check, and when target calls getStackDepth and the stack limit is compared, the stack pointer at the first recursive call differs before and after Maglev compilation.
After Maglev compilation, the stack consumed by the target function's frame is 0x20 bytes smaller.

Where does this reduction come from?

backtrace before Maglev:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 17.1
  * frame #0: 0x00000001611c43c0 ; getStackDepth
    frame #1: 0x00000001611c4426 ; getStackDepth
    frame #2: 0x00000001611c4426 ; getStackDepth
    frame #3: 0x00000001011e04aa d8`Builtins_InterpreterEntryTrampoline + 298
    frame #4: 0x00000001011e04aa d8`Builtins_InterpreterEntryTrampoline + 298
    frame #5: 0x00000001011ddf1c d8`Builtins_JSEntryTrampoline + 92
    frame #6: 0x00000001011ddc47 d8`Builtins_JSEntry + 135

; frame #3
(lldb) p/x $rsp
(unsigned long) 0x0000000306f1b208
(lldb) p/x $rbp
(unsigned long) 0x0000000306f1b248

backtrace after Maglev:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 16.1
  * frame #0: 0x00000001611c43c0 ; getStackDepth
    frame #1: 0x00000001611c4426 ; getStackDepth
    frame #2: 0x00000001611c4426 ; getStackDepth
    frame #3: 0x00000001611c4bbe ; target
    frame #4: 0x00000001011e04aa d8`Builtins_InterpreterEntryTrampoline + 298
    frame #5: 0x00000001011ddf1c d8`Builtins_JSEntryTrampoline + 92
    frame #6: 0x00000001011ddc47 d8`Builtins_JSEntry + 135

; frame #3
(lldb) p/x $rsp
(unsigned long) 0x0000000306f1b228
(lldb) p/x $rbp
(unsigned long) 0x0000000306f1b248

Before Maglev compilation, Builtins_InterpreterEntryTrampoline appears twice on the stack: target still runs as bytecode through the trampoline, and from there the already-compiled getStackDepth is called.
After Maglev compilation, both target and getStackDepth run as compiled code, so only one Builtins_InterpreterEntryTrampoline frame remains.
In other words, the Builtins_InterpreterEntryTrampoline frame uses 0x40 bytes while the compiled target function's frame uses only 0x20 bytes, so the compiled version has 0x20 more bytes of headroom before hitting the stack limit.
This difference lets the recursion go one call deeper, which is why target returns different results before and after Maglev compilation.
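
As a sanity check, the headroom values observed in the lldb sessions above differ by exactly that amount (the addresses are specific to this run):

// Headroom above the real stack limit at the first getStackDepth call,
// using the addresses from the lldb sessions above (run-specific values).
const headroomBefore = 0x306f1b218 - 0x306e25290;   // 0xf5f88
const headroomAfter  = 0x306f1b238 - 0x306e25290;   // 0xf5fa8
console.log((headroomAfter - headroomBefore).toString(16));   // "20": 0x20 extra bytes,
// which in this run was enough for one more recursive call before the limit is hit.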