24-core CPU and I can’t type an email (part one)

I wasn’t looking for trouble. I wasn’t trying to build Chrome a thousand times in a weekend, I was just engaging in that most mundane of 21st century tasks: writing an email at 10:30 am. And suddenly gmail hung. I kept typing, but for several seconds no characters appeared on screen. Then, suddenly, gmail caught up and I resumed my very important email. Then it happened again, only this time gmail went unresponsive for even longer. Well that’s funny…

I have trouble resisting a good performance mystery but in this case the draw was particularly strong. I work on Chrome, on Windows, focused on performance. Investigating this hang was actually my job. And after a lot of false starts and some hard work I figured out how Chrome, gmail, Windows, and our IT department were working together to prevent me from typing an email, and in the process I found a way to save a significant amount of memory for some web pages in Chrome.

This investigation had so many rabbit holes that I’m going to save some of the digressions for a follow-on post, but this one will entirely explain the hangs.

As usual I had UIforETW running in the background, tracing to circular buffers, so I just had to type Ctrl+Win+R and the buffers, representing the last thirty seconds or so of system activity, were saved to disk. I loaded the trace into Windows Performance Analyzer (WPA), but I couldn’t definitively find the hang.

When a Windows program stops pumping messages there will be ETW events emitted to indicate exactly where this happened, so those types of hangs are trivial to find. But apparently Chrome kept on pumping messages. I looked for a time when one of Chrome’s crucial threads either went into a busy loop or went completely idle, but I couldn’t find anything conclusive. There were some spots where Chrome went mostly idle, but even then all of the key threads kept chugging along, so I couldn’t be sure where the hang was – Chrome might have been going idle because nothing was happening:

[screenshot]
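Aside: for anyone who hasn’t written Win32 code, “pumping messages” just means running the standard Win32 message loop. A minimal sketch (illustrative only, not Chrome’s actual message loop):

#include <windows.h>

// The canonical Win32 message pump. If the thread that owns a window stops
// running this loop for long enough (or stalls inside a message handler
// invoked by DispatchMessage) then the window stops responding to input.
void MessagePump() {
  MSG msg;
  while (GetMessage(&msg, nullptr, 0, 0) > 0) {
    TranslateMessage(&msg);  // Turns key-down messages into WM_CHAR messages.
    DispatchMessage(&msg);   // Calls the window procedure.
  }
}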

[screenshot]

UIforETW has a built-in key logger which is often useful when identifying key points in a trace. However, for obvious security reasons it defaults to anonymizing the keystrokes, treating every number typed as a ‘1’ and every letter as an ‘A’. This was making it hard to find the exact moment of the hang so I turned the input tracing from “Private” to “Full” and waited for the hang. At around 10:30 the next morning the hang happened again. I saved the trace buffers, and put these notes into UIforETW’s trace information field:

Was typing “defer to those with more scuba experience” and gmail hung around the end of “those” and then resumed partway through “experience”. gmail tab was PID 27368.

Clearly just a normal work-related discussion of commute methods, but the important thing is that now I had a way of locating the hang in the ETW trace. I loaded the trace, looked at the keyboard data in the Generic Events field (the events are emitted by UIforETW itself, each one is a purple diamond in the screen shot below) and suddenly I could see exactly where the hang was, and clearly the dip in CPU usage was 100% correlated:

[screenshot]

Okay, but why is Chrome stopping? Here’s some foreshadowing: not shown in these screenshots is that WmiPrvSE.exe was consuming an entire CPU hyperthread whenever this happened. But that shouldn’t matter. I have a 24-core/48-thread machine so consuming one hyperthread means my machine is still about 98% idle.

I then zoomed in to the clearly-significant period where Chrome was idle and looked in particular at CrRendererMain in chrome.exe (27368), the gmail tab process.

Aside: I just want to say thank you to 2015 me for asking Microsoft to improve their thread naming mechanisms, and thank you to Microsoft for doing everything I suggested – thread names in WPA are great!

The problem suddenly became clearer. During the 2.81 s hang this thread was scheduled and ran 440 separate times. Normally running every 6 ms is enough to make a program responsive, but for some reason it wasn’t making any progress. I noticed that every time it woke up it was on the same stack. Simplified down it was:

chrome_child.dll (stack base)
KernelBase.dll!VirtualAlloc
ntoskrnl.exe!MiCommitVadCfgBits
ntoskrnl.exe!MiPopulateCfgBitMap
ntoskrnl.exe!ExAcquirePushLockExclusiveEx
ntoskrnl.exe!KeWaitForSingleObject (stack leaf)

Chrome is calling VirtualAlloc, which tries to update some “CfgBits” and needs to acquire a lock. At first I assumed that Chrome was calling VirtualAlloc 440 times, which seemed like a weird thing to do, but the reality was perhaps weirder. Chrome called VirtualAlloc once and a lock needed to be acquired. Chrome was signaled when that lock was available but – for 439 times in a row – when Chrome woke up and tried to acquire the lock it was unavailable. The lock had been reacquired by the process that had just released it.

This is because Windows locks are, by design, not fair. If a thread releases a lock and then immediately tries to reacquire it then, in cases like this, it can reacquire it every single time. Starvation. More on this next time.
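For illustration, here is a toy program that can show the same starvation pattern using user-mode SRW locks, which (like the kernel push locks on the stacks above) make no fairness guarantees. This is just a sketch of the concept, not code from the actual investigation, and how badly the waiter starves will depend on scheduling:

#include <windows.h>
#include <stdio.h>

SRWLOCK lock = SRWLOCK_INIT;

DWORD WINAPI Waiter(void*) {
  ULONGLONG start = GetTickCount64();
  AcquireSRWLockExclusive(&lock);  // May wait a very long time...
  printf("Waited %llu ms for the lock.\n", GetTickCount64() - start);
  ReleaseSRWLockExclusive(&lock);
  return 0;
}

int main() {
  AcquireSRWLockExclusive(&lock);
  HANDLE thread = CreateThread(nullptr, 0, Waiter, nullptr, 0, nullptr);
  Sleep(10);  // Let the waiter block on the lock.
  // Release and immediately reacquire, over and over. The waiter may be
  // woken each time only to find that the lock has already been retaken.
  for (int i = 0; i < 10000000; ++i) {
    ReleaseSRWLockExclusive(&lock);
    AcquireSRWLockExclusive(&lock);
  }
  ReleaseSRWLockExclusive(&lock);
  WaitForSingleObject(thread, INFINITE);
  CloseHandle(thread);
  return 0;
}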

The Readying Process – the one that signaled Chrome to go grab the lock – in every case was WmiPrvSE.exe, releasing the lock on this stack:

ntoskrnl.exe!KiSystemServiceCopyEnd (stack base)
ntoskrnl.exe!NtQueryVirtualMemory
ntoskrnl.exe!MmQueryVirtualMemory
ntoskrnl.exe!MiUnlockAndDereferenceVad
ntoskrnl.exe!ExfTryToWakePushLock (stack leaf)

I got distracted by WMI for a while (more on that next time) but eventually I wrote a program to recreate WMI’s behavior. I had CPU sampling data that showed me where WmiPrvSE.exe was spending its time (some edits and simplifications made):

WmiPerfClass.dll!EnumSelectCounterObjects (stack base)
WmiPerfClass.dll!ConvertCounterPath
pdh.dll!PdhiTranslateCounter
pdh.dll!GetSystemPerfData
KernelBase.dll!blah-blah-blah
advapi32.dll!blah-blah-blah
perfproc.dll!blah-blah-blah
perfproc.dll!GetProcessVaData
ntdll.dll!NtQueryVirtualMemory
ntoskrnl.exe!NtQueryVirtualMemory
ntoskrnl.exe!MmQueryVirtualMemory
ntoskrnl.exe!MiQueryAddressSpan
ntoskrnl.exe!MiQueryAddressState
ntoskrnl.exe!MiGetNextPageTable (stack leaf)

It turns out that reproducing the slow scanning from the sampling data was quite easy. My VirtualScan program just called NtQueryVirtualMemory in the obvious loop to scan the address space of a specified process. The code worked, it took a really long time (10-15 seconds) to scan the gmail process, and it triggered the hang. But why?
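The heart of such a scanning loop looks roughly like this – a simplified sketch using the documented VirtualQueryEx wrapper (the real VirtualScan calls NtQueryVirtualMemory directly and gathers more statistics):

#include <windows.h>
#include <stdio.h>

void ScanProcess(DWORD pid) {
  HANDLE process = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid);
  if (!process)
    return;
  size_t num_blocks = 0;
  MEMORY_BASIC_INFORMATION info;
  // Walk the address space one contiguous range at a time. Each query
  // returns the next range whose attributes (state, protection) all match.
  for (char* address = nullptr;
       VirtualQueryEx(process, address, &info, sizeof(info)) == sizeof(info);
       address = static_cast<char*>(info.BaseAddress) + info.RegionSize) {
    if (info.State == MEM_COMMIT)
      ++num_blocks;
  }
  printf("%zu committed blocks.\n", num_blocks);
  CloseHandle(process);
}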

The advantage of writing the scanning loop myself was that I could start generating statistics. NtQueryVirtualMemory returns data on each contiguous range of address space that has matching attributes. That might mean all reserved, all committed with particular protection settings, etc. The gmail process had about 26,000 of these blocks, whereas another process (WPA, as it turns out) with 16,000 memory blocks could be scanned very quickly.

At some point I looked at the gmail process with vmmap and noticed that the gmail process had a fair amount of memory (361,836 KiB) and a lot of separate blocks (49,719) coming from a single reservation in the Shareable category – a reservation with a size of 2,147,483,648 KiB, aka 2 TiB. Wot?

image

I happened to know that the 2 TiB reservation is used for Control Flow Guard (CFG), and I remembered that “CFG” showed up on the call stacks where the gmail Chrome process was waiting – MiCommitVadCfgBits was on the stack. Maybe the large number of blocks in the CFG region were the problem!

Control Flow Guard (CFG) is used to try to stop exploits. Its 2 TiB reservation is a sparse bitmap which indicates which addresses (across 128 TiB of user-mode address space) are valid indirect branch targets – that works out to one bitmap byte for every 64 bytes of address space. So I modified my virtual memory scanner to count how many blocks were in the CFG region (just look for the 2 TiB reservation) and how many blocks were tagged as executable. Since CFG memory is used to describe executable memory I expected to see one block of CFG memory for each block of executable memory in the process. Instead I saw 98 blocks of executable memory and 24,866 blocks of committed CFG memory. A significant discrepancy:

        Scan time   Committed  Page tables  Committed blocks
Total:   41.763 s  1457.7 MiB     67.7 MiB  32,112 (98 code blocks)
CFG:     41.759 s   353.3 MiB     59.2 MiB  24,866

vmmap shows reservations and committed memory as blocks whereas my scanning tool only counts committed blocks – that’s why vmmap shows 49,684 blocks where my scanning tool reports 24,866.
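A hypothetical sketch of that counting logic, extending the scanning loop above and assuming (as was true on my machine) that the CFG bitmap is the only reservation whose pieces add up to exactly 2 TiB:

#include <windows.h>
#include <stdio.h>
#include <map>

void CountCfgAndCodeBlocks(HANDLE process) {
  const ULONGLONG kCfgSize = 2ull << 40;  // 2 TiB.
  std::map<void*, ULONGLONG> reservations;  // AllocationBase -> total span.
  MEMORY_BASIC_INFORMATION info;

  // Pass 1: add up the size of each reservation.
  for (char* addr = nullptr;
       VirtualQueryEx(process, addr, &info, sizeof(info)) == sizeof(info);
       addr = static_cast<char*>(info.BaseAddress) + info.RegionSize) {
    if (info.State != MEM_FREE)
      reservations[info.AllocationBase] += info.RegionSize;
  }

  // The 2 TiB reservation is presumed to be the CFG bitmap.
  void* cfg_base = nullptr;
  for (const auto& r : reservations)
    if (r.second == kCfgSize)
      cfg_base = r.first;

  // Pass 2: count committed CFG blocks and executable blocks.
  size_t cfg_blocks = 0, code_blocks = 0;
  for (char* addr = nullptr;
       VirtualQueryEx(process, addr, &info, sizeof(info)) == sizeof(info);
       addr = static_cast<char*>(info.BaseAddress) + info.RegionSize) {
    if (info.State != MEM_COMMIT)
      continue;
    if (info.AllocationBase == cfg_base)
      ++cfg_blocks;
    if (info.Protect & (PAGE_EXECUTE | PAGE_EXECUTE_READ |
                        PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY))
      ++code_blocks;
  }
  printf("%zu CFG blocks, %zu executable blocks.\n", cfg_blocks, code_blocks);
}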

In hindsight it is obvious, but what if the CFG bitmap never gets cleaned up? What if memory gets allocated in the CFG bitmap when you allocate executable memory, but doesn’t get freed when you free the executable memory? That would explain this behavior.

End-to-end repro

The next step was to write a VAllocStress program that allocates and then frees thousands of blocks of executable memory at random addresses. This executable needs to have CFG enabled and needs to be 64-bit, and I hit some roadblocks because I didn’t really know what I was trying to do, but eventually it worked. After allocating and freeing those blocks of executable memory the program sits in an infinite loop where it occasionally tries to allocate/free more executable memory, noting when this is slow. Here is my basic algorithm for VAllocStress (a code sketch follows the list):

  1. Spin in a loop a lot of times:
    1. Allocate some executable memory with VirtualAlloc at a random address
    2. Free the memory
  2. Then, spin in an infinite loop:
    1. Sleep for 500 ms (don’t want to be a CPU hog)
    2. Allocate some executable memory with VirtualAlloc at a fixed location
    3. Print a message if the VirtualAlloc call takes more than ~500 ms
    4. Free the memory
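A minimal sketch of that algorithm – not the exact VAllocStress from GitHub; the constants and the random-address scheme here are illustrative:

#include <windows.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  const int kNumAllocs = 20000;  // Arbitrary illustrative values.
  const size_t kBlockSize = 4096;

  // Step 1: fragment the CFG bitmap by allocating and freeing executable
  // memory at pseudo-random addresses spread across the 47-bit address space.
  for (int i = 0; i < kNumAllocs; ++i) {
    // rand() only gives 15 bits, so paste two calls together to get a
    // random address below 128 TiB, aligned to allocation granularity.
    uintptr_t address = (((uintptr_t)rand() << 32) |
                         ((uintptr_t)rand() << 17)) & 0x00007FFFFFFF0000;
    void* p = VirtualAlloc((void*)address, kBlockSize,
                           MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    if (p)
      VirtualFree(p, 0, MEM_RELEASE);
  }

  // Step 2: loop forever, timing occasional executable allocations.
  for (;;) {
    Sleep(500);  // Don't be a CPU hog.
    ULONGLONG start = GetTickCount64();
    void* p = VirtualAlloc(nullptr, kBlockSize, MEM_RESERVE | MEM_COMMIT,
                           PAGE_EXECUTE_READWRITE);
    ULONGLONG elapsed = GetTickCount64() - start;
    if (elapsed > 500)
      printf("VirtualAlloc took %llu ms!\n", elapsed);
    if (p)
      VirtualFree(p, 0, MEM_RELEASE);
  }
}

Remember that for this to reproduce the problem the binary must be 64-bit and compiled and linked with CFG enabled (/guard:cf in MSVC).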

That’s it. It’s super simple. And it was so satisfying when it worked. I ran it and then got my VirtualScan program to scan my VAllocStress process. I quickly got confirmation that the CFG block was horribly fragmented, and the scanning took a long time. And my VAllocStress program hung for the duration of the scan!

At this point I had simulated the CFG fragmentation, the long scanning times, and the hang. Huzzah!

Root cause

It turns out that v8 (Chrome’s JavaScript engine) has CodeRange objects for managing code-gen, and each CodeRange object is constrained to a 128 MiB address range. This should be small enough to avoid the rampant CFG allocation I was seeing – a 128 MiB code range maps to at most about 2 MiB of the CFG bitmap.

But what if you have multiple CodeRange objects, and what if those get allocated at random addresses and then freed? I instrumented the CodeRange constructor, left gmail running, and found the smoking gun. Every couple of minutes another CodeRange object was created (and destroyed). Being in the debugger made it easy to find that WorkerThread::Start was allocating these objects, and suddenly everything made sense:

  1. gmail uses service workers, probably for implementing offline mode
  2. These appear and disappear every few minutes because that’s how service workers behave
  3. Each worker thread gets a temporary CodeRange object which allocates some executable pages for JITted JavaScript code from a random location in the 47-bit process address space
  4. The 2 TiB CFG memory reservation gets some entries every time new code pages are allocated
  5. The CFG allocations are never freed
  6. NtQueryVirtualMemory is painfully slow when scanning CFG memory (about 1 ms per block) for reasons that I don’t understand

The slow scanning of CFG memory is actually fixed in Windows 10 RS4 (April 2018 Update), which briefly had me wondering if all of this investigation was pointless. But it’s not.

Memory

The CFG reservation starts out as just an address reservation – no memory is actually committed. As executable pages are allocated, however, parts of the CFG reservation are turned into committed memory, using real memory pages. And those pages never get freed. Since each committed 4 KiB page of the bitmap only describes 256 KiB of address space, executable allocations scattered across the huge address space each tend to commit their own bitmap pages. If you keep allocating and freeing randomly located blocks of executable memory then the CFG region will grow arbitrarily large! Well, that’s not quite true. The CFG memory block is best thought of as a cache with bounded size. However that is little consolation when the bounded size is 2 TiB, per-process!

The worst I’ve seen so far is when my gmail tab had been running for eight days and had accumulated 353.3 MiB of CFG memory, and 59.2 MiB of page-tables to map that memory, for a total waste of about 400 MiB. For some reason most people see either no symptoms or much milder symptoms than I do.

Resolution

The v8 (Chrome’s JavaScript engine) team has updated their code so that it reuses addresses for CodeRange objects, which will squish this bug quite nicely. Microsoft already fixed their CFG memory scanning performance problems. Maybe some day Microsoft will free CFG regions when executable memory is freed, at least in the easy cases where large address ranges are freed. And a vmmap bug was reported.

Most of my coworkers and our customers never noticed this problem. I only noticed it because:

  • I had gmail’s offline mode enabled
  • I was running on an old version of Windows 10
  • Our IT department was running regular WMI scans of our computers
  • I was paying attention
  • I got lucky

It seems highly improbable that, as one of the most qualified people to diagnose this bug, I was the first to notice it. If any of the factors above had not been true then I would not have experienced this hang and the associated memory waste might never have been found.

By the way, the reason that the hangs kept happening at 10:30 am is just because that’s when our IT team ran their inventory scans. If you want to trigger a scan manually, you can go to Control Panel | Configuration Manager | Actions | Select ‘Hardware Inventory Cycle’ and then Run Now.

[screenshot]

Source code

If you want to play along at home you can grab the source code to VAllocStress and VirtualScan from GitHub, here.

Rabbit holes

This investigation included a lot of rabbit holes. Some were important (page tables), some were instructive (locks), and some were purely a waste of time (WMI details). I’ll talk more about these, and a vmmap bug, in a follow-up post. If you want more gory details – some of which ended up being wrong or irrelevant – you can follow the whole investigation at crbug.com/870054.

Hacker News discussion is here.
