How does macOS manage so many cores in the M1 CPU?
Editor's note: This article was originally published on Dr. Howard Oakley's personal blog and has been translated, annotated, and posted with the author's permission. It has been lightly edited to add detail and to make it easier for general readers to follow.
On November 11, 2020, Apple launched the M1 chip in Cupertino, California. It was not only Apple's first ARM-based processor for personal computers, but also a powerful chip that excited even long-jaded enthusiasts.
But impressive results always rest on unseen work behind the scenes. For a chip designed and optimized specifically for the Mac, how exactly does macOS schedule programs on an M-series processor?
Asymmetric processor architecture
In previous Intel-based Mac models, all of the processor's cores were identical, making it a symmetric multiprocessing (SMP) architecture. The system's job was therefore simple: keep the load on each core roughly equal.
Open the CPU History window in Activity Monitor on an Intel Mac and you will notice that the chart is divided into two columns: the odd-numbered cores on the left are the real physical cores, while those on the right are the virtual cores provided by Intel Hyper-Threading. Under high load, the system spreads work evenly across all cores, while under lighter load it places work primarily on the real physical cores.
In total, four chips were released in the M1 series between 2020 and 2022, namely:
- M1 (2020)
- M1 Pro and M1 Max (2021)
- M1 Ultra (2022)
From the output of the command-line tool powermetrics, we can tell that the E cores have a maximum frequency of 2064 MHz, while the P cores reach 3204 MHz on the M1 and up to 3228 MHz on the M1 Pro/Max/Ultra. If the system simply kept scheduling threads as before, it would not only waste much of the P cores' frequency range, but also make programs running on the E cores significantly slower.
In addition, the M1 and the M1 Pro/Max/Ultra use quite different combinations of E and P cores, and each processor is available with different core counts, which makes the scheduling logic look cumbersome at first glance:
- M1 consists of one E cluster (containing 4 E cores) and one P cluster (containing 4 P cores) named E and P0, respectively
- M1 Pro/Max consists of one E cluster (containing 2 E cores) and two P clusters (each containing 4 P cores), named E, P0, and P1
- M1 Ultra consists of one E cluster (containing 4 E cores) and four P clusters (each containing 4 P cores), named E, P0, P1, P2, and P3
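If you want to check the core layout of the Mac in front of you, macOS exposes per-performance-level core counts through sysctl (the hw.nperflevels and hw.perflevelN keys, present on Apple silicon). The sketch below is only a quick illustration of reading them from Swift, and it reports counts per performance level rather than per cluster.

```swift
import Darwin

// Read an integer sysctl value by name; returns nil if the key is unavailable.
func sysctlInt(_ name: String) -> Int? {
    var value: CInt = 0
    var size = MemoryLayout<CInt>.size
    guard sysctlbyname(name, &value, &size, nil, 0) == 0 else { return nil }
    return Int(value)
}

// hw.nperflevels is 2 on Apple silicon: perflevel0 is the highest-performance
// level (P cores) and the last level is the most efficient (E cores).
let levels = sysctlInt("hw.nperflevels") ?? 1
for level in 0..<levels {
    let physical = sysctlInt("hw.perflevel\(level).physicalcpu") ?? 0
    let logical = sysctlInt("hw.perflevel\(level).logicalcpu") ?? 0
    print("perflevel\(level): \(physical) physical / \(logical) logical cores")
}
```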
Consider an extreme example of how thread control works, such as Logic Pro importing a large amount of material.
In actual application development, macOS does not provide a public API that lets an application run on specific cores, core types, or clusters directly. Instead, threads are typically managed through Grand Central Dispatch using Quality of Service (QoS) classes, and macOS then uses those settings to decide how to schedule each thread.
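As a rough sketch of what that looks like in code (the work items here are only placeholders), an app submits tasks to a global queue with a QoS class and lets macOS decide where they run. Numerically, .background corresponds to the "QoS 9" referred to later in this article, and .userInitiated to 25.

```swift
import Dispatch
import Foundation

// Lowest commonly used QoS class; its raw value is 9 (QOS_CLASS_BACKGROUND),
// the "QoS 9" referred to later in this article.
DispatchQueue.global(qos: .background).async {
    print("background work, e.g. indexing or backups")
}

// A higher QoS class (raw value 25, QOS_CLASS_USER_INITIATED) for work the
// user is actively waiting on.
DispatchQueue.global(qos: .userInitiated).async {
    print("user-initiated work, e.g. opening a document")
}

// Keep this command-line sketch alive long enough for the blocks to run.
Thread.sleep(forTimeInterval: 1)
```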
In practice, threads with the lowest QoS are only dispatched to the E cluster, while threads with a higher QoS may be dispatched to either E or P clusters. Although this behavior can be changed dynamically with the command-line tool taskpolicy, or with the setpriority() function in code, that only applies to higher-QoS threads; the rule that the lowest-QoS threads run only on E clusters remains unchanged.
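As a minimal sketch of the in-code route (the constants below are local copies of values from <sys/resource.h>, so treat the details as an assumption rather than a documented API), a process can move itself into and out of the background band with setpriority(), much as taskpolicy -b / -B does from the command line:

```swift
import Darwin

// Values mirrored from <sys/resource.h>: PRIO_DARWIN_PROCESS and PRIO_DARWIN_BG.
let prioDarwinProcess: Int32 = 4        // 'which' selector: act on a whole process
let prioDarwinBG: Int32 = 0x1000        // background band ("E cluster only")

// Demote the calling process (who = 0) to the background band, roughly what
// `taskpolicy -b -p <pid>` does for a running process.
if setpriority(prioDarwinProcess, 0, prioDarwinBG) != 0 {
    perror("setpriority(background)")
}

// ... work done here is confined to the E cluster ...

// Restore normal scheduling, roughly equivalent to `taskpolicy -B -p <pid>`.
if setpriority(prioDarwinProcess, 0, 0) != 0 {
    perror("setpriority(restore)")
}
```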
Background threads
Because the E clusters on the M1 and the M1 Pro/Max differ in size, with 4 E cores in the former but only 2 in the latter, lowest-QoS threads are loaded and run differently on the two designs.
When running threads with a QoS of 9 on an M1 with 4 E cores, each E core runs at around 1000 MHz (1 GHz). On an M1 Pro/Max with only 2 E cores, the E cores also run at 1000 MHz if there is only one such thread, but with two or more the frequency of each E core rises to 2064 MHz. This design ensures that the E cluster in the M1 Pro/Max delivers at least the same background-task performance as the M1, despite the difference in cluster size.
Of course, there are still exceptions: threads such as those of backupd, which have the lowest QoS and are also subject to I/O throttling, always run at around 1000 MHz, even on the M1 Pro/Max.
User-initiated threads
All threads with a QoS higher than 9 are handled in a similar way; the only difference between them is their priority. Higher-QoS threads are eligible to run on any core or cluster, though they are handled differently on the M1 and the M1 Pro/Max.
On the M1, there is only one P cluster and one E cluster, giving a total of 8 physical cores, so at most 8 threads can be assigned to these two clusters at any one time, 4 per cluster. If 4 or fewer threads need to be assigned at the same time, the system tries to run them on the P cluster, unless more threads of higher QoS are waiting in the queue, in which case the E cluster is also used to run such tasks. In this case, the maximum frequency of the P cores is about 3 GHz and that of the E cores about 2 GHz, twice the frequency used when running threads with a QoS of 9.
The M1 Pro/Max, however, has 3 clusters: two P clusters with 4 P cores each, and one E cluster with 2 cores. If 4 or fewer threads need to be allocated at the same time, the system assigns them all to the first P cluster (P0), and the second P cluster remains unloaded and inactive; if more than 4 threads need to be allocated, the extra threads (the 5th to 8th) are assigned to the second P cluster (P1); if there are still more threads beyond those 8 (one or two more), they are allocated to the E cluster. In this case, the maximum frequency of the P cores is 3228 MHz and the maximum frequency of the E cores is 2064 MHz.
The M1 Ultra has a total of 5 clusters, each with 4 cores. It follows roughly the same strategy as the M1 Pro/Max, with the 4 P clusters filled in preference before the E cluster is used.
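These allocation patterns can be observed directly: spin up a number of CPU-bound tasks at a chosen QoS and watch the CPU History window or powermetrics while they run. The sketch below is only an illustration; the thread counts and durations are arbitrary.

```swift
import Dispatch
import Foundation

// Keep `count` CPU-bound tasks busy for `seconds` at the given QoS so the
// resulting cluster allocation can be observed in Activity Monitor or powermetrics.
func spin(threads count: Int, qos: DispatchQoS.QoSClass, seconds: TimeInterval) {
    let deadline = Date().addingTimeInterval(seconds)
    for _ in 0..<count {
        DispatchQueue.global(qos: qos).async {
            var x = 1.0
            while Date() < deadline {
                x = sin(x) + 1.0   // arbitrary busy work
            }
        }
    }
    Thread.sleep(forTimeInterval: seconds + 1)
}

// 4 high-QoS threads: expected to stay on a single P cluster (P0).
spin(threads: 4, qos: .userInitiated, seconds: 30)

// 10 high-QoS threads on an M1 Pro/Max: expected to spill from P0 into P1
// and then onto the E cluster.
spin(threads: 10, qos: .userInitiated, seconds: 30)
```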
However, there are two cases where the code appears to run only on a single core.
The first happens during the boot process, when code runs only on a single E core before the kernel initializes and brings up the other cores. The other happens after a macOS update has been downloaded and is being readied: the 5 macOS update threads are given active residency on just one P core of the M1 Pro/Max chip, in the first of its two P clusters (P0, labeled Core 3).
This unusual single-core activity persists throughout the roughly 30 minutes it takes to prepare the update for installation.

Patterns under load
Here are a few typical examples of macOS policies affecting scheduling, taken from the CPU history window of the Activity Monitor.
Currently, Activity Monitor does not provide one important piece of information about M-series processors: cluster frequency. At 100% CPU load (equivalent to full active residency), a cluster running below 1000 MHz completes instructions at less than half the rate of the same cluster running at 2064 MHz. Unfortunately, the only way to obtain frequency information at present is the command-line tool powermetrics.
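As a small convenience (not part of the original article), the sketch below shells out to powermetrics for a single cpu_power sample and prints only the lines that mention frequency; it must be run as root, and the exact output format may differ between macOS versions.

```swift
import Foundation

// Run one cpu_power sample of powermetrics and print the lines that mention
// frequency. powermetrics requires root, so run this sketch with sudo.
let task = Process()
task.executableURL = URL(fileURLWithPath: "/usr/bin/powermetrics")
task.arguments = ["--samplers", "cpu_power", "-n", "1"]

let pipe = Pipe()
task.standardOutput = pipe

do {
    try task.run()
    // Read before waiting, so a full pipe buffer cannot stall the child.
    let data = pipe.fileHandleForReading.readDataToEndOfFile()
    task.waitUntilExit()
    let output = String(decoding: data, as: UTF8.self)
    for line in output.split(separator: "\n") where line.localizedCaseInsensitiveContains("frequency") {
        print(line)   // e.g. the per-cluster "HW active frequency" lines
    }
} catch {
    print("failed to launch powermetrics: \(error)")
}
```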
The above is a summary of how macOS manages the CPU cores in the M1, M1 Pro and M1 Max chips. Information on the M1 Ultra is still being compiled and will be added as it becomes available. If you use an M1 Ultra, are familiar with it, and would like to help, please feel free to contact the author, Dr. Howard Oakley.
Thanks to Walt for the info on the Ultra and the screenshots under load.