How does macOS manage so many cores in the M1 CPU?
Editor's note: This article was originally published on Dr. Howard Oakley's personal blog and has been translated, annotated, and posted with the author's permission. It has been lightly edited to add detail and to make it easier for general readers to follow.
On November 11, 2020, Apple launched the M1 chip in Cupertino, California. It was not only Apple's first ARM-based processor for personal computers, but also a powerful chip that excited even long-jaded enthusiasts.
But impressive results always rest on unseen work behind the scenes. For a chip designed and optimized specifically for the Mac, how exactly does macOS schedule programs on an M-series processor?
Asymmetric processor architecture
In previous Intel-based Mac models, all of the processor's cores were identical, making it a symmetric multiprocessing (SMP) architecture. The system's job was therefore simple: keep the load on each core roughly equal.
Open the CPU History window in Activity Monitor on an Intel Mac and you will notice that the chart is divided into two columns: the odd-numbered cores on the left are the real physical cores, while those on the right are the virtual cores provided by Intel Hyper-Threading. Under high load, the system spreads work evenly across all cores, while under lighter load it places work primarily on the real physical cores.
In total, four chips were released in the M1 series between 2020 and 2022, namely:
- M1 (2020)
- M1 Pro and M1 Max (2021)
- M1 Ultra (2022)
From the output of the command-line tool powermetrics, we can tell that the E cores have a maximum frequency of 2064 MHz, while the P cores reach 3204 MHz on the M1 and up to 3228 MHz on the M1 Pro/Max/Ultra. If the system simply kept scheduling threads as before, it would not only waste much of the P cores' frequency range, but also make programs running on the E cores significantly slower.
In addition, the M1 and the M1 Pro/Max/Ultra use quite different combinations of E and P cores, and each processor is available with different core counts, which makes the scheduling logic look cumbersome at first glance:
- M1 consists of one E cluster (containing 4 E cores) and one P cluster (containing 4 P cores) named E and P0, respectively
- M1 Pro/Max consists of one E cluster (containing 2 E cores) and two P clusters (each containing 4 P cores), named E, P0, and P1
- M1 Ultra consists of one E cluster (containing 4 E cores) and four P clusters (each containing 4 P cores), named E, P0, P1, P2, and P3
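If you want to check the core layout of the Mac in front of you, macOS exposes per-performance-level core counts through sysctl (the hw.nperflevels and hw.perflevelN keys, present on Apple silicon). The sketch below is only a quick illustration of reading them from Swift, and it reports counts per performance level rather than per cluster.

```swift
import Darwin

// Read an integer sysctl value by name; returns nil if the key is unavailable.
func sysctlInt(_ name: String) -> Int? {
    var value: CInt = 0
    var size = MemoryLayout<CInt>.size
    guard sysctlbyname(name, &value, &size, nil, 0) == 0 else { return nil }
    return Int(value)
}

// hw.nperflevels is 2 on Apple silicon: perflevel0 is the highest-performance
// level (P cores) and the last level is the most efficient (E cores).
let levels = sysctlInt("hw.nperflevels") ?? 1
for level in 0..<levels {
    let physical = sysctlInt("hw.perflevel\(level).physicalcpu") ?? 0
    let logical = sysctlInt("hw.perflevel\(level).logicalcpu") ?? 0
    print("perflevel\(level): \(physical) physical / \(logical) logical cores")
}
```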
Consider an extreme example of how thread control works, such as Logic Pro importing a large amount of material.
In actual application development, macOS does not provide a public API that lets an application run on specific cores, core types, or clusters directly. Instead, threads are typically managed through Grand Central Dispatch using Quality of Service (QoS) classes, and macOS then uses those settings to decide how to schedule each thread.
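As a rough sketch of what that looks like in code (the work items here are only placeholders), an app submits tasks to a global queue with a QoS class and lets macOS decide where they run. Numerically, .background corresponds to the "QoS 9" referred to later in this article, and .userInitiated to 25.

```swift
import Dispatch
import Foundation

// Lowest commonly used QoS class; its raw value is 9 (QOS_CLASS_BACKGROUND),
// the "QoS 9" referred to later in this article.
DispatchQueue.global(qos: .background).async {
    print("background work, e.g. indexing or backups")
}

// A higher QoS class (raw value 25, QOS_CLASS_USER_INITIATED) for work the
// user is actively waiting on.
DispatchQueue.global(qos: .userInitiated).async {
    print("user-initiated work, e.g. opening a document")
}

// Keep this command-line sketch alive long enough for the blocks to run.
Thread.sleep(forTimeInterval: 1)
```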
In practice, threads with the lowest QoS are only dispatched to the E cluster, while threads with a higher QoS may be dispatched to either E or P clusters. Although this behavior can be changed dynamically with the command-line tool taskpolicy, or with the setpriority() function in code, that only applies to higher-QoS threads; the rule that the lowest-QoS threads run only on E clusters remains unchanged.
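As a minimal sketch of the in-code route (the constants below are local copies of values from <sys/resource.h>, so treat the details as an assumption rather than a documented API), a process can move itself into and out of the background band with setpriority(), much as taskpolicy -b / -B does from the command line:

```swift
import Darwin

// Values mirrored from <sys/resource.h>: PRIO_DARWIN_PROCESS and PRIO_DARWIN_BG.
let prioDarwinProcess: Int32 = 4        // 'which' selector: act on a whole process
let prioDarwinBG: Int32 = 0x1000        // background band ("E cluster only")

// Demote the calling process (who = 0) to the background band, roughly what
// `taskpolicy -b -p <pid>` does for a running process.
if setpriority(prioDarwinProcess, 0, prioDarwinBG) != 0 {
    perror("setpriority(background)")
}

// ... work done here is confined to the E cluster ...

// Restore normal scheduling, roughly equivalent to `taskpolicy -B -p <pid>`.
if setpriority(prioDarwinProcess, 0, 0) != 0 {
    perror("setpriority(restore)")
}
```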
Background threads
Because the E clusters on the M1 and the M1 Pro/Max differ in size, with 4 E cores in the former but only 2 in the latter, lowest-QoS threads are loaded and run differently on the two designs.
When running threads with a QoS of 9 on an M1 with 4 E cores, each E core runs at around 1000 MHz (1 GHz). On an M1 Pro/Max with only 2 E cores, the E cores also run at 1000 MHz if there is only one such thread, but with two or more the frequency of each E core rises to 2064 MHz. This design ensures that the E cluster in the M1 Pro/Max delivers at least the same background-task performance as the M1, despite the difference in cluster size.
Of course, there are still exceptions: threads such as those of backupd, which have the lowest QoS and are also subject to I/O throttling, always run at around 1000 MHz, even on the M1 Pro/Max.
User-initiated threads
All threads with a QoS higher than 9 are handled in a similar way; the only difference between them is their priority. Higher-QoS threads are eligible to run on any core or cluster, though they are handled differently on the M1 and the M1 Pro/Max.
On the M1, there is only one P cluster and one E cluster, giving a total of 8 physical cores, so at most 8 threads can be assigned to these two clusters at any one time, 4 per cluster. If 4 or fewer threads need to be assigned at the same time, the system tries to run them on the P cluster, unless more threads of higher QoS are waiting in the queue, in which case the E cluster is also used to run such tasks. In this case, the maximum frequency of the P cores is about 3 GHz and that of the E cores about 2 GHz, twice the frequency used when running threads with a QoS of 9.
The M1 Pro/Max, however, has 3 clusters: two P clusters with 4 P cores each, and one E cluster with 2 cores. If 4 or fewer threads need to be allocated at the same time, the system assigns them all to the first P cluster (P0), and the second P cluster remains unloaded and inactive; if more than 4 threads need to be allocated, the extra threads (the 5th to 8th) are assigned to the second P cluster (P1); if there are still more threads beyond those 8 (one or two more), they are allocated to the E cluster. In this case, the maximum frequency of the P cores is 3228 MHz and the maximum frequency of the E cores is 2064 MHz.
The M1 Ultra has a total of 5 clusters, each with 4 cores. It follows roughly the same strategy as the M1 Pro/Max, with the 4 P clusters filled in preference before the E cluster is used.
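These allocation patterns can be observed directly: spin up a number of CPU-bound tasks at a chosen QoS and watch the CPU History window or powermetrics while they run. The sketch below is only an illustration; the thread counts and durations are arbitrary.

```swift
import Dispatch
import Foundation

// Keep `count` CPU-bound tasks busy for `seconds` at the given QoS so the
// resulting cluster allocation can be observed in Activity Monitor or powermetrics.
func spin(threads count: Int, qos: DispatchQoS.QoSClass, seconds: TimeInterval) {
    let deadline = Date().addingTimeInterval(seconds)
    for _ in 0..<count {
        DispatchQueue.global(qos: qos).async {
            var x = 1.0
            while Date() < deadline {
                x = sin(x) + 1.0   // arbitrary busy work
            }
        }
    }
    Thread.sleep(forTimeInterval: seconds + 1)
}

// 4 high-QoS threads: expected to stay on a single P cluster (P0).
spin(threads: 4, qos: .userInitiated, seconds: 30)

// 10 high-QoS threads on an M1 Pro/Max: expected to spill from P0 into P1
// and then onto the E cluster.
spin(threads: 10, qos: .userInitiated, seconds: 30)
```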
However, there are two cases where the code appears to run only on a single core.
The first happens during the boot process, when code runs only on a single E core before the kernel initializes and brings up the other cores. The other happens after a macOS update has been downloaded and is being readied: the 5 macOS update threads are given active residency on just one P core of the M1 Pro/Max chip, in the first of its two P clusters (P0, labeled Core 3).
This unusual single-core activity persists throughout the roughly 30 minutes it takes to prepare the update for installation.

Patterns under load
Here are a few typical examples of macOS policies affecting scheduling, taken from the CPU history window of the Activity Monitor.
Currently, Activity Monitor does not provide one important piece of information about M-series processors: cluster frequency. At 100% CPU load (equivalent to full active residency), a cluster running below 1000 MHz completes instructions at less than half the rate of the same cluster running at 2064 MHz. Unfortunately, the only way to obtain frequency information at present is the command-line tool powermetrics.
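As a small convenience (not part of the original article), the sketch below shells out to powermetrics for a single cpu_power sample and prints only the lines that mention frequency; it must be run as root, and the exact output format may differ between macOS versions.

```swift
import Foundation

// Run one cpu_power sample of powermetrics and print the lines that mention
// frequency. powermetrics requires root, so run this sketch with sudo.
let task = Process()
task.executableURL = URL(fileURLWithPath: "/usr/bin/powermetrics")
task.arguments = ["--samplers", "cpu_power", "-n", "1"]

let pipe = Pipe()
task.standardOutput = pipe

do {
    try task.run()
    // Read before waiting, so a full pipe buffer cannot stall the child.
    let data = pipe.fileHandleForReading.readDataToEndOfFile()
    task.waitUntilExit()
    let output = String(decoding: data, as: UTF8.self)
    for line in output.split(separator: "\n") where line.localizedCaseInsensitiveContains("frequency") {
        print(line)   // e.g. the per-cluster "HW active frequency" lines
    }
} catch {
    print("failed to launch powermetrics: \(error)")
}
```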
The above is a summary of how macOS manages the CPU cores in the M1, M1 Pro and M1 Max chips. Information on the M1 Ultra is still being compiled and will be added as it becomes available. If you use an M1 Ultra, are familiar with it, and would like to help, please feel free to contact the author, Dr. Howard Oakley.
Thanks to Walt for the info on the Ultra and the screenshots under load.