
Parse files in parallel when possible #21175

Open
ilevkivskyi wants to merge 10 commits into python:master from ilevkivskyi:parallel-native-parse

Conversation

@ilevkivskyi (Member) commented Apr 5, 2026

The idea is simple: the new parser doesn't need the GIL, so we can parse files in parallel. Because it is tricky to restrict parallelization to only the parallelizable code, the most I see is a ~4-5x speed-up with 8 threads; adding more threads doesn't make it visibly faster (I have 16 physical cores).

Some notes on implementation:

  • I use the stdlib ThreadPoolExecutor; it seems to work OK.
  • I refactored parse_file() a bit so that we can parallelize (mostly) just the actual parsing. I see measurable degradation if I try to parallelize all of parse_file().
  • I do not use psutil because it is an optional dependency. We may want to make it a required dependency at some point.
  • It looks like there is a weird mypyc bug that causes ast_serialize to be None sometimes in some threads. I simply add an ugly workaround for now.
  • It looks like I need to apply wrap_context() more consistently now. A bunch of tests used to pass accidentally before.
  • I only implement parallelization in the coordinator process. The workers counterpart can be done after #21119 (Split type-checking into interface and implementation in parallel workers) is merged (it will be trivial).
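The overall pattern described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: `parse_one` is a hypothetical stand-in for the real parser, which can run without the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_one(text: str) -> int:
    # Stand-in for the real (GIL-free) parser; returns a trivial
    # "AST size" so the sketch is runnable on its own.
    return len(text.split())


def parse_sources_parallel(sources: list[str], max_threads: int = 8) -> list[int]:
    """Parse each source in a worker thread and collect results in order."""
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        # Submit all files, then gather results in submission order;
        # future.result() re-raises any exception from the worker.
        futures = [executor.submit(parse_one, src) for src in sources]
        return [future.result() for future in futures]
```

With a pure-Python callable the GIL still serializes the work; the speed-up the PR reports comes from the compiled parser releasing the GIL.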

cc @JukkaL


mypy/build.py (outdated):

```python
# TODO: we should probably use psutil instead.
# With psutil we can get the number of physical cores, while all stdlib
# functions include virtual cores (which is not optimal for performance).
available_threads = os.cpu_count() or 2  # conservative fallback
```
Contributor:
Yes, len(psutil.Process().cpu_affinity()) is better everywhere except Darwin/macOS, where psutil doesn't support that; though I still suggest taking the minimum of that and os.cpu_count(), as the latter respects -X cpu_count and/or PYTHON_CPU_COUNT for Python 3.13+ users (especially containerized users).

If you don't want to add a psutil dependency yet, I recommend os.sched_getaffinity(0), which is how os.process_cpu_count() is implemented on Python 3.13+. (You should also still call os.cpu_count() and use it if it is smaller, for the same reasons as above.)

Member Author:

Yeah, I tried sched_getaffinity() but it is not available on Python 3.10 (which we still support). I guess we may need to write a separate helper with various fallback logic to make this ~reliable.

@mr-c (Contributor) commented Apr 6, 2026

Huh, I see os.sched_getaffinity() all the way back to Python 3.3: https://docs.python.org/3.3/library/os.html#os.sched_getaffinity

However, the docs note "They are only available on some Unix platforms", so maybe your platform didn't implement it until a later Python version.

Yeah, a helper function plus memoization is very helpful here.
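A memoized helper along the lines discussed above might look like this. This is a sketch combining the suggestions in this thread (affinity where supported, capped by os.cpu_count(), skipped on macOS); the actual helper added to mypy/util.py may differ.

```python
import functools
import os
import sys


@functools.cache  # memoize: the CPU count won't change during a build
def get_available_threads() -> int:
    """Best-effort available-CPU count with conservative fallbacks."""
    # os.cpu_count() respects -X cpu_count / PYTHON_CPU_COUNT on 3.13+;
    # fall back to 2 if it cannot determine anything.
    cpu_count = os.cpu_count() or 2
    if sys.platform != "darwin" and hasattr(os, "sched_getaffinity"):
        # Respect CPU affinity masks (e.g. in containers) where the
        # platform supports it; take the minimum of both signals.
        cpu_count = min(cpu_count, len(os.sched_getaffinity(0)))
    return cpu_count
```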


@ilevkivskyi (Member Author):

OK, as suggested by @JukkaL I did some more investigation. Measuring just the parsing time in isolation shows that the overhead comes from a small non-parallelizable portion of parse_file_inner(). By extracting the purely parallelizable part we could potentially win another ~30ms (i.e. ~2%). This would, however, make the code uglier, so although we can do it at some point, it doesn't look worth it at the moment.

I updated the comment to reflect this, and also added a more robust CPU count helper as suggested before.


@ilevkivskyi (Member Author):

It looks like I need to apply wrap_context() more consistently now; a bunch of tests used to pass accidentally before. I will update the PR description to mention this.


@JukkaL (Collaborator) commented Apr 9, 2026

I'm getting this error when trying to compile mypy on macOS:

```text
mypy/util.py:1004: error: "Process" has no attribute "cpu_affinity"  [attr-defined]
```

@ilevkivskyi (Member Author):

> I'm getting this error when trying to compile mypy on macOS

OK, I will replace the try/except with `if sys.platform != "darwin"` then.

mypy/util.py (outdated):

```python
    cpu_count = psutil_cpu_count or os_cpu_count
else:
    # A conservative fallback in case we cannot determine CPU count in any way.
    cpu_count = 2
```
Collaborator:

I think we can use a higher number as the fallback, since even most consumer hardware these days has at least 4 cores. Maybe use 6, or at least 4? In CI the number of CPUs may be more limited, but CI is most commonly using Linux, and there we can determine the CPU count. Also, if --num-workers N is used, we can use at least N threads.

```python
for key, value in sorted(self.stats_summary().items()):
    print(f"{key + ':':24}{value}")

def parse_all(self, states: Iterable[State]) -> None:
```
Collaborator:

Iterable is somewhat questionable as a type, since we iterate over states more than once. Container[State] could be a slightly better type.
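The pitfall behind this suggestion: an `Iterable` may be a one-shot generator, so a second iteration silently yields nothing, whereas a `Container`-like collection can be traversed repeatedly. A small demonstration:

```python
from collections.abc import Iterable


def count_twice(items: Iterable[int]) -> tuple[int, int]:
    # Iterate the same iterable two times; a generator is exhausted
    # after the first pass, so the second count comes back as 0.
    first = sum(1 for _ in items)
    second = sum(1 for _ in items)
    return first, second
```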

```python
def parse_file_inner(self, source: str, raw_data: FileRawData | None = None) -> None:
    t0 = time_ref()
    self.manager.log(f"Parsing {self.xpath} ({self.id})")
    with self.wrap_context():
```
Collaborator:

The error context and the Errors object seem to be shared between threads, so there is a race condition. We could perhaps use a separate Errors instance per thread as a quick fix.

A more efficient approach would be to have parse just return a list of error descriptions, and then the main thread would report any errors via the shared Errors instance. This way we wouldn't need to set up error context for each file unless there are errors (which is rare), so the sequential part would do less work, possibly increasing the maximum possible concurrency.

It's okay to do something simple at first. We can always improve it later, but it would be good to fix race conditions before merging.

Also in BuildManager.parse_file, the add_stats call is racy. In this case we could add a stats_enabled check to quickly skip stats reporting most of the time (again reducing sequential part), and then use a lock around stats updates if stats are enabled.
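The lock-around-stats idea could look roughly like this. The class and attribute names here are hypothetical, for illustration only; the real BuildManager differs.

```python
import threading


class StatsRecorder:
    """Thread-safe stats accumulator; locks only when stats are enabled."""

    def __init__(self, stats_enabled: bool) -> None:
        self.stats_enabled = stats_enabled
        self._lock = threading.Lock()
        self._stats: dict[str, float] = {}

    def add_stats(self, **kwargs: float) -> None:
        if not self.stats_enabled:
            return  # fast path: skip locking in the common case
        with self._lock:
            # Guard the read-modify-write so concurrent parse threads
            # cannot lose updates.
            for key, value in kwargs.items():
                self._stats[key] = self._stats.get(key, 0.0) + value
```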

Member Author:

Yes, indeed there are race conditions. But IIUC the extra wrap_context() calls I added are needed anyway. Without those we can get incorrect error ordering even in sequential mode in some edge cases.

@github-actions (bot) commented Apr 9, 2026

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

```python
available_threads = max(get_available_threads(), self.options.num_workers)
# Overhead from trying to parallelize (small) blocking portion of
# parse_file_inner() results in no visible improvement with more than 8 threads.
with ThreadPoolExecutor(max_workers=min(available_threads, 8)) as executor:
```
Collaborator:

Reusing the ThreadPoolExecutor could also help scaling a bit, based on some quick microbenchmarks (this is a potential future improvement).

```python
# the side effect of parsing inline mypy configurations.
state.get_source()
if state.id not in self.ast_cache:
    futures.append(executor.submit(state.parse_file_inner, state.source or ""))
```
Collaborator:

Another potential future improvement/experiment: sending multiple small files in a single submit call could improve scaling a bit, but likely only for small files or if the batch is large enough.
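The batching experiment could be sketched like this: each submitted task handles a whole chunk of files, amortizing per-submit overhead for many small files. All names here are illustrative stand-ins, not the PR's code:

```python
from concurrent.futures import ThreadPoolExecutor


def parse_batch(texts: list[str]) -> list[int]:
    # Stand-in parser: one task processes a whole batch of small files.
    return [len(t.split()) for t in texts]


def parse_in_batches(texts: list[str], batch_size: int = 4) -> list[int]:
    # Chunk the inputs so each executor.submit carries several files.
    batches = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(parse_batch, batch) for batch in batches]
        results: list[int] = []
        for future in futures:
            results.extend(future.result())
        return results
```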

@ilevkivskyi (Member Author):

@JukkaL thanks for the comments! I will take a look at them later today or tomorrow, focusing on the correctness issues; some of the performance things I am going to leave for a follow-up PR.


3 participants