
Parse files in parallel when possible #21175

Open
ilevkivskyi wants to merge 10 commits into python:master from ilevkivskyi:parallel-native-parse

Conversation

@ilevkivskyi (Member) commented Apr 5, 2026

The idea is simple: the new parser doesn't need the GIL, so we can parse files in parallel. Because it is tricky to restrict parallelization to only the parallelizable code, the most I see is a ~4-5x speed-up with 8 threads; adding more threads doesn't make it visibly faster (I have 16 physical cores).

Some notes on implementation:

  • I use the stdlib ThreadPoolExecutor; it seems to work OK.
  • I refactored parse_file() a bit so that we can parallelize (mostly) just the actual parsing. I see measurable degradation if I try to parallelize all of parse_file().
  • I do not use psutil because it is an optional dependency. We may want to make it a required dependency at some point.
  • It looks like there is a weird mypyc bug that causes ast_serialize to be None sometimes in some threads. I simply add an ugly workaround for now.
  • It looks like I need to apply wrap_context() more consistently now. A bunch of tests used to pass accidentally before.
  • I only implement parallelization in the coordinator process. The workers counterpart can be done after #21119 (Split type-checking into interface and implementation in parallel workers) is merged (it will be trivial).
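The overall pattern described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: `parse_one` is a hypothetical stand-in for the real parser, which can run without the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_one(text: str) -> int:
    # Stand-in for the real (GIL-free) parser; returns a trivial
    # "AST size" so the sketch is runnable on its own.
    return len(text.split())


def parse_sources_parallel(sources: list[str], max_threads: int = 8) -> list[int]:
    """Parse each source in a worker thread and collect results in order."""
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        # Submit all files, then gather results in submission order;
        # future.result() re-raises any exception from the worker.
        futures = [executor.submit(parse_one, src) for src in sources]
        return [future.result() for future in futures]
```

With a pure-Python callable the GIL still serializes the work; the speed-up the PR reports comes from the compiled parser releasing the GIL.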

cc @JukkaL


mypy/build.py (outdated):

```python
# TODO: we should probably use psutil instead.
# With psutil we can get the number of physical cores, while all stdlib
# functions include virtual cores (which is not optimal for performance).
available_threads = os.cpu_count() or 2  # conservative fallback
```
Contributor:
Yes, len(psutil.Process().cpu_affinity()) is better everywhere except Darwin/macOS, where psutil doesn't support that; though I still suggest taking the minimum of that and os.cpu_count(), as the latter respects -X cpu_count and/or PYTHON_CPU_COUNT for Python 3.13+ users (especially containerized users).

If you don't want to add a psutil dependency yet, I recommend os.sched_getaffinity(0), which is how os.process_cpu_count() is implemented on Python 3.13+. (You should also still call os.cpu_count() and use it if it is smaller, for the same reasons as above.)

Member Author:

Yeah, I tried sched_getaffinity() but it is not available on Python 3.10 (which we still support). I guess we may need to write a separate helper with various fallback logic to make this ~reliable.

@mr-c (Contributor) commented Apr 6, 2026

Huh, I see os.sched_getaffinity() all the way back to Python 3.3: https://docs.python.org/3.3/library/os.html#os.sched_getaffinity

However, the docs note "They are only available on some Unix platforms", so maybe your platform didn't implement it until a later Python version.

Yeah, a helper function plus memoization is very helpful here.
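A memoized helper along the lines discussed above might look like this. This is a sketch combining the suggestions in this thread (affinity where supported, capped by os.cpu_count(), skipped on macOS); the actual helper added to mypy/util.py may differ.

```python
import functools
import os
import sys


@functools.cache  # memoize: the CPU count won't change during a build
def get_available_threads() -> int:
    """Best-effort available-CPU count with conservative fallbacks."""
    # os.cpu_count() respects -X cpu_count / PYTHON_CPU_COUNT on 3.13+;
    # fall back to 2 if it cannot determine anything.
    cpu_count = os.cpu_count() or 2
    if sys.platform != "darwin" and hasattr(os, "sched_getaffinity"):
        # Respect CPU affinity masks (e.g. in containers) where the
        # platform supports it; take the minimum of both signals.
        cpu_count = min(cpu_count, len(os.sched_getaffinity(0)))
    return cpu_count
```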


@ilevkivskyi (Member Author):

OK, as suggested by @JukkaL I did some more investigation. Measuring just the parsing time in isolation shows that the overhead comes from a small non-parallelizable portion of parse_file_inner(). By extracting the purely parallelizable part we could potentially win another ~30ms (i.e. ~2%). This would, however, make the code uglier, so although we can do it at some point, it doesn't look worth it at the moment.

I updated the comment to reflect this, and also added a more robust CPU count helper as suggested before.


@ilevkivskyi (Member Author):

It looks like I need to apply wrap_context() more consistently now; a bunch of tests used to pass accidentally before. I will update the PR description to mention this.


@JukkaL (Collaborator) commented Apr 9, 2026

I'm getting this error when trying to compile mypy on macOS:

```text
mypy/util.py:1004: error: "Process" has no attribute "cpu_affinity"  [attr-defined]
```

@ilevkivskyi (Member Author):

> I'm getting this error when trying to compile mypy on macOS

OK, I will replace the try/except with `if sys.platform != "darwin"` then.

mypy/util.py (outdated):

```python
    cpu_count = psutil_cpu_count or os_cpu_count
else:
    # A conservative fallback in case we cannot determine CPU count in any way.
    cpu_count = 2
```
Collaborator:

I think we can use a higher number as the fallback, since even most consumer hardware these days has at least 4 cores. Maybe use 6, or at least 4? In CI the number of CPUs may be more limited, but CI is most commonly using Linux, and there we can determine the CPU count. Also, if --num-workers N is used, we can use at least N threads.

```python
for key, value in sorted(self.stats_summary().items()):
    print(f"{key + ':':24}{value}")

def parse_all(self, states: Iterable[State]) -> None:
```
Collaborator:

Iterable is somewhat questionable as a type, since we iterate over states more than once. Container[State] could be a slightly better type.
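The pitfall behind this suggestion: an `Iterable` may be a one-shot generator, so a second iteration silently yields nothing, whereas a `Container`-like collection can be traversed repeatedly. A small demonstration:

```python
from collections.abc import Iterable


def count_twice(items: Iterable[int]) -> tuple[int, int]:
    # Iterate the same iterable two times; a generator is exhausted
    # after the first pass, so the second count comes back as 0.
    first = sum(1 for _ in items)
    second = sum(1 for _ in items)
    return first, second
```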

```python
def parse_file_inner(self, source: str, raw_data: FileRawData | None = None) -> None:
    t0 = time_ref()
    self.manager.log(f"Parsing {self.xpath} ({self.id})")
    with self.wrap_context():
```
Collaborator:

The error context and the Errors object seem to be shared between threads, so there is a race condition. We could perhaps use a separate Errors instance per thread as a quick fix.

A more efficient approach would be to have parse just return a list of error descriptions, and then the main thread would report any errors via the shared Errors instance. This way we wouldn't need to set up error context for each file unless there are errors (which is rare), so the sequential part would do less work, possibly increasing the maximum possible concurrency.

It's okay to do something simple at first. We can always improve it later, but it would be good to fix race conditions before merging.

Also in BuildManager.parse_file, the add_stats call is racy. In this case we could add a stats_enabled check to quickly skip stats reporting most of the time (again reducing sequential part), and then use a lock around stats updates if stats are enabled.
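The lock-around-stats idea could look roughly like this. The class and attribute names here are hypothetical, for illustration only; the real BuildManager differs.

```python
import threading


class StatsRecorder:
    """Thread-safe stats accumulator; locks only when stats are enabled."""

    def __init__(self, stats_enabled: bool) -> None:
        self.stats_enabled = stats_enabled
        self._lock = threading.Lock()
        self._stats: dict[str, float] = {}

    def add_stats(self, **kwargs: float) -> None:
        if not self.stats_enabled:
            return  # fast path: skip locking in the common case
        with self._lock:
            # Guard the read-modify-write so concurrent parse threads
            # cannot lose updates.
            for key, value in kwargs.items():
                self._stats[key] = self._stats.get(key, 0.0) + value
```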

Member Author:

Yes, indeed there are race conditions. But IIUC the extra wrap_context() calls I added are needed anyway. Without those we can get incorrect error ordering even in sequential mode in some edge cases.

@github-actions (bot) commented Apr 9, 2026

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

```python
available_threads = max(get_available_threads(), self.options.num_workers)
# Overhead from trying to parallelize (small) blocking portion of
# parse_file_inner() results in no visible improvement with more than 8 threads.
with ThreadPoolExecutor(max_workers=min(available_threads, 8)) as executor:
```
Collaborator:

Reusing the ThreadPoolExecutor could also help scaling a bit, based on some quick microbenchmarks (this is a potential future improvement).

```python
# the side effect of parsing inline mypy configurations.
state.get_source()
if state.id not in self.ast_cache:
    futures.append(executor.submit(state.parse_file_inner, state.source or ""))
```
Collaborator:

Another potential future improvement/experiment: sending multiple small files in a single submit call could improve scaling a bit, but likely only for small files or if the batch is large enough.
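The batching experiment could be sketched like this: each submitted task handles a whole chunk of files, amortizing per-submit overhead for many small files. All names here are illustrative stand-ins, not the PR's code:

```python
from concurrent.futures import ThreadPoolExecutor


def parse_batch(texts: list[str]) -> list[int]:
    # Stand-in parser: one task processes a whole batch of small files.
    return [len(t.split()) for t in texts]


def parse_in_batches(texts: list[str], batch_size: int = 4) -> list[int]:
    # Chunk the inputs so each executor.submit carries several files.
    batches = [texts[i : i + batch_size] for i in range(0, len(texts), batch_size)]
    with ThreadPoolExecutor(max_workers=8) as executor:
        futures = [executor.submit(parse_batch, batch) for batch in batches]
        results: list[int] = []
        for future in futures:
            results.extend(future.result())
        return results
```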

@ilevkivskyi (Member Author):

@JukkaL thanks for the comments! I will take a look at them later today or tomorrow, focusing on the correctness issues; some of the performance things I am going to leave for a follow-up PR.


3 participants