Parse files in parallel when possible#21175
ilevkivskyi wants to merge 10 commits into python:master from
Conversation
mypy/build.py
Outdated
# TODO: we should probably use psutil instead.
# With psutil we can get a number of physical cores, while all stdlib
# functions include virtual cores (which is not optimal for performance).
available_threads = os.cpu_count() or 2  # conservative fallback
Yes, len(psutil.Process().cpu_affinity()) is better everywhere except Darwin/macOS, where psutil doesn't support that; though I still suggest taking the minimum of that and os.cpu_count(), as the latter respects -X cpu_count and/or PYTHON_CPU_COUNT for Python 3.13+ users (especially containerized users).
If you don't want to add a psutil dependency yet, I recommend os.sched_getaffinity(0) which is how os.process_cpu_count() is implemented on Python 3.13+. (you should also still call os.cpu_count() and use it if it is smaller, for the same reasons as above).
Yeah, I tried sched_getaffinity() but it is not available on Python 3.10 (which we still support). I guess we may need to write a separate helper with various fallback logic to make this ~reliable.
Huh, I see os.sched_getaffinity() all the way back to Python 3.3: https://docs.python.org/3.3/library/os.html#os.sched_getaffinity
However,
They are only available on some Unix platforms
So maybe your platform didn't implement it until a later Python version.
Yeah, helper function + memoization is very helpful here
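The fallback chain discussed in this thread could be sketched roughly as follows; get_available_threads is an illustrative name, not necessarily the helper actually added in the PR, and the exact ordering of fallbacks is an assumption.

```python
import functools
import os


@functools.cache  # memoized, as suggested above
def get_available_threads() -> int:
    """Best-effort usable CPU count with graceful fallbacks."""
    counts: list[int | None] = []
    if hasattr(os, "process_cpu_count"):
        # Python 3.13+: respects -X cpu_count and PYTHON_CPU_COUNT.
        counts.append(os.process_cpu_count())
    elif hasattr(os, "sched_getaffinity"):
        # Only available on some Unix platforms.
        counts.append(len(os.sched_getaffinity(0)))
    counts.append(os.cpu_count())
    known = [c for c in counts if c]
    # Conservative fallback if nothing could be determined.
    return min(known) if known else 2
```

Taking the minimum mirrors the advice above: prefer the affinity-aware count, but never exceed os.cpu_count().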
OK, as suggested by @JukkaL I did some more investigation. Measuring just the parsing time in isolation, it looks like the overhead is coming from a small non-parallelizable portion of

I updated the comment to reflect this, and also added a more robust CPU count helper as suggested before.
It looks like I need to apply
I'm getting this error when trying to compile mypy on macOS:
OK, I will replace the try/except with |
mypy/util.py
Outdated
    cpu_count = psutil_cpu_count or os_cpu_count
else:
    # A conservative fallback in case we cannot determine CPU count in any way.
    cpu_count = 2
I think we can use a higher number as the fallback, since even most consumer hardware these days has at least 4 cores. Maybe use 6, or at least 4? In CI the number of CPUs may be more limited, but CI is most commonly using Linux, and there we can determine the CPU count. Also, if --num-workers N is used, we can use at least N threads.
for key, value in sorted(self.stats_summary().items()):
    print(f"{key + ':':24}{value}")
def parse_all(self, states: Iterable[State]) -> None:
Iterable is somewhat questionable as a type, since we iterate over states more than once. Container[State] could be a slightly better type.
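To illustrate why Iterable is risky when a value is traversed twice: a generator satisfies Iterable but is exhausted after the first pass.

```python
from typing import Iterable


def count_twice(states: Iterable[str]) -> tuple[int, int]:
    # An Iterable may be a one-shot generator: the second pass sees nothing.
    first = sum(1 for _ in states)
    second = sum(1 for _ in states)
    return first, second


count_twice(s for s in "abc")    # → (3, 0): generator is exhausted
count_twice(["a", "b", "c"])     # → (3, 3): a list survives both passes
```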
def parse_file_inner(self, source: str, raw_data: FileRawData | None = None) -> None:
    t0 = time_ref()
    self.manager.log(f"Parsing {self.xpath} ({self.id})")
    with self.wrap_context():
The error context and the Errors object seem to be shared between threads, so there is a race condition. We could perhaps use a separate Errors instance per thread as a quick fix.
A more efficient approach would be to have parse just return a list of error descriptions, and then the main thread would report any errors via the shared Errors instance. This way we wouldn't need to set up error context for each file unless there are errors (which is rare), so the sequential part would do less work, possibly increasing the maximum possible concurrency.
It's okay to do something simple at first. We can always improve it later, but it would be good to fix race conditions before merging.
Also, in BuildManager.parse_file, the add_stats call is racy. In this case we could add a stats_enabled check to quickly skip stats reporting most of the time (again reducing the sequential part), and then use a lock around stats updates if stats are enabled.
Yes, indeed there are race conditions. But IIUC the extra wrap_context() calls I added are needed anyway. Without those we can get incorrect error ordering even in sequential mode in some edge cases.
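The "return error descriptions, report on the main thread" idea discussed above could be sketched like this; ParseError, parse_one, and parse_all are hypothetical stand-ins, not mypy's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

ParseError = tuple[str, int, str]  # (path, line, message)


def parse_one(path: str, text: str) -> list[ParseError]:
    # Worker threads only *collect* errors; no shared state is mutated here.
    errors: list[ParseError] = []
    if "syntax error" in text:  # stand-in for a real parse failure
        errors.append((path, 1, "invalid syntax"))
    return errors


def parse_all(sources: dict[str, str]) -> list[ParseError]:
    collected: list[ParseError] = []
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(parse_one, p, t) for p, t in sources.items()]
        # Only the main thread aggregates, in submission order, so error
        # ordering stays deterministic and no lock is needed for reporting.
        for fut in futures:
            collected.extend(fut.result())
    return collected
```

The main thread would then feed the collected descriptions into the shared Errors instance, so no error context needs to be set up per file unless a file actually failed.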
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅
available_threads = max(get_available_threads(), self.options.num_workers)
# Overhead from trying to parallelize (small) blocking portion of
# parse_file_inner() results in no visible improvement with more than 8 threads.
with ThreadPoolExecutor(max_workers=min(available_threads, 8)) as executor:
Reusing the ThreadPoolExecutor could also help scaling a bit, based on some quick microbenchmarks (this is a potential future improvement).
# the side effect of parsing inline mypy configurations.
state.get_source()
if state.id not in self.ast_cache:
    futures.append(executor.submit(state.parse_file_inner, state.source or ""))
Another potential future improvement/experiment: sending multiple small files in a single submit call could improve scaling a bit, but likely only for small files or if the batch is large enough.
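The batching experiment above could start from a simple size-budgeted grouping; make_batches is a hypothetical helper, and using source size as the cost proxy is an assumption.

```python
def make_batches(sizes: dict[str, int], budget: int) -> list[list[str]]:
    """Group file paths so each batch stays under a rough size budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for path, size in sizes.items():
        if current and used + size > budget:
            # Flush the current batch; a single oversized file still
            # gets its own batch rather than being dropped.
            batches.append(current)
            current, used = [], 0
        current.append(path)
        used += size
    if current:
        batches.append(current)
    return batches
```

Each batch would then go into one executor.submit() call, with the worker looping over its files, so many tiny files amortize the per-task overhead.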
@JukkaL thanks for the comments! I will take a look at them later today or tomorrow, focusing on correctness issues; some of the performance things I am going to leave for a follow-up PR.
The idea is simple: the new parser doesn't need the GIL, so we can parse files in parallel. Because it is tricky to apply parallelization only to parallelizable code, the most I see is a ~4-5x speed-up with 8 threads; adding more threads doesn't make it visibly faster (I have 16 physical cores).
Some notes on implementation:
- I use ThreadPoolExecutor, it seems to work OK.
- I refactored parse_file() a bit, so that we can parallelize (mostly) just the actual parsing. I see measurable degradation if I try to parallelize all of parse_file().
- I don't use psutil because it is an optional dependency. We may want to actually make it a required dependency at some point.
- ast_serialize can be None sometimes in some threads. I simply add an ugly workaround for now.

cc @JukkaL