Starting the formal discussion we've alluded to in past meetings...
Right now I'm starting to transcode all videos on EMBER and DANDI into a common encoding & resolution space
This approach ends up being multi-purpose:
- Facilitates more efficient web streaming
  - This was the original motivation, given how many videos on DANDI simply cannot be played on the web
  - A side effect is that random frame seeking through Pozu performs much better than it did on the original videos. This would no doubt have a similar effect on the SLEAP web app
- Another side effect is reduced storage size
  - I was able to take one chronic recording (100 GB, H.264 encoding in an MP4 container) and reduce it to 10 GB without a very noticeable loss in rendering quality
  - (you can maybe tell one is higher res if played side by side, but it's doubtful this would affect coarse-grained pose estimation)
`sleap reencode` behaves very similarly to what I'm doing (with some modifications, I might be able to leverage it directly) but is more general in the options it exposes:
- The MAJOR difference is that `sleap reencode` scales the video space in a way that can be reversed (that is, pose labels in the lower resolution can in theory be mapped back to the original pixel space), whereas mine is mostly non-reversible (technically the math could be done for some cases, but not in general)
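To make the "reversible" point concrete, here is a minimal sketch of mapping pose keypoints between the original and the downscaled pixel space. It assumes a plain per-axis rescale with optional padding, and that the scale factors are recorded alongside the reencoded video. `ScaleTransform` and `transform_for_resize` are hypothetical names for illustration, not part of the SLEAP API, and the sketch ignores pixel-center conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleTransform:
    """Hypothetical record of how a video was resized (not a SLEAP type)."""

    scale_x: float  # reencoded_width / original_width
    scale_y: float  # reencoded_height / original_height
    pad_x: float = 0.0  # left padding added after scaling, in reencoded pixels
    pad_y: float = 0.0  # top padding added after scaling, in reencoded pixels

    def to_reencoded(self, x: float, y: float) -> tuple[float, float]:
        """Map a point from original pixel space into reencoded pixel space."""
        return x * self.scale_x + self.pad_x, y * self.scale_y + self.pad_y

    def to_original(self, x: float, y: float) -> tuple[float, float]:
        """Map a point from reencoded pixel space back to original pixel space."""
        return (x - self.pad_x) / self.scale_x, (y - self.pad_y) / self.scale_y


def transform_for_resize(
    orig_size: tuple[int, int], new_size: tuple[int, int]
) -> ScaleTransform:
    """Build the transform for a pure resize from (width, height) to (width, height)."""
    orig_w, orig_h = orig_size
    new_w, new_h = new_size
    return ScaleTransform(scale_x=new_w / orig_w, scale_y=new_h / orig_h)


# Example: a 1920x1080 video reencoded at 960x540.
transform = transform_for_resize((1920, 1080), (960, 540))
label_in_original = transform.to_original(480.0, 270.0)
```

The mapping is only as good as the stored metadata: if the scale factors or padding are not kept with the reencoded file, the labels cannot be mapped back.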
My main scientific question at this point: toward the goal of training species-specific foundation models, is it more important that the video data be standardized into:
- a common resolution space (so that the same pixel coordinates map across all training data)
- a common physical space (so that all pixels in the reencoded space represent the same physical scale; this no doubt requires additional metadata about centimeters per pixel in each video, which might be recoverable from the videos themselves if items such as rulers are in the field of view, or if the full arena is in view and the arena size is known)
- some combination of both (with padding or scaling used to facilitate)?
I suppose we might not know for sure unless we try all three, but I wanted to get people's thoughts on the matter at this point
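For anyone who wants to compare the three options numerically, here is a rough sketch of the arithmetic each one implies. Everything in it is an assumption for illustration: the function names are hypothetical, the centimeters-per-pixel value is taken as already known for each video, and option 3 is interpreted as "rescale to a common physical scale, then center-pad onto a fixed canvas".

```python
def common_resolution_scale(
    orig_size: tuple[int, int], target_size: tuple[int, int]
) -> tuple[float, float]:
    """Option 1: per-axis factors that bring any video to one fixed resolution."""
    orig_w, orig_h = orig_size
    target_w, target_h = target_size
    return target_w / orig_w, target_h / orig_h


def physical_scale_factor(cm_per_px: float, target_cm_per_px: float) -> float:
    """Option 2: uniform factor so each output pixel spans target_cm_per_px."""
    return cm_per_px / target_cm_per_px


def physical_then_pad(
    orig_size: tuple[int, int],
    cm_per_px: float,
    target_cm_per_px: float,
    canvas_size: tuple[int, int],
) -> tuple[float, tuple[int, int], tuple[int, int]]:
    """Option 3: rescale to a common physical scale, then center on a fixed canvas.

    Returns (scale factor, scaled (width, height), (left, top) padding).
    """
    orig_w, orig_h = orig_size
    canvas_w, canvas_h = canvas_size
    factor = physical_scale_factor(cm_per_px, target_cm_per_px)
    scaled_w, scaled_h = round(orig_w * factor), round(orig_h * factor)
    if scaled_w > canvas_w or scaled_h > canvas_h:
        # Padding cannot help here; the video would need cropping or a larger canvas.
        raise ValueError("scaled video does not fit inside the canvas")
    padding = ((canvas_w - scaled_w) // 2, (canvas_h - scaled_h) // 2)
    return factor, (scaled_w, scaled_h), padding


# Example: a 1280x720 video at 0.05 cm/px, standardized to 0.1 cm/px on a 1024x1024 canvas.
factor, scaled_size, padding = physical_then_pad((1280, 720), 0.05, 0.1, (1024, 1024))
```

Note that option 1 generally changes the aspect ratio unless all source videos share one, while options 2 and 3 preserve it but depend entirely on the quality of the per-video scale metadata.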