Big Booms Montage Video...Curated by Short-Time Fourier Transforms?!?! - SuperTuxKart

5 mo ago, 16 views

I used Short-Time Fourier Transforms (i.e., STFTs) to highlight the mayhem of the last 4 months of racing, and I edited the entire video exclusively from the Manjaro command line. Here are the technical characteristics of the process:

  1. 63 source videos were processed, with a total duration of 1 day, 13 hours, 44 minutes and 44 seconds.
  2. Those files represent 4 months of racing footage, mostly from Grand Prix events.
  3. Using STFTs, 1,987 separate scenes were found in the source files and extracted as individual files, for a total duration of 1 hour, 26 minutes and 18 seconds after they were merged.
  4. The longest part of the process was the disk I/O to convert the video files into mono audio files, even though the temporary files were being written to "/tmp".
  5. The process to generate the montage only took a few hours.
  6. I watched the entire montage video, and it's action-packed and LOUD!!! 😎️
  7. On very rare occasions, the STFT matched a lot of rumble, and in a couple cases it matched a really, really LOUD kart! It also matched thunder, me crashing into gums, bananae, and sometimes bowling balls, sometimes lots of karts crashing, and in one spot, a 4-way swatter battle! I'm leaving it all in because now I can also claim that it's "A Critical Analysis on the Feasibility of Algorithmic Montage Video Editing Through the Use Of Short-Time Fourier Transforms," and that it's the dissertation of my PhD thesis :-p Just don't tell anyone that the real reason I edited the entire thing from the command line using Short-Time Fourier Transforms is because I'm too stupid to learn how to edit video properly by using a GUI video editing program...and too lazy 🤣️ I also like to think a smidgen outside the box :-p OK, maybe make that 50 smidgens :-p
  8. Only matches scoring at or above the 99.5th percentile per source video file (i.e., the top 0.5%) were selected for inclusion. Using a percentile threshold makes the most sense and is easy to understand, so I used that.
  9. If there happened to be a 3-second or less gap between any of the matches, those matches were coalesced into one time range. Each time range forms a scene in the montage video.
  10. To form the time range, 0.5 seconds before and 1.5 seconds after the detected time range were included for extra context. A time range could be as short as a single point in time, so the preroll and postroll make for a minimum time range of 2.0 seconds. In other words, the shortest scenes were 2.0 seconds long. If lots of explosions occurred, the scene could be several seconds long.
  11. When a race ends, an ending AI takes over the kart and continues slowly driving around the track, with the character either celebrating for finishing in the top half or looking sad. When those finished karts crash into bananas, they can pick up bombs just like normal racing karts. If a normal racing kart finishes with a bomb, that bomb remains on the kart and still counts down. Also, said bombs can still explode. A bomb can explode when its countdown ends, when the kart runs into a banana, or when 2 karts, each with bombs, run into each other (an event I like to call a "Prost!" 🤣️). Even the finished karts that exploded after the race were included in the montage because it's funny 🤭️ So yes, sometimes there were after-race shenanigans, and those were included 😎️ I like to call those "celebratory wrecks" 🤣️
  12. I was trying to find and isolate all occurrences of the reference explosion sound in the video files and use those clips to make an explosion montage video. The matching turned out a bit more creative than that, and it picked up some other things, too, including race starts, race ends, and rumble. I included it all, because science!!!
  13. This description wasn't written by AI, because AI is far too sensible! I'm teaching it how to be silly, though, and it's learning 🤭️
  14. The Short-Time Fourier Transform code was written by OpenAI's ChatGPT o1-preview model, after I gave it a rather extensive and detailed prompt...and it was one hell of a prompt!!! That code is written in Python.
  15. I wrote a bunch of weird wrapper code in Ruby to recursively process every file in a directory, store the results in an SQLite database, and wrap the command-line FFmpeg tool to extract the relevant scenes.
  16. I then used mkvmerge to merge all the files into one continuous video.
  17. I thoroughly enjoyed the madness, I mean cough cough, vetted the video, using mpv from the command line.
  18. At some point, I plan on making and releasing such a tool as open source, preferably all in Python, because Python seems to be the weapon of choice these days. I still enjoy Ruby, though :-p The tool has to be highly refined, user-friendly, and include a bunch more features to be truly amazing, though. I have some interesting ideas 😎️ If ChatGPT generates the tool, then the tool will also be open-prompt. Why? Because why the hell not! 😎️
  19. So, in other words, I think everybody should allow Short-Time Fourier Transforms to curate their montage videos...or a bunch of other digital signal processing techniques. Why? Because Make Algorithms Great Again! 🤣️
  20. Shoutout to other digital signal processing techniques that I tried, too; here are my notes:

Short-Time Fourier Transform: The best!!!
Non-negative Matrix Factorization: Really, really creative and interesting
Constant-Q Transform: Really damn good
Mel-Spectral Frequency Decomposition: Good for creative matching; also, the name sounds awesome 😎️
Mel-Frequency Linear Predictive Coefficients: Quite creative
Mel-Frequency Cepstral Coefficients: OK
Discrete Wavelet Transform: slow: It matches tire squeals, karts crashin', basketballs bouncin', yeets, running into bananae and gums, and of course a few explosions. Huh??? And yes, sometimes in the SuperTuxKart community, we say "bananae" instead of "bananas"..."because yes"
Independent Component Analysis: Matches the start of each race?!?! Also, yeah, kinda weird

DISQUALIFIED:
Cohen's Class Distributions method using the Choi-Williams distribution variant: BUSTED; apparently, the libraries included no longer provide the required functionality or the names changed for weird reasons. ChatGPT should be mad about this. In reality, it'll probably just rewrite the code with a "Certainly!" 🤣️
Continuous Wavelet Transform using the Mexican Hat variant: CRASHES due to OOM; takes a very long time to crash, too.
Gammatone Cepstral Coefficients: CRASHES due to OOM; takes a long time to crash, too.

Also, "CRASHES due to OOM" basically translates to "it crashed because it ran out of at least 50 GiB of physical memory 🤯️"
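
For anyone wondering what "matching a reference sound with an STFT" can look like, here's a minimal sketch in plain Python. It is NOT the o1-preview code from item 14. The function names (stft_magnitudes, match_scores), the frame size, the hop, the synthetic test signal, and the cosine-similarity scoring are all assumptions made up for illustration, and the hand-rolled DFT is far too slow for real footage; a real tool would lean on an FFT library.

```python
import cmath
import math


def stft_magnitudes(samples, frame_size=64, hop=32):
    """Magnitude spectrum of each Hann-windowed frame (one list per frame)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    bins = frame_size // 2 + 1  # keep only the non-negative frequencies
    # Precompute the DFT basis once; a real tool would call an FFT instead.
    basis = [[cmath.exp(-2j * math.pi * k * n / frame_size)
              for n in range(frame_size)] for k in range(bins)]
    spectrogram = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        spectrogram.append([abs(sum(x * b for x, b in zip(frame, row)))
                            for row in basis])
    return spectrogram


def match_scores(audio_spec, ref_spec):
    """Cosine similarity between the reference and each window of the audio."""
    ref = [v for frame in ref_spec for v in frame]
    ref_norm = math.sqrt(sum(v * v for v in ref))
    scores = []
    for offset in range(len(audio_spec) - len(ref_spec) + 1):
        seg = [v for frame in audio_spec[offset:offset + len(ref_spec)]
               for v in frame]
        seg_norm = math.sqrt(sum(v * v for v in seg))
        dot = sum(a * b for a, b in zip(seg, ref))
        scores.append(dot / (seg_norm * ref_norm)
                      if seg_norm and ref_norm else 0.0)
    return scores


# Hypothetical demo: a quiet 200 Hz hum with a loud 1500 Hz "boom" mixed in.
RATE = 8000
boom = [math.sin(2 * math.pi * 1500 * n / RATE) for n in range(256)]
audio = [0.1 * math.sin(2 * math.pi * 200 * n / RATE) for n in range(2048)]
for n, value in enumerate(boom):
    audio[1024 + n] += value

scores = match_scores(stft_magnitudes(audio), stft_magnitudes(boom))
best = max(range(len(scores)), key=scores.__getitem__)
print(f"best match near {best * 32 / RATE:.3f} s, score {scores[best]:.3f}")
```

Each score says how much a stretch of the audio "looks like" the reference in the time-frequency plane, which is also why loud karts, thunder and rumble can sneak in: anything with a similar spectral shape scores well.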

Also, "Applying LPC to MFCCs is not a standard practice, but it serves as an approximation for the purpose of this implementation" - ChatGPT o1-preview

In other words, MF-LPC is a Frankenstein's monster!
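
Items 8 to 10 (percentile threshold, 3-second coalescing, 0.5 s preroll and 1.5 s postroll) plus the FFmpeg and mkvmerge steps from items 15 and 16 can be sketched roughly like this. This is a Python illustration, not the Ruby wrapper I actually wrote: every function name, the linear-interpolation percentile, the clamp at 0 seconds, and the choice of stream copy ("-c copy") are assumptions.

```python
import math


def percentile(values, pct):
    """Percentile (0..100) of a non-empty list, with linear interpolation."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def select_scenes(matches, pct=99.5, max_gap=3.0, preroll=0.5, postroll=1.5):
    """matches: list of (time_in_seconds, score). Returns (start, end) pairs."""
    if not matches:
        return []
    cutoff = percentile([score for _, score in matches], pct)
    times = sorted(t for t, score in matches if score >= cutoff)
    ranges = []
    for t in times:
        if ranges and t - ranges[-1][1] <= max_gap:
            ranges[-1][1] = t        # gap of 3 s or less: extend the scene
        else:
            ranges.append([t, t])    # otherwise start a new scene
    # Pad each scene; clamping at 0 is an assumption, and a real tool would
    # also clamp the end to the video's duration.
    return [(max(0.0, a - preroll), b + postroll) for a, b in ranges]


def ffmpeg_cut_args(source, start, end, output):
    """FFmpeg arguments to cut one scene without re-encoding (assumed)."""
    return ["ffmpeg", "-ss", f"{start:.3f}", "-i", source,
            "-t", f"{end - start:.3f}", "-c", "copy", output]


def mkvmerge_append_args(parts, output):
    """mkvmerge arguments that append every part into one output file."""
    args = ["mkvmerge", "-o", output, parts[0]]
    for part in parts[1:]:
        args += ["+", part]
    return args


# Hypothetical match list; pct=50 only because the toy list is tiny.
matches = [(1.0, 0.10), (2.0, 0.20), (10.0, 0.90),
           (12.5, 0.95), (20.0, 0.92), (30.0, 0.15)]
for number, (start, end) in enumerate(select_scenes(matches, pct=50)):
    print(ffmpeg_cut_args("race.mkv", start, end, f"scene_{number:04d}.mkv"))
```

A lone match becomes a 2.0-second scene, and because matches more than 3 seconds apart stay separate, the padded scenes never overlap. Stream copy cuts on keyframes, so the real cut points may land slightly off the requested times.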
