As I have mentioned in previous posts, I am working on a side project where I work with podcast feeds. Part of the idea is being able to act as a sort of middleware for podcasts. Give the application your podcasts, and you will be able to do things like search transcripts and, if a podcast has a long intro on each episode, remove it. Finding where someone mentioned a thing in an episode is hard to do by just scrolling through the audio; if computer transcripts are 95% effective, that is better than nothing. As for trimming, I find that when you go through a big backlog of podcasts, hearing the same 1-minute intro every 30 minutes takes a good chunk of time.
Since I use Java a lot for work and am comfortable there, I wanted to make this project with Dropwizard and React. The first bit of the project has been working on the audio recognition engine, which will be a whole post in and of itself. After that I needed to start gathering the supporting libraries I wanted. I tend to try to make as much of my code native to the language I am using as possible, which here means doing as much in Java itself as I can. There are a ton of libraries that call out to FFMPEG or a command line app to handle the feeds; I don’t want to do that. If a side effect of this project is helping the community and writing some additional libraries, that is a win in my book.
The first library I needed was one to read AND write podcast feeds. With this app being middleware, we need to be able to do both. I found MarkusLewis‘s Podcast-Feed-Library, which works great for reading feeds in but does not support writing. I took a look at his library and architected mine similarly, except I added the ability to get your feed object and then write it out again. In the end I made https://github.com/daberkow/PodcastFeedHandler. This library is written entirely in Java with no dependencies. Using Java 11, I have all the native XML parsing I need. The rest of my project is in Java 17, but I thought others may find 11 useful. I am not aware of anything in the library that would stop me from targeting a lower version, but it’s 2023, and 11 is an older LTS at this point. An exciting part of that sub-project was getting Maven publishing working. Now I can publish under my domain of ntbl.co.
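For anyone wondering what "native XML parsing" can look like, here is a minimal sketch of a read, modify, write round trip using only classes that ship with the JDK. To be clear, this is not the PodcastFeedHandler API; the class and method names (`FeedRoundTripSketch`, `readFeed`, `setChannelTitle`, `writeFeed`) are made up for illustration, and it assumes a typical RSS layout where the channel title is the first `title` element in the document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

// Hypothetical illustration only; not the PodcastFeedHandler API.
class FeedRoundTripSketch {

    /** Parses feed XML into a DOM tree using only the JDK's built-in parser. */
    static Document readFeed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // feeds commonly use prefixes such as itunes:
            // Feeds come from servers we do not control, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse feed XML", e);
        }
    }

    /** Replaces the text of the first title element (assumed to be the channel title). */
    static void setChannelTitle(Document feed, String newTitle) {
        Node title = feed.getElementsByTagName("title").item(0);
        if (title == null) {
            throw new IllegalArgumentException("Feed has no <title> element");
        }
        title.setTextContent(newTitle);
    }

    /** Serializes the DOM tree back to an XML string. */
    static String writeFeed(Document feed) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(feed), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize feed", e);
        }
    }
}
```

A middleware would call `readFeed` on the upstream feed, adjust what it needs (for example, pointing enclosure URLs at trimmed audio), and hand the result of `writeFeed` to the podcast client.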
This project also got me used to GitHub Actions. I have used CircleCI before but thought I would try GitHub Actions, as they give you unlimited runtime for public repositories. Thanks Microsoft! I have the library build, get signed, and upload via Actions. I wanted to make sure the library performed as I wanted, so I reached out to JetBrains to get an Open-Source maintainer license for IntelliJ. They kindly approved me!
The next part of the project was parsing and fingerprinting the audio to search for duplicate segments. I will get more into that at a later time. To be able to fingerprint, I needed the audio in WAV/PCM format. Podcasts tend to be MP3 or AAC files. There are a ton of libraries to convert media in Java, but most of them have an external FFMPEG dependency. That is something I wanted to avoid. By having 100 percent native code, I can more easily create the workers that will handle these duties. Anywhere Java can run, or be compiled to, the workers can run too, with no external dependencies to install.
I found nwaldispuehl‘s java-lame, a copy of the fantastic native Java port of LAME, described as “This java port of LAME 3.98.4 was created by Ken Händel for his ‘jump3r – Java Unofficial MP3 EncodeR’ project: http://sourceforge.net/projects/jsidplay2/“. The library hasn’t been updated in a while but does everything I need. It can convert MP3s, but it needs to be given a file location before it will convert to a byte array. I do not want to have to write to disk. The workflow would be: download podcast, store on disk, read from disk, convert. We should be able to do all this in memory. Doing all these operations in memory also means the workers do not need a bunch of scratch disk space, which is a plus. It’s more memory intensive but cuts down on disk usage. In 2023, I would rather have a slightly more memory-intensive application than be doing a ton of extra reads and writes to SSDs.
Throughout this project I have been asking myself: if I use this a lot, or have friends using the web app, and it is constantly reading and writing audio files, how can I minimize bottlenecks? I forked the GitHub repo for java-lame, then added code paths that allow in-memory MP3 feeding and processing. This allows me to add an S3 client to the workers and work on those files natively without ever writing to disk.
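The shape of that disk-free workflow can be sketched roughly as follows. This is only an illustration: `Mp3ToPcmDecoder` is a hypothetical interface standing in for whatever the decoder exposes, not the real java-lame API, and `InMemoryPipelineSketch` and `convertInMemory` are made-up names. The source stream could come from an HTTP download or an S3 client; all the sketch assumes is that it is an `InputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for an MP3 decoder; NOT the real java-lame API. */
interface Mp3ToPcmDecoder {
    void decode(InputStream mp3, OutputStream pcm) throws IOException;
}

class InMemoryPipelineSketch {

    /**
     * Reads the whole MP3 from the source stream into memory, decodes it,
     * and returns the PCM bytes. Nothing is written to disk along the way.
     */
    static byte[] convertInMemory(InputStream source, Mp3ToPcmDecoder decoder) {
        try (InputStream in = source) {
            // Buffer the download first so the network connection is released
            // before the (slower) decode step starts. readAllBytes is Java 9+.
            byte[] mp3 = in.readAllBytes();

            ByteArrayOutputStream pcm = new ByteArrayOutputStream();
            decoder.decode(new ByteArrayInputStream(mp3), pcm);
            return pcm.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off is the one described above: the MP3 and the decoded PCM both sit in RAM at the same time, in exchange for the worker needing no scratch disk at all.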
This library has a lot more functionality than I am using. It is a full LAME port, including the command line system and processing. I am planning to remove that as I go to shrink the library. I also want to rework some of the core WAV/PCM conversion so that it compresses its output in memory, and add functions to chunk the files and process them piece by piece.
I took a This American Life episode, 1 hour in length, 67MB as MP3. Converting it to the WAV/PCM I needed created a 678MB file, about a 10x size difference. Compressing that data losslessly with standard ZIP compression got the file down to 437MB, about 65% of the size of the original WAV/PCM. I can retrieve the ZIP data as a stream, and since it is audio that I read from start to finish, I am not jumping around, so that works well for me. 678MB for a file doesn’t sound so bad; a worker then just needs 1GB of RAM or so to process it, right? My worry is other podcasts. Shows like Dan Carlin’s Hardcore History can easily be 5 hours per episode. That is a 200+MB MP3, which would then need 2-3 GB of RAM to process one episode. If I can take 35% off for relatively small compute overhead, I want to.
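To make the numbers above a bit more concrete, here is a small sketch of two things: estimating how big uncompressed PCM gets, and compressing it in memory so it can be read back as a forward-only stream. It assumes CD-style audio (44.1 kHz, stereo, 16-bit), which is my guess rather than something measured from the episode, and it uses the JDK's DEFLATE streams (the algorithm used inside standard ZIP files) as a stand-in for whatever ZIP tooling produced the 437MB figure. The names `PcmCompressionSketch`, `estimatePcmBytes`, `compress`, and `openStream` are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class PcmCompressionSketch {

    /** Size of raw PCM audio in bytes: seconds x sample rate x channels x bytes per sample. */
    static long estimatePcmBytes(long seconds, int sampleRateHz, int channels, int bitsPerSample) {
        return seconds * sampleRateHz * channels * (bitsPerSample / 8);
    }

    /** Losslessly compresses PCM bytes with DEFLATE, entirely in memory. */
    static byte[] compress(byte[] pcm) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(pcm);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /**
     * Opens the compressed data as a stream that inflates on the fly.
     * This suits audio that is read front to back: only a small window is
     * ever decompressed at once, but there is no cheap way to seek backwards.
     */
    static InputStream openStream(byte[] compressed) {
        return new InflaterInputStream(new ByteArrayInputStream(compressed));
    }
}
```

Under those assumptions, `estimatePcmBytes(3600, 44100, 2, 16)` comes out to 635,040,000 bytes for one hour, which is in the same ballpark as the 678MB file above, and five hours lands a little over 3GB. How much DEFLATE actually saves depends heavily on the audio itself, so the 35% figure should be treated as one data point rather than a guarantee.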
I will post more as I go through the project. If these libraries or the blog have helped anyone, feel free to drop a comment! I always appreciate it when people do.
(The photo is something I threw together on Bing Image Creator; it’s Java with audio 😊 )