The Power of Public Media

Published: 7 years ago

Adventures in HTML 5 Sound: A Developer’s Journey

by Bharat Battu, @bharat_battu

(Too much? Here's the TL;DR summary)
Updated 5/7/2014: Loading Time Tests
DISCLAIMER: I’m not an expert in web audio by any means. My findings and insights below are based purely on my own experiences in dealing with sound in my own HTML 5 projects. I am not claiming them to be guaranteed, rigorously-tested results and suggestions that anyone should adhere to. These are simply the notes, musings, and opinions of one developer of children’s educational games, making a switch from the comfy silo of Flash to the frontiers of the wild wild web. Comments, questions, concerns, corrections, outrages? Sound off in the comments ☺

Now well into 2014, it’s not surprising that like many other producers of interactive web content, my game development has migrated from Flash to HTML 5. The landscape in web development has shifted to predominantly favor ease, convenience, and effortlessness for users: to be able to experience web content instantly and consistently, on whichever device they happen to be using, wherever they are, and without having to install and maintain an external plug-in. HTML 5 allows rich interactive content (like games) to be created using javascript and run directly in-browser, thanks to the capabilities baked into modern web browsers. Since winter 2013, all of the games I’ve developed for WGBH Digital’s kids group have been in HTML 5, relying mostly on the CreateJS suite of libraries, which allow the creation of robust interactives via HTML 5 canvas. The suite has handled the rendering, tweening (animation), and preloading of assets in my interactives simply and mostly consistently across browsers on both desktop and mobile. Other HTML 5 game devs may have other frameworks or technologies they prefer, but I assume I am not alone in claiming that getting graphics to display and do things on a webpage doesn’t take the crown for Biggest HTML 5 Headache.

I award that honor to sound. I’ve found that sound with HTML 5 hasn’t been nearly as simple as picking a framework or approach and having it work perfectly on the first try everywhere you need it to. There is a lot of variation in how different browsers and devices manage audio, and my experience building five different HTML 5 interactives over the past 15 months has allowed me to explore and experiment across several particular aspects of handling audio playback. I’ve encountered some notable quirks and pitfalls along the way, and it’s certainly been enlightening. I’ve come to acknowledge that HTML 5 sound takes a lot more than a single play-this-sound call, which is essentially how easy it was with Flash. Here is a chronicle of my journey so far, with key thoughts and takeaways I’ve gathered from each project:

“Never done this before, let’s see what happens!”

1) Martha’s True Stories (Feb. – Aug. 2013):

For my first HTML 5 project, Martha’s True Stories, I settled with using SoundManager2 (SM2) for handling audio playback. Why? Because at the time we began development (Feb. 2013), although it wasn’t the only sound framework I found that was being actively developed, used on the web, and known to work across desktop and mobile, it offered a feature that was necessary for this project in particular: a built-in event dispatcher that regularly fired off the current time position of the actively playing sound. An implementation like this was essential for word-by-word highlighting, a required feature for the 12 narrated stories we were building.

SM2 provided its HTML 5 audio playback via the “audio tag” approach (creating audio tag elements directly in the page HTML for each sound file being used). It supported the use of a single audio “sprite”: one audio file that contains all of the different sounds available for playback. Distinct sounds are played by seeking to identified time positions within the audio sprite. Kind of like rewinding and fast-forwarding a cassette tape to reach a song you want to hear, the process is not precise and isn’t instantaneous (issues that became really problematic in a later game, see Build-a-Bot below).

Key findings and takeaways from my very first HTML 5 project:

-Audio playback on mobile is only possible if the audio’s loading is initiated in direct response to a “touch event”. This may seem like a straightforward technical limitation, but it has altered the way we’ve thought about structuring the user experience for HTML 5 interactives on mobile. The constraint resulted in these stories being designed with intro title screens that users had to tap to begin the loading process (instead of having title audio heard immediately while loading progresses automatically on opening the page, as was custom with earlier Flash games for Martha Speaks). For the stories, we settled on requiring a click/tap to start sound loading on all platforms (to accommodate the mobile-specific restriction).
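For illustration, the tap-to-start pattern boils down to deferring all audio work until inside a touch handler. A minimal sketch (the function and element names here are hypothetical, not from our actual code):

```javascript
// Hypothetical sketch: defer all audio loading until the user's first tap.
// "beginLoading" stands in for whatever kicks off the preload queue;
// the one-time touch listener is the part that matters.
function armTapToStart(titleScreen, beginLoading) {
  function onFirstTouch() {
    // Remove the listener so loading is only triggered once.
    titleScreen.removeEventListener("touchend", onFirstTouch);
    // Because this runs inside a touch handler, mobile browsers will
    // permit the audio loading/playback it initiates.
    beginLoading();
  }
  titleScreen.addEventListener("touchend", onFirstTouch);
}
```

The same listener could just as easily be bound to "click" for a unified desktop/mobile title screen, which is effectively what the stories did.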

Creating audio sprites by hand. *bangs head against wall*

-Using audio sprites with SM2 was beneficial, as the combination allowed us to know the active word while a sentence was being read, thanks to SM2’s awareness of the current position in the playing sound. However, this required us to identify the start and end timestamps for every word of every sentence in all 12 of our stories. At the time of production, this was a manual process and a major time-sink: it involved combining distinct pieces of audio into a single file in the audio program Audacity (with adequate spacing between pieces), applying labels to each piece, and finally migrating the exported labels and time values into our code. This extremely laborious process had to be repeated multiple times per story, since the regular additions/removals/swaps made to scripts and sound effects over the course of production meant frequently having to modify audio sprites, readjust labels, and re-export new timestamps into code. Certainly a less-than-ideal workflow that made me hate working with audio sprites, and long for something better… (see “Work Smarter, Not Harder” below)
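To give a sense of what that migrated label data looked like in practice, here’s a hypothetical sketch (the words and timestamps are made up; the real data came from Audacity’s exported labels):

```javascript
// Hypothetical word-timing data for one sentence (made-up values; the
// real numbers came from labels exported out of Audacity, in seconds).
var wordTimings = [
  { word: "Martha", start: 0.00, end: 0.62 },
  { word: "loves",  start: 0.70, end: 1.10 },
  { word: "soup",   start: 1.18, end: 1.75 }
];

// Given the sprite's current playback position (the value SM2's progress
// events report while a sound plays), find the word to highlight.
function wordAt(positionSeconds) {
  for (var i = 0; i < wordTimings.length; i++) {
    var t = wordTimings[i];
    if (positionSeconds >= t.start && positionSeconds <= t.end) {
      return t.word;
    }
  }
  return null; // between words, or outside the sentence
}
```

Every script change meant regenerating tables like this one per sentence, which is why the manual workflow got old fast.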

“Gotta get this out the door… take what worked before!”

2) Pumpkin Boo! (Sept. – Oct. 2013):

Despite the headaches associated with audio sprite work, the SM2 implementation for Martha’s True Stories worked. When super-quick production on Pumpkin Boo! (a Curious George game) was set to begin shortly after, I relied on the same sound backbone of SM2 + audio sprites that drove the stories so focus could be directed at other aspects of the game.

Pumpkin Boo! had different audio needs than Martha’s True Stories (a simple action game vs. narrated stories with word-by-word highlighting), and reusing the SM2 implementation from the first project raised different issues. The work of creating and using an audio sprite for Pumpkin Boo! was not as cumbersome (though still a lengthy manual process), but the SM2 implementation as built for True Stories meant only one sound could play and be heard at any time. Enabling simultaneous playback of different sounds from a single audio sprite source didn’t appear to be possible in the SM2 implementation as-is (or at least wasn’t implementable in the short timeframe we had), which was unfortunate for a game containing sound effects and narrator audio triggered at any moment by spontaneous player actions. A newly triggered sound would always cut off any sound already playing. This was definitely a sore point, and it made simultaneous playback of multiple sounds a priority for future action-type games.

However, one improvement Pumpkin Boo! incorporated over its predecessor was differentiating the initial loading experience by platform. Simple device/platform detection let the game load and begin instantly on page load for desktop users (the ideal experience), while mobile users were presented with a tappable title screen to ensure the game and its sound loaded directly in response to a touch event, as required.
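The detection itself can be as simple as a user-agent check. A rough sketch, assuming a regex-based sniff (our actual detection logic may have differed):

```javascript
// Rough user-agent sniff of the kind described above: treat known mobile
// platforms as touch devices that need a tap-to-start title screen.
// (A sketch only; real-world detection tends to be messier than this.)
function needsTapToStart(userAgent) {
  return /Android|iPhone|iPad|iPod/i.test(userAgent);
}

// In the page itself, something along these lines decides the flow:
//   if (needsTapToStart(navigator.userAgent)) { showTitleScreen(); }
//   else { beginLoading(); }  // desktop: load and start immediately
```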

“OK, what might’ve worked before is reaching its limits…”

3) Build-a-Bot (Dec. 2013 – Feb. 2014):

Moving on to building my third HTML 5 game, my natural inclination was to build upon what I had utilized in the past and do even more with it. With Build-a-Bot, we were creating a game that had players creating different kinds of robots by attaching various heads, arms, and legs onto a body base. The assortment of body parts was limited to six of each part type, but even that resulted in lots of possible robot permutations in the game’s freeplay “sandbox” mode. The game’s script had the narrator task players with building robots that could accomplish certain tasks, and then had the narrator announce what the player’s specific build was doing when activated to run. This meant the game’s audio was going to contain lots of sentences starting with consistent, repeated starting phrases (such as “Help George build a robot that can…” or “That robot can…”) followed by concluding phrases that described the action being done (“… hammer a nail!”).

In the hopes of improving load times, my idea for this game was to minimize the amount of dialogue audio to have to add to the audio sprite. Why bulk up the audio sprite by including several complete sentences of audio that begin with the same intro phrases, when we can try piecing together single instances of starting phrases with appropriate conclusion dialogue? A large quantity of full sentences in the audio sprite could instead be replaced with a handful of unique starting and ending blocks of narration that can be stitched together dynamically as needed in-game! With this in mind, we recorded the game’s narrator (the Man With the Yellow Hat) reading the starting stems of these sentences just once, and then had him read each action-specific concluding stem one time only. This may have reduced the size of the game’s audio sprite, but it introduced other problems that left me needing to rethink my practices for working with HTML 5 audio.

As I mentioned earlier, using a single audio sprite for playback of all of a game’s sound assets is akin to working with an audio cassette tape. Seeking around the sprite is an imprecise, non-instantaneous operation. In Build-a-Bot’s attempt to create complete narrated sentences by playing different sections of the audio sprite in succession, my original goal was to make things as seamless as possible by having conclusions play immediately after a starting stem finished being read. This didn’t work as I’d hoped. Initially, this kind of rapid seeking and sequenced playback within the audio sprite with SM2 simply broke the game’s audio on mobile devices: after a few rounds of play, the audio would lose its place and start playing the entirety of the audio sprite from its beginning. This was an unacceptable game-breaker, and it ultimately took significant additional work and testing to get the game working with the previously used SM2 implementation. The downside was that I had to add small time delays between the firings of each sound playback in every dynamically created sentence. This ensured the audio backend didn’t trip on itself, but it unfortunately resulted in noticeable gaps between the structural pieces of narrated sentences in-game (you can hear it for yourself: “That robot can… catch a ball… and… brush teeth”). This shortcoming, plus the recurrence of the inability to play separate pieces of audio simultaneously, made changing my approach to HTML 5 sound an absolute must for my next projects.
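The delay workaround amounts to computing a fire time for each stem that includes a small safety buffer after the previous stem ends. A simplified sketch (the 150 ms gap below is illustrative, not the value we actually shipped):

```javascript
// Sketch of the workaround described above: schedule each stem of a
// dynamically built sentence with a small buffer after the previous stem
// finishes, so rapid seeks within the sprite don't trip up the backend.
// Durations and the gap are in milliseconds.
function scheduleStems(durations, gapMs) {
  var fireTimes = [];
  var t = 0;
  for (var i = 0; i < durations.length; i++) {
    fireTimes.push(t);           // when to trigger playback of stem i
    t += durations[i] + gapMs;   // wait for it to finish, plus the buffer
  }
  return fireTimes;
}

// e.g. three stems of 1200, 800, and 400 ms with a 150 ms gap:
// scheduleStems([1200, 800, 400], 150) -> [0, 1350, 2300]
// Each fire time would then drive a setTimeout that seeks-and-plays.
```

Those buffer values are exactly the audible gaps in “That robot can… catch a ball…”, which is the trade-off this approach forces.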


“Handling sound BETTER is priority one!”

4) Martha’s Steaks (Dec. 2013 – Mar. 2014):
5) Night Light* (Mar. 2014 – present)

(*Launching soon, links to be provided)

Coming into a new year and new projects, my colleagues and I wanted to make strides in the implementation of HTML 5 sound in light of the shortcomings from previous projects. I decided to shed SoundManager2 and switch to a different sound framework: not just to get past the issues discovered in my last projects, but also to get away from using audio sprites (at least at first…) and their laborious production process. For two new games, Martha’s Steaks and Night Light, my goal was to focus on and utilize SoundJS, the sound library that is already part of the CreateJS suite I use for the other facets of my game construction. Getting up and running with SoundJS was straightforward, and I appreciated how it integrated nicely with the preloading I was already accustomed to with PreloadJS. And it seemed to work better than SM2, at least as I was comparing the implementations of audio in these new games with my three prior projects.

Benefits I found included:

  • Use of multiple audio file assets: every distinct sound effect or line of dialogue could be delivered and used as its own audio file. I was freed from the shackles of painstakingly creating and RE-creating audio sprites and timecode data!
  • SoundJS’s documentation made it clear how easy it is to play back simultaneous sounds! “SoundInstance” objects essentially act as discrete sound channels. Creating one specific SoundInstance for narrator audio and another for sound effects (and yet another for ambient looping background noise in the case of Night Light) allowed different kinds of sounds to play as needed, without having to abruptly cut off playing sounds of a different sort.
  • General setup (configuration, instantiation) of SoundJS was cleaner and simpler than SM2. It made it easy to designate audio playback via three different APIs (WebAudio, HTML audio tags, and Flash), including prioritization for automatic fallbacks based on browser capabilities. The more modern (but less backward-compatible) WebAudio API seems to perform better than the more compatible HTML audio tag method: WebAudio allows a greater number of simultaneous audio plays and (perceivably) appeared to load audio quicker. So WebAudio is prioritized at the top of the list of plugins (which is the default for SoundJS).
  • SoundJS also allowed combining the preloading of my sound assets alongside images in a single preload “queue” (helped clean up the structure of preloading code and improved the advancing of animated loading bars).

A couple of shortcomings (with one light at the end of the tunnel!):

The availability of three different audio APIs (and the ease of defining prioritization amongst them) came in handy as one other HTML 5 quirk reared its head. During the final QA of Martha’s Steaks, our team found that many sounds weren’t playing in Internet Explorer 9 and 10 when using the HTML audio tag implementation. Diving into the issue, I found that the large number of distinct audio assets (over 120) meant just as many audio tags were being created on the HTML page containing the game. These were too many for IE, which I learned has a browser-specific limit on the number of audio tags that can be created on a given page. The result was that IE would only load the number of audio assets it could support as audio tags (a number I couldn’t confirm, and which may be OS- and hardware-specific), and all unloaded sounds would simply never play. Not OK for a game that otherwise ran identically in IE compared to other supported browsers and platforms. The “solution”? I set these two games to use the Flash-based audio playback offered by SoundJS for IE users who already had Flash installed, with HTML audio as the fallback option. We found that using the Flash audio API in IE resulted in all sounds loading and playing, consistently and as intended. It may seem a bit counterproductive to use Flash for new HTML 5 games, but it’s not required, and we’re only relying on it to offer the most optimal experience for a certain subset of users, if those users already have it available.
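The decision reduces to choosing a per-browser plugin priority list. A hypothetical sketch (the returned names map onto SoundJS plugin classes such as createjs.WebAudioPlugin; how isIE and hasFlash get detected is assumed here, not shown):

```javascript
// Sketch of the plugin-priority decision described above. The names
// returned correspond to SoundJS plugin classes; actual registration
// would then hand the matching classes to createjs.Sound.registerPlugins.
// Detecting "isIE" and "hasFlash" is left out and assumed to exist.
function audioPluginPriority(isIE, hasFlash) {
  if (isIE && hasFlash) {
    // IE caps the number of audio tags per page, so prefer Flash
    // playback there, with HTML audio as the fallback.
    return ["FlashPlugin", "HTMLAudioPlugin"];
  }
  // Everywhere else: WebAudio first (SoundJS's own default), then tags.
  return ["WebAudioPlugin", "HTMLAudioPlugin"];
}
```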

The QA process for Night Light also shed light on another vital issue: loading times. By switching to SoundJS and its support for multiple audio file assets, the sheer number of assets being loaded for these two games skyrocketed compared to my past projects. Martha’s Steaks had 121 unique .mp3 audio assets being loaded and used… just for Martha’s in-game dialogue! Both of these games had especially long loading times, which were noticed by our team and mentioned by our test users. This is compared to my previous approach of using a single audio sprite with SoundManager2. That older process had all the headaches and repetitive work of creating and modifying audio sprites, but it did result in better loading times, thanks to the early games’ audio consisting of only one file to be fetched and loaded from a web server. I wanted to improve the loading times of these new games while sticking with SoundJS to keep the worthwhile benefits it offered. That’s when I remembered reading a recent CreateJS blog post that announced experimental support for audio sprites within SoundJS. This new feature, as well as the active discussions readers were having in the blog comments, has without a doubt changed (yet again) how I handle HTML 5 sound from this point forward…


“Work Smarter, Not Harder” – invested in building a workflow that allows use of auto-generated audio sprites and proper-syntax timestamp code with the newest SoundJS. Worth it.

Learning that reverting to an audio sprite would likely improve my games’ loading times, I got chills thinking about switching back to creating audio sprite files by hand. But the discussions in the blog post above led me to check out Audiosprite, an automation tool that has taken the pain out of using audio sprites by generating an audio sprite file (in various formats and with proper spacing between sounds) from any number of discrete sound assets. It also exports the necessary time values in JSON (a clean format usable in code). The only additional piece I had to add to the workflow was converting the timestamp data output by Audiosprite into the syntax used by SoundJS. I automated this conversion by creating a simple JSON converter, which I’ve shared as a JSFiddle. And thankfully, the latest SoundJS-NEXT has made working with sound assets contained in a single audio sprite virtually identical to working with those assets delivered as discrete audio files.
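To show what that conversion involves, here’s a sketch along the lines of what my JSFiddle snippet does: going from Audiosprite’s exported JSON (start/end values in seconds, per sound) to the audioSprite entries SoundJS NEXT expects (startTime/duration in milliseconds, as I read its docs). The example sound names and times are made up:

```javascript
// Sketch of the Audiosprite -> SoundJS NEXT timestamp conversion.
// Audiosprite's "spritemap" gives start/end in seconds per sound id;
// SoundJS NEXT's audioSprite data wants startTime/duration in ms.
function audiospriteToSoundJS(spritemap) {
  var sprites = [];
  for (var id in spritemap) {
    var s = spritemap[id];
    sprites.push({
      id: id,
      startTime: Math.round(s.start * 1000),
      duration: Math.round((s.end - s.start) * 1000)
    });
  }
  return sprites;
}

// Made-up example of Audiosprite output for two sounds:
var spritemap = {
  intro: { start: 0,   end: 1.5,  loop: false },
  cheer: { start: 1.8, end: 2.45, loop: false }
};
// audiospriteToSoundJS(spritemap) ->
//   [{ id: "intro", startTime: 0,    duration: 1500 },
//    { id: "cheer", startTime: 1800, duration: 650 }]
```

The resulting array would then be passed along when registering the sprite file with SoundJS, so each sound id plays its own slice of the single file.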

Once I created a streamlined workflow with Audiosprite and my snippet above, converting the games Martha’s Steaks and Night Light from massively multi-file (121 and 48 discrete audio assets, respectively) to single audio sprite games took only minutes. Performance and sound playback seemed to remain the same after the switch, and the benefits in shortened loading times were immediate, significant, and very awesome:

Loading Time Tests:

My informal loading time tests for Martha’s Steaks and Night Light, with two versions of each game running on the same internal WGBH web server (different URLs), both using SoundJS 0.5.2 for audio and completely identical other than the use of discrete audio files for every narrator sound asset vs. a single audio sprite for all narrator audio:

New Tests (5/7/2014):
*all loading times indicate time from beginning of loading PreloadJS queue to completion of queue load. This includes sound and all image assets.

Night Light:
A: “early sound”: 48 discrete audio assets (SoundJS) for narrator dialogue
B: “new sound”: 1 audio sprite file (SoundJS NEXT) for narrator dialogue

1) Macbook Pro with OS X 10.9.2 and Google Chrome 34.0.1847.131
Testing both game URLs being opened simultaneously in separate ‘private’ windows.

Ethernet connection to WGBH office internet connection
A: 12 sec for loading to complete
B: 7 sec for loading to complete

Macbook Pro Tethered (via WiFi adhoc with Google Nexus 5) to T-Mobile LTE connection:
A: 73 sec for loading to complete
B: 45 sec for loading to complete

2) Google Nexus 5 via cellular LTE (T-Mobile) in Boston, MA area:
Chrome 34.0.1847.114, Android 4.4 Kitkat
A: 32 sec
B: 27 sec

Martha’s Steaks:
A: “early sound”: 121 discrete audio assets (SoundJS) for narrator dialogue
B: “new sound”: 1 audio sprite file (SoundJS NEXT) for narrator dialogue

1) Macbook Pro with OS X 10.9.2 and Google Chrome 34.0.1847.131
Ethernet connection to WGBH office internet connection
A: 4-5 sec
B: 3 sec

Macbook Pro Tethered (via WiFi adhoc with Google Nexus 5) to T-Mobile LTE connection:
A: 89 sec for loading to complete
B: 35 sec for loading to complete

2) Google Nexus 5 via cellular LTE (T-Mobile) in Boston, MA area:
Chrome 34.0.1847.114, Android 4.4 Kitkat
A: 63 sec
B: 51 sec

Original Tests
Martha’s Steaks:
SoundJS discrete files (121 .mp3 files for narrator): 42-46 seconds
SoundJS audioSprite (single .mp3 file for narrator): 30-32 seconds

Night Light:
SoundJS discrete files (48 .mp3 files for narrator): 45 seconds
SoundJS audioSprite (single .mp3 file for narrator): 22-30 seconds

Both games essentially had loading times cut down by a third (or more) by switching to an audio sprite. May not seem like much, but shaving 15+ seconds off a loading time could be what gets a kid playing your game soon enough that they won’t decide it’s “broken” and move on to something else.

So… what?

So, in concluding this lengthy dive into my journey so far with HTML 5, I want to reiterate that I am not claiming whatsoever that I’ve figured out some “secret sauce” that will make sound in HTML 5 interactives work perfectly, every time and everywhere. Not anywhere close to that. I hope that I’ll have many more opportunities to continue creating interactive stuff in HTML 5 so that I can continue to grow my experiential knowledge to draw from. But I am glad to see that my own work with HTML 5 audio to date has led me to acquire evidence-backed preferences for a specific sound framework (SoundJS) and approach to handling sound (packing multiple sound assets into a single, TOOL-GENERATED audio sprite). This combination has allowed the audio in my latest games to seemingly work everywhere we require, to play multiple audio clips simultaneously without interruption, and has reduced the audio-related loading times users will have to wait through. I’m looking forward to discovering and (hopefully) conquering more sound-related pitfalls as my adventures in HTML 5 development continue!

TL;DR  (“too long, didn’t read” summary)

First used SoundManager2 and audio sprites when beginning HTML 5 game development. SM2’s shortcomings and headaches in the workflow of creating and using audio sprites resulted in switch to the SoundJS library. SoundJS proved to be more robust and reliable (for me), and allowed convenient use of multiple sound files (unique file for each sound asset), with effortless simultaneous playback of different sounds. Using multiple audio files resulted in detrimental increases in loading times, however. Very recently, discovered the NEXT build of the current (0.5.2) SoundJS library, which has new (experimental) support for audio sprites. Also discovered Tonistiigi’s Audiosprite tool to automate creation and simplify use of audio sprites (with help of this fiddle). End result—use of latest SoundJS NEXT has delivered consistency in sound playback, along with significant reductions in loading times thanks to use of audio sprites. Best of both worlds, for now. As is the nature of working with sound in HTML 5, a lot could change tomorrow.

  1. OJay says:

    Hi Bharat, this was an interesting read with useful insights.
    It’s rewarding to see SoundJS working well on real-world projects.

    One thing I would suggest is adding which device you tested on to show the difference in loading times from discrete files to single audio sprite. In theory, it should have the largest impact on mobile phones connecting to websites through a cell tower.

    • Bharat Battu says:

      Thanks Ojay, and thanks to you and the whole team behind CreateJS – it’s been really instrumental in making the Flash-to-HTML 5 leap.

      I’d be glad to expand upon the informal loading time test numbers, sometime early next week. For now, I can quickly clarify that for “Martha’s Steaks” (launching in June), the loading times I’m reporting above were gathered from discrete audio vs. audioSprite (otherwise-identical) versions of the game that I loaded in Chrome for Android on a Nexus 5 with Android 4.4 (Kit Kat), online connection via cellular data (T-Mobile LTE in the Boston MA area). The reported differences in loading times for “Night Light” were also on a Nexus 5 on LTE cell data. I did see improved loading on desktop Chrome browser (OS X) for the audioSprite versions, but it wasn’t as big of a leap (and I was on a fast work ISP ethernet connection). I can do some more informal tests across browsers, devices, and connection types next week and report back here.
