Hideki Kenmochi, one of the developers at YAMAHA behind the VOCALOID sound synthesis engine, flew to Los Angeles to give an overview and short tutorial of the VOCALOID software at Anime Expo 2012. In his presentation, Mr. Kenmochi described at a high level how the VOCALOID synthesis engine actually works under the hood, laid out the workflow for creating a new VOCALOID sound bank, and pointed out the new features in VOCALOID3. He then gave three short demonstrations of how to use the software, introduced Job Plugins, and finally shared his philosophy on the use of VOCALOID.
With the audience seated, Kenmochi began his presentation by introducing himself, saying he had just flown in from Japan after giving a concert the previous day. His panel was originally scheduled for Saturday, but it had to be moved to Sunday for this reason. With his brief introduction done, he gave an overview of VOCALOID: it is a concatenation-based singing synthesizer developed by YAMAHA to create intelligible, smooth, and natural voices in an easy-to-use environment. Development started in March 2000, and the first prototype was shown at Musikmesse 2003 in Frankfurt. The first VOCALOID products came out in 2004, followed by VOCALOID2 in 2007 and VOCALOID3 in 2011. Prior to VOCALOID3, YAMAHA only licensed its software out to third-party developers. With VOCALOID3, however, YAMAHA started handling the release of the newly updated VOCALOID3 Editor itself, while third-party developers handled the sound banks. A Tiny VOCALOID3 Editor was included with the sound banks so that users could test the voices, since a standalone sound bank cannot generate sound.
Kenmochi then moved on to describing how the engine works. The basis of the engine is the concatenation of recorded syllables into new vocal lines. A sound bank consists of diphones and sustained vowels, totaling around 500 samples for Japanese and 2500 for English. With simple concatenation, the singing isn’t very smooth; it was, “in a sense, interesting” according to Kenmochi. Timbre smoothing is therefore performed in the frequency domain to smooth out transitions between samples. Another issue is that the start of a sample does not coincide with the start of a note; for example, the “sa” sample spends a lot of time on the “s” sound, but the start of a note sung with the syllable “sa” should align with the beginning of the “a” sound. Hence, the timing of samples has to be shifted for the sung notes to stay in rhythm with the song.
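The timing-alignment idea can be sketched in a few lines of Python. This is purely illustrative (the names and structure are assumptions, not the real VOCALOID engine): the point is that a sample must start playing *before* the beat so that its vowel onset, rather than its consonant, lands on the note.

```python
from dataclasses import dataclass

@dataclass
class DiphoneSample:
    name: str
    duration_ms: float
    vowel_onset_ms: float  # where the vowel begins inside the sample

def playback_start(note_start_ms: float, sample: DiphoneSample) -> float:
    """Shift the sample earlier so its vowel onset coincides with the note start."""
    return note_start_ms - sample.vowel_onset_ms

# A hypothetical "sa" sample: a long "s" before the "a" vowel begins
sa = DiphoneSample(name="s-a", duration_ms=250.0, vowel_onset_ms=90.0)
print(playback_start(1000.0, sa))  # the "s" starts 90 ms before the beat
```

So a note notated at 1000 ms actually triggers sample playback at 910 ms, letting the consonant finish just in time for the vowel to hit the beat.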
Next on the presentation was a description of how a sound bank is created. The person voicing the sound bank first sings through a specialized script, which sounds a bit like Buddhist chanting according to Kenmochi. The Japanese and English scripts sound similar, although the latter is much more complicated. Apparently, they once invited an American singer to record for VOCALOID; he was happy and enthusiastic at first, but as recording progressed he gradually got angrier and angrier and eventually escaped. The recording process takes roughly two to three hours for Japanese, and roughly four to five for English. Once recording is done, the audio undergoes phonetic segmentation (breakdown into phonemes), diphone segmentation (grouping into pairs of adjacent phonemes), pitch detection, and spectral envelope processing. This whole portion of the process takes roughly three months. After this is done, the sound bank library is packaged into a product, ready to be sold.
With the introduction of VOCALOID3, several things were improved in the engine. Kenmochi pointed out that VOCALOID3 now has better spline-based interpolation for timbres, smoothing out abrupt changes in timbre when there are large intervals between adjacent notes. VOCALOID3 also introduced triphones, where a specific sequence of three phonemes triggers a special triphone sample instead of a blend of two diphone samples, as well as consonant length control, which apparently reduces how metallic certain consonants sound. Lastly, VOCALOID3 brought a new editor program, and on that note, Kenmochi moved on to demonstrate the new editor through three short tasks.
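To see why spline interpolation smooths timbre better than straight blending, compare linear interpolation of a timbre parameter with a Catmull-Rom spline over the same note values. Catmull-Rom is just one convenient spline for the sketch; Kenmochi did not specify which spline VOCALOID3 actually uses, and the parameter values here are made up.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Catmull-Rom spline between p1 and p2 for t in [0, 1]."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t * t
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t * t * t
    )

def linear(p1, p2, t):
    """Straight-line blend between p1 and p2."""
    return p1 + (p2 - p1) * t

# A timbre parameter at four successive notes; interpolate between the
# middle two. The spline uses the neighbors to ease into the big jump
# instead of changing slope abruptly at each note boundary.
a, b, c, d = 0.2, 0.3, 0.9, 1.0
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  linear={linear(b, c, t):.3f}  spline={catmull_rom(a, b, c, d, t):.3f}")
```

Both curves hit the note values at t = 0 and t = 1, but the spline's slope is continuous across note boundaries, which is what removes the audible "corner" when adjacent notes are far apart.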
For his first demonstration, Kenmochi sought to recreate the first part of the opening song to Cutey Honey using VY1V3. He inputed the notes into the editor, using romaji to set the lyrics and then imported a backing track for the song. VOCALOID3 Editor also supports VST plugins for manipulation of tracks, and so a demonstration of “too much reverb” was also shown to give the audience an idea. Next, after opening a pre-prepared .vsqx file containing just the raw notes and lyrics, Kenmochi proceeded to show some tuning techniques. He first used reduction of VEL (velocity) to increase the length of consonants, making the singing more staccato. He then showed how to silence the vowel part of a syllable by appending “_0” to the vowel phoneme. Lastly, he made the singing sound more like classic idol singing by manually adding an upward pitch bend at the end of long notes.
With the Cutey Honey demonstration done, Kenmochi was ready to show the audience an English workflow by recreating part of Amazing Grace. After importing the backing track and selecting Sweet Ann, he plopped down some notes and started inputting words straight into the editor. However, for more control over how syllables are split across notes (i.e. the creation of melismas), the syllables have to be split manually: a hyphen is used to stretch a syllable across notes, and a forward slash is then used to mark the end of a melisma. While the software supports a basic set of English words, some have to be entered manually; for example, “wretch” required a custom entry consisting of the word’s phonemes. With the lyrics in place, Kenmochi demonstrated that you could switch to a different sound bank without having to redo everything. He did mention that you shouldn’t pick a Japanese singer because “in that case, the synthesis result would be very funny.” Of course, the audience wanted a taste of that, and thus VY1V3 was picked to sing Amazing Grace; apparently she had stage fright, and what came out was merely silence. Kenmochi promptly switched over to Prima to demonstrate the more British-sounding sound bank, and then demonstrated Tonio after dropping the notes an octave.
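The hyphen-and-slash melisma notation can be sketched as a tiny token-to-note assignment. This parsing is a hypothetical reconstruction from the description above (the editor handles it internally, and the exact token syntax is an assumption): a hyphen carries the previous syllable onto the next note, and a slash marks where the melisma ends.

```python
def assign_lyrics(tokens):
    """Map lyric tokens to per-note syllables, honoring '-' continuations."""
    notes = []
    current = None
    for tok in tokens:
        melisma_end = tok.endswith("/")  # slash marks the end of a melisma
        tok = tok.rstrip("/")
        if tok == "-":
            notes.append(current)  # stretch the previous syllable to this note
        else:
            current = tok
            notes.append(tok)
        if melisma_end:
            current = None
    return notes

# "A-ma-zing grace", with "grace" held melismatically over two extra notes
print(assign_lyrics(["A", "ma", "zing", "grace", "-", "-/"]))
# ['A', 'ma', 'zing', 'grace', 'grace', 'grace']
```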
The third demonstration was an attempt to make an English version of the Cutey Honey opening. After loading the Japanese version and adjusting note lengths, he input the English lyrics for the verse (copied from a website) into the editor. Of course, to spice things up, he also added that idol-style tuning at the end, but overdid it and made the result a bit odd. With the three short tasks complete, Kenmochi then described the Job Plugin system added with VOCALOID3, stating that it uses very simple Lua scripts. He demonstrated the built-in Staccato plugin, which shortens note lengths, and showed that the plugin code is quite short and simple. Next, he browsed through some plugins available through the VOCALOID STORE, including one called NyanNote which changes lyrics to “nyan”. He settled on demonstrating the IdolStyle plugin, which does much the same thing he had been doing manually to add classic idol singing technique to the song, and noted afterwards that maybe the effect was a bit too much. He closed the Job Plugin section with a wish list of functionality he’d like to see in user-created plugins: realtime MIDI input, a staff editor, automatic composition (converting lyrics into a song), and automatic expressions. The SDK is available as a free download from the VOCALOID STORE.
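The real Job Plugins are Lua scripts written against the VOCALOID3 editor's API, which isn't documented here; this Python sketch only illustrates the kind of transformation a Staccato-style plugin performs, shortening each note while keeping its start time. The note representation and scaling factor are assumptions for the example.

```python
def staccato(notes, factor=0.5):
    """Scale note durations down; notes are (start_tick, length_tick) pairs."""
    return [(start, max(1, int(length * factor))) for start, length in notes]

# Three notes: two quarter notes and a half note (480 ticks per quarter)
melody = [(0, 480), (480, 480), (960, 960)]
print(staccato(melody))
# [(0, 240), (480, 240), (960, 480)]
```

A batch transformation like this over the selected notes is all a plugin of this kind needs to do, which is why Kenmochi could show that the actual Lua source was short and simple.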
The final part of Kenmochi’s presentation touched upon his philosophy behind VOCALOID. He noted that with VOCALOID, a music producer can communicate directly with an audience instead of having to go through an actual singer. He felt the reason VOCALOID sound banks needed characters attached to them was that audiences needed a symbol to help them adapt to this new way of song production. Still, he believes that VOCALOID is a new instrument that can bring about change to the music landscape, citing Beethoven’s sonatas as an example. He explained that the range of notes in Beethoven’s sonatas didn’t change much, except at points when he received new pianos whose updated technology allowed him to extend his range. That is, new capabilities allowed him to be less constricted in his music creation. Kenmochi also cited the example of the synthesizer, which forever changed the sound of pop music around 1980. In closing, he asked the audience to create new music and a new future with VOCALOID, because it is a new instrument that will change the musical landscape and transmits emotions directly to the listener.
After the panel, Kenmochi talked to some enthusiastic audience members, took pictures with them and signed autographs. The following day, he also interviewed a few music producers who attended Anime Expo, getting feedback on the VOCALOID software itself as well as the music creation workflow.