“Continuous Sound” Recording: UTAU Upgrades Realism

Hello! I’m Aster (Aster Selene), and you may recognize me as the girl who used to put snarky, troperiffic reviews on all of the UTAU rankings. Sadly, I can’t do that anymore, as my parents don’t take well to me taking breaks from my studies for longer than 15 minutes, and UTAU ranking reviews usually take longer than that… Rest assured that someday I will storm back into those reviews in full glory and start re-applying the snark.

In any case! Today I’m here to discuss the use of continuous sound (連続音 or renzokuon) voicebanks, also known as VCV. They’re also erroneously called “triphones”.

First, a little bit of history. I’m not so good at this part, so I’m going to paraphrase what a friend of mine said in a different venue:

The VCV (vowel-consonant-vowel) method was developed by Ameya, the creator of UTAU. It was first displayed with the UTAU Momone Momo, with the song “Kenka Wakare” (original by MimiroboP).

There was a catch, though. Momo’s voicebank at the time was not a full VCV voicebank – it used what is now called a “Lite” reclist (a reclist being the list of syllables you record for an UTAU voicebank). The list didn’t cover all possible combinations, so while it could sound good in some parts, it would fall back to CV (normal) quality in others (and if you listen, you’ll hear that in the middle of the song it starts getting a little choppier).

This is the part most people skip: After Momo came the UTAU Otodamaya. Not a lot of people really know her even now.

Afterwards came Shirakane Hiyori, who had a different reclist from Momo’s, but it was still a Lite list.

And of course, a bunch of UTAU started to follow suit. Some started to experiment with complete “standalone” VCV banks that would cover an entire song, but the idea never fully took hold… that is, until a new UTAU named Namine Ritsu came along with a standalone list. And here, things started to take off.

Now, how does VCV work? Instead of recording flat syllables like “ka” and “to”, you would record ones with vowels before them like “a ka” and “i to”. For example, if you were to plug in the first line of the song “Toeto” in CV, you’d write

あ な た の こ と が

(a) (na) (ta) (no) (ko) (to) (ga)

And if you were to do this in VCV:

– あ    a な    a た    a の    o こ    o と    o が

(- a) (a na) (a ta) (a no) (o ko) (o to) (o ga)
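
The CV-to-VCV rewrite above is mechanical enough to sketch in code. The Python below is only an illustration, not anything UTAU itself ships: the function names are made up, it works on romaji rather than kana, and it assumes each alias is simply the previous syllable’s final vowel (or “-” at the start of a phrase) followed by the current syllable, with “n” treated as its own preceding sound.

```python
VOWELS = set("aiueo")


def trailing_sound(syllable: str) -> str:
    """Return the sound that the *next* VCV alias should start with."""
    last = syllable[-1]
    # Assumption: a syllable either ends in a vowel or is the bare "n".
    if last in VOWELS or syllable == "n":
        return last
    raise ValueError(f"cannot find a trailing vowel in {syllable!r}")


def cv_to_vcv(syllables: list[str]) -> list[str]:
    """Turn a list of CV romaji syllables into VCV-style aliases."""
    aliases = []
    previous = "-"  # "-" marks the start of a phrase
    for syllable in syllables:
        aliases.append(f"{previous} {syllable}")
        previous = trailing_sound(syllable)
    return aliases


# The first line of "Toeto" from the example above.
toeto_line = ["a", "na", "ta", "no", "ko", "to", "ga"]
print(cv_to_vcv(toeto_line))
```

Run on the “Toeto” line, this produces the same seven aliases written out by hand above, starting with “- a” and ending with “o ga”.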

The method relies on the “overlap” setting. In UTAU, when you configure the oto.ini (the file that distinguishes your consonants from your vowels, so that the program doesn’t stretch out your consonants or treat a long silence in the recording as part of a note), there’s a value called overlap. In CV voicebanks it’s used to make consonants a little less awkward (you need a bit of overlap in consonants like k or ch).

In VCV voicebanks, the overlap smooths out the vowels. Take “a no” and “o to” from the example above: the leading o of “o to” overlaps the trailing o of “a no”, blending the two together.
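
To make the overlap value a bit more concrete, here is a small, hedged Python sketch that reads one oto.ini entry. It assumes the commonly seen layout of `filename=alias,offset,consonant,cutoff,preutterance,overlap` with values in milliseconds; check that field order against your own oto.ini before relying on it. The file name and all of the numbers in the sample line are made up for illustration, and `OtoEntry` and `parse_oto_line` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class OtoEntry:
    filename: str
    alias: str
    offset: float        # where the usable sound starts
    consonant: float     # length of the region that is not stretched
    cutoff: float        # where the usable sound ends
    preutterance: float  # how far the sound starts ahead of the note
    overlap: float       # how much of this sound blends into the previous one


def parse_oto_line(line: str) -> OtoEntry:
    """Parse one 'file=alias,n,n,n,n,n' line (assumed field order)."""
    filename, _, rest = line.strip().partition("=")
    alias, *numbers = rest.split(",")
    if len(numbers) != 5:
        raise ValueError(f"expected 5 numeric fields, got {len(numbers)}")
    offset, consonant, cutoff, preutterance, overlap = (float(n) for n in numbers)
    return OtoEntry(filename, alias, offset, consonant, cutoff, preutterance, overlap)


# Hypothetical entry for the "a ka" alias; the values are illustrative only.
sample_line = "_kakakikakukeka.wav=a ka,850,400,-600,250,83"
entry = parse_oto_line(sample_line)
print(entry.alias, entry.overlap)
```

In a VCV bank, the overlap field of an entry like this is what covers the leading vowel, which is the part that gets blended with the end of the previous note.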

The recording method is also modified. Because recording every single possible vowel-syllable combination would take up disk space and drive the voicer mad, the syllables are recorded in sets. To take an example from Ritsu’s VCV bank:



This is one sound file in Ritsu’s bank, and this style of recording will yield seven syllables if the oto.ini is done properly: “- ka”, “a ka”, “a ki”, “i ka”, “a ku”, “u ke”, and “e ka”. This significantly cuts down on the number of recordings needed to fill up a whole bank.

(Note: Because each file in Ritsu’s bank contributes seven syllables, the bank is referred to as being “7-mora”. Other lists exist with other numbers of mora; for instance, Sukone Tei’s VCV is 5-mora, as is Takano Yuki’s.)
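
As a rough sketch of how one recording turns into several entries, and of why a Lite list leaves gaps, the Python below lists the aliases a single recorded string can provide and then checks which aliases a song needs that no recording covers. The function names are hypothetical, the syllables are written in romaji, and it assumes every alias is “previous vowel + current syllable”, as in the Ritsu example above.

```python
VOWELS = set("aiueo")


def aliases_in_recording(morae: list[str]) -> list[str]:
    """List the VCV aliases one recorded string of morae can provide."""
    aliases = []
    previous = "-"  # the first mora follows silence
    for mora in morae:
        aliases.append(f"{previous} {mora}")
        # Assumption: a mora ends in a vowel or is the bare "n".
        previous = mora[-1] if (mora[-1] in VOWELS or mora == "n") else "-"
    return aliases


def missing_aliases(needed: list[str], recordings: list[list[str]]) -> list[str]:
    """Return the needed aliases that none of the recordings provide."""
    available = set()
    for morae in recordings:
        available.update(aliases_in_recording(morae))
    return [alias for alias in needed if alias not in available]


# One 7-mora recording, following the Ritsu example above.
ka_recording = ["ka", "ka", "ki", "ka", "ku", "ke", "ka"]
print(aliases_in_recording(ka_recording))

# "o ka" is not in that recording, so a bank with only this file
# would have to fall back to plain CV "ka" after an "o".
print(missing_aliases(["a ka", "o ka"], [ka_recording]))
```

A full reclist is one where the second check comes back empty for any syllable pair a song might use; a Lite list accepts some gaps in exchange for fewer recordings.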

It also helps to use a guideBGM, which is essentially a song that plays in the background to help the recorder “sing” his or her samples to the beat. It regulates the rhythm and pitch, which makes it easier for the UTAU resampler to filter.

VCV has two major advantages. The first is that it makes transitions between syllables much smoother, since even the transitions are recorded by the voicer. The second is that it leaves a plausible gap between the vowel and the consonant where one naturally occurs: a singer can’t hold a vowel all the way into something like “a ta” without leaving a tiny space before the consonant, but UTAU will make CV voicebanks bleed the vowel straight into the consonant.

Currently, VCV itself is not very well-known to people who don’t use UTAU and only listen to the songs (though they probably notice that some songs sound smoother than others), but it’s very popular amongst UTAU users and voicebank voicers, and most popular voicebanks utilize VCV.

Also, VCV sounds a lot more realistic. Compare the old version of “13km”, an original Tei song, with the version using the new VCV voicebank:

VCV has also spawned more recording methods, mostly for other languages. The “CV-VC” method, though it’s debatable whether it is better, is being popularized for voicebanks handling languages such as English, Korean, and Chinese.

My personal experience with VCV? I’ve been using UTAU since November of 2009 (and I’ve had an UTAU since February), but even so, VCV still feels like something new to me. My UTAU’s VCV bank isn’t much better than her CV bank, but this is because I did a pretty poor job recording (no guideBGM, pitch too low, and I got lazy somewhere near the end). So I feel that VCV can be a very powerful tool if it’s done correctly. It requires a lot of work and a lot of patience, but it can bring some very good results when done right.

And does a voicer or user have to use VCV? Absolutely not. There are still ways to do great things with CV; for instance, the popular songs Trip Trip and Hana ni Naru were made using CV voicebanks. VCV just makes things a little easier.

deztora says:

I see VCV as a huge step forward for UTAU. To use a DDR analogy, CV is like playing normally, and VCV is like hugging the bar. Sure, there are amazing players who don’t touch the bar, and there are terrible players who hang onto the bar as if it’s their lifeline. However, as evidenced by In The Groove, the best bar player will always be able to beat the best non-bar player, because s/he can do things that the person who doesn’t hug the bar when playing just can’t do. The best VCV can make an UTAU sound as realistic as a human singer, though in a different way from Vocaloid.

Building off of the last point Aster made, though, this is not to say that CV is “bad” and VCV is “good.” (As an example, I’ve found that CV makes for better talkloid videos than VCV does.) However, it is a huge step forward, and I eagerly await more advancements with the UTAU program.

  1. Yeah, outside of Teto I've hardly ever listened to anything UTAU related and wasn't really impressed, to be honest. But even just listening to that first VCV usage in Kenka Wakare has me reconsidering a bit… This was a very well written article.

    1. Sure, the overall "listenability" of UTAU is lower than that of Vocaloid, but there are some real gems out there! Being a completely community-driven medium, there are many times more UTAUloids than Vocaloids, and they fill the holes that the current selection of Vocaloids leaves.

      For example, Sekka Yufu is the most softspoken UTAU ever. Her songs reach a level of chillness unlike any Vocaloid. The closest I've heard are certain Append Dark songs, but it's still not the same. Yufu's voice might take a little getting used to (it's very breathy), but I've come to love it.

      Here, try listening to slight light.
      http://www.youtube.com/watch?v=25dGoEU0pa8

      1. Wow! That was the first video I've seen of Yufu and she sounds amazing! 😀 She still has a slight robotic-ness to it, but for an UTAU, she sounds amazing! Even more so than Kaai Yuki, in my opinion.

      2. Actually, I don't really think the "listenability" of UTAU is any less than that of a VOCALOID, especially in well done VCV cases.
        In fact, a lot of the time, as long as you know who you're looking for, I find it easier to listen to an UTAUloid, as poor usage of UTAU doesn't result in the same "flat" effect as Vocaloid; it's just… well, boring and plain, not off-key.
        However, it really is a case of needles in haystacks. As there are literally hundreds of UTAUloids, it's really easy to find something dodgy, and harder to look through all those videos to find something good.

        Oh, and if you like the sound of Yufu, you /might/ want to check out more Shirakane Hiyori songs, like her cover of "Relics", though some people really dislike her.
        http://www.youtube.com/watch?v=HwrKMlxKSSg

        Although, personally, the reason I like UTAU is just because of the large number of males. High quality males. I mean, Suiga Sora rivals Kaito, even though he's pretty robotic. http://www.nicovideo.jp/watch/sm11206033 (A derp example, sorry about that, I just like that song, haha.)

          1. That might be because both of those were by the same producer, メリヒヨP. He's known for his exceptional tuning of UTAU. Here's his mylist: mylist/14309270.

            Also, those songs? Here they are on NND. Relics cover: nm8197415. Sakura cover: nm9036432.

            My personal favorite? A cover of Konayuki (nm9956137). What's that? You've never heard of Yamatouchi Iori before? Well you have now, and you now love him. You're welcome.

  2. Very interesting! I'm loving articles like this. It's great to know something new, something you had no knowledge about previously. I guess it helps that I find UTAU interesting and enjoyable, but still. Great article! Makes me curious about making something myself. (Hah, that will never happen.)

  3. I tend to use both VCV and CV voicebanks. As of now, the only UTAU I’ve been using is Momo, and some parts sound better with CV. For example: いる. I do not like the sound of i る, so I use CV for that.

