The viability of replacing human singers with computer-synthesised voice

Introduction:

Some fans of ‘VOCALOID’, a singing synthesiser made by Yamaha Corporation that generates synthesised vocals from an input melody and lyrics using voice samples recorded from human vocalists, have asked how viable it would be to replace human vocalists with voice synthesis. Inspired by these questions, especially those from fans of the most famous Japanese virtual singer, ‘Miku Hatsune’, this research investigates the viability of replacing human singers with voice synthesis technology. To reach a conclusion, the technology and price of voice synthesis were researched, cases of actual replacement were examined, and people working in the music industry were interviewed.

What is ‘Singing Synthesis’?

The viability of replacing human singers with voice synthesisers depends on the structure and processes that voice synthesis actually requires. Before comparing singing synthesisers with human vocalists, it is important to understand what voice synthesis is and how it works. According to a speech by Kenmochi and Oshita, the head engineers of the VOCALOID team at Yamaha Corporation, VOCALOID is a singing synthesiser that consists of three parts: the Score Editor, the Singer Library and the Synthesis Engine (Interspeech 2007). The Score Editor is the interface through which users input notes, lyrics and optional expressions (Interspeech 2007). Once the lyrics have been entered, the Score Editor converts the text into phonetic symbols (split by syllable when a word has two or more syllables) that the Synthesis Engine can read. The Synthesis Engine then synthesises the voice using the Singer Library, a database of voice samples recorded from a human voice actor that contains every possible combination of consonants and vowels, as well as sustained vowels, in the target language (SMAC 2003). The Singer Library can therefore be regarded as a virtual singer (Interspeech 2007).

Figure 1: “System Diagram”, Figure 1 in Kenmochi & Oshita, Interspeech 2007.
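
The three-part structure described above can be illustrated with a toy sketch. This is not Yamaha's implementation: the phonetic table, the sample file names and the function names are all hypothetical, and the real Synthesis Engine transforms and concatenates audio rather than just listing sample names. The sketch only shows how lyrics might flow from the Score Editor, through a Singer Library lookup, to a per-note list of samples.

```python
from dataclasses import dataclass


@dataclass
class Note:
    pitch: int   # MIDI note number, e.g. 60 = middle C
    lyric: str   # one syllable of the lyrics


# Hypothetical lyric-to-phonetic-symbol table (the Score Editor's job).
PHONETIC_TABLE = {"la": ["l", "a"], "mi": ["m", "i"]}

# Hypothetical Singer Library: consonant-vowel combinations and
# sustained vowels, each mapped to a made-up sample name.
SINGER_LIBRARY = {
    ("l", "a"): "sample_l-a.wav",
    ("a",): "sample_a_sustained.wav",
    ("m", "i"): "sample_m-i.wav",
    ("i",): "sample_i_sustained.wav",
}


def score_editor(notes):
    """Convert each note's lyric into phonetic symbols."""
    return [(note.pitch, PHONETIC_TABLE[note.lyric]) for note in notes]


def synthesis_engine(score):
    """For each note, pick the consonant-vowel sample and the sustained vowel."""
    plan = []
    for pitch, phonemes in score:
        transition = SINGER_LIBRARY[tuple(phonemes)]
        sustained = SINGER_LIBRARY[(phonemes[-1],)]
        plan.append((pitch, [transition, sustained]))
    return plan


plan = synthesis_engine(score_editor([Note(60, "la"), Note(64, "mi")]))
for pitch, samples in plan:
    print(pitch, samples)
```

Every sample in this sketch still originates from a recorded human voice, which is the point the following paragraphs rely on.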

Voice synthesis is carried out by transforming the samples: transposition (changing the key of a sound file), phase correction (“the process of mixing the real and imaginary signals in the complex spectrum that is obtained after Fourier transformation of the free induction decay” (Brouwer H, 2009)) and equalisation (adjusting the balance of the frequency components of an electronic signal), using Spectral Peak Processing (SPP) and phase and shape concatenation (SMAC 2003). Because the technology works by assembling pre-recorded audio, human vocalists are still required to supply the original voice for reconstruction. The use of voice synthesis is therefore technically not a ‘replacement’ of human singers. The rest of this research and comparison thus concerns the viability of a ‘literal’ replacement of human singers.
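
As a small worked example of transposition, the sketch below computes how far a sample's frequency has to be scaled to move it by a number of semitones. It assumes 12-tone equal temperament, in which each semitone multiplies the frequency by the twelfth root of 2. The function names are hypothetical, and this shows only the arithmetic of changing the key, not the Spectral Peak Processing that VOCALOID itself applies.

```python
def transposition_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def transpose(frequency_hz: float, semitones: float) -> float:
    """Frequency of a tone after shifting it by `semitones`."""
    return frequency_hz * transposition_ratio(semitones)


a4 = 440.0                           # concert pitch A4
print(transpose(a4, 12))             # one octave up doubles the frequency
print(round(transpose(a4, 3), 2))    # three semitones up, roughly C5
```

A recorded sample sung at one pitch can in principle be reused for other notes in this way, which is why a finite Singer Library can cover a whole melody.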

Guidelines for users of voice synthesisers

Reference guides are necessary for people who are starting to use a product. They tell new end-users as much about a released product as they need, so that they can use as many of its functions as they want. For VOCALOID, online video tutorials and an official reference manual are available to end-users. The official video tutorials are on the official VOCALOID YouTube channel, which is listed on the official VOCALOID website. The tutorial videos include “Tips & Editing Tricks” and “Operation Manual”, which are guides to specific functions of the VOCALOID engine (“Learn”, n.d.). The VOCALOID Reference Manual is included in the software package and covers the user interface, the functions of the software and charts of phonetic symbols for the supported languages, in this order: English, Japanese, Korean, Spanish and Chinese (Mandarin, with Bopomofo).

Comparison of recording price and time management

The viability of replacing human singers with voice synthesisers depends partly on recording price and time. Recording needs to be affordable so that users do not hesitate to run practical trials; if demo recordings are expensive, clients are less likely to make enough of them. Voice synthesisers can be purchased online from the official websites of the software or from retail stores. VOCALOID5, the latest version of the VOCALOID engine at the time of writing, is sold in two packages: the standard pack, which includes 4 singer libraries with the voice synthesis engine and costs USD 225.23 before tax for a user license, and the premium pack, which includes 8 singer libraries with the voice synthesis engine and costs USD 360.36 before tax for a user license (VOCALOID n.d.). User licenses for additional singer libraries can be purchased online, at an average cost of around USD 74.39 before tax per voice. The table below lists some of the singer libraries available from online stores.

Figure 2: List of available additional singer libraries. Crypton Future Media, “The software that makes singing computer become real”, Yamaha Corporation, “ADD-ON VOICEBANKS”.

UTAU, a free Japanese shareware synthesiser made by Ameya/Ayame, is available on its official website (“歌声合成ツールUTAU”, n.d.). A permanent user license for Synthesizer V, a synthesiser made by Dreamtonics Corporation Limited, can be purchased on its official website and costs USD 80 before tax (“Synthesizer V”, 2019).

Various websites for hiring vocalists are available. Music producers can contact vocalists online and check expected recording fees through web services such as SoundBetter, Fiverr, Vandalism Sounds and Vocalizr. The majority of professional vocalists on SoundBetter do not disclose their fees and ask clients to contact them for pricing. Those who do disclose a fee usually charge around USD 200 to 350 per song (“Top Female Singers for Hire”, n.d.). Fiverr lists songwriters as well as vocalists. Its prices vary widely and are not standardised, which can confuse clients. The units artists commonly use for pricing voice recordings are bars and seconds. Vandalism Sounds has well-organised pricing for hiring its vocalists: the recording fee for 1 verse (16 bars) is GBP 139 before tax (“Vocals”, n.d.). In addition, if the song is for commercial use, the royalties of the song with the vocal recording are split unless the songwriter buys full royalty rights from the vocalist(s) at additional cost (“Vocals”, n.d.). Vocalizr is very similar to SoundBetter: various professional vocalists can be found there with voice samples (“Find Vocalists”, n.d.), and music producers can contact suggested vocalists and leave reviews. All of these websites estimate that a voice recording will take around 5 to 7 days. This turnaround may affect clients who want to make multiple trials within a short time, and may therefore be a reason for them to choose ‘replacement’. However, price and turnaround time are not crucial factors in answering the research question.
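
The prices quoted in this section can be put side by side with a short calculation. The figures come from the text above; the function names are hypothetical, and the comparison is a rough sketch that ignores tax, the producer's own working time, any other software needed, and royalty arrangements.

```python
import math

# Prices quoted in this section (USD, before tax).
VOCALOID5_STANDARD = 225.23    # standard pack, 4 singer libraries
VOCALOID5_PREMIUM = 360.36     # premium pack, 8 singer libraries
VOCALIST_FEE_LOW = 200.0       # lower end of disclosed SoundBetter fees per song
VOCALIST_FEE_HIGH = 350.0      # upper end of disclosed SoundBetter fees per song


def cost_per_library(pack_price: float, libraries: int) -> float:
    """Average price paid for each singer library in a pack."""
    return pack_price / libraries


def break_even_songs(one_off_price: float, fee_per_song: float) -> int:
    """Smallest number of hired songs whose total fee reaches the one-off price."""
    return math.ceil(one_off_price / fee_per_song)


print(round(cost_per_library(VOCALOID5_STANDARD, 4), 2))
print(round(cost_per_library(VOCALOID5_PREMIUM, 8), 2))
print(break_even_songs(VOCALOID5_STANDARD, VOCALIST_FEE_LOW))
print(break_even_songs(VOCALOID5_PREMIUM, VOCALIST_FEE_HIGH))
```

Under these assumptions, the one-off license cost is matched after one or two hired songs, so price alone favours the synthesiser for producers who make many trial recordings.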

What people think about voice synthesis

The viability of replacing human singers with voice synthesisers also depends on public opinion about voice synthesis. Email interviews were conducted with people who have experience in the music industry to collect their viewpoints and expectations. Each interviewee’s response was different, but they shared one common point: a synthesised voice sounds quite different from a natural human voice. The majority of the interviewees said that the quality of a result made with voice synthesis depends on what artists are trying to ‘produce’ with the software, and that such a trial can be ‘experimental’. The ‘emotion’ and ‘spirit’ of singing acted as a ‘bias’ in this research, making it much more difficult to reach a clear conclusion.

Interviewee Dowling D (2019), a music technology supporter who works for Music Education Network, does not recommend synthesising a lead vocal for pop or any other style where the desired result is a ‘human singer’ who delivers the melody, hook(s) and lyrics. Dowling D (2019) says that exploring electronically generated vocal options would make practical and economic sense if what artists intend to make is a specifically robotic or ‘synthesised’ aesthetic. This opinion is supported by interviewee Whittington S (2019), who says that the preference of sound depends on the situation and gives the VOCALOID opera “The End”, performed with the virtual singer ‘Hatsune Miku’ and composed by the Japanese composer Keiichiro Shibuya, as an example of a ‘preferable case’ for the use of voice synthesis. The point of the opera was to explore the ‘life’ and eventual ‘death’ of a virtual character (Whittington S, 2016), so the use of a ‘not alive’ voice can be appropriate. In general, Dowling D (2019) thinks that the programmable synthesised generation of vocal styles can provide ‘budget options’ to replace singers where “funding or circumstance does not allow for the hiring of human performers”.

According to interviewee ‘Itsuwara’ (2019), a music producer in South Korea who creates music with VOCALOID, the sound style of singing synthesis depends on the taste of the music producer: the singing style can be tuned so that the synthesised voice sounds more human or more robotic. ‘Itsuwara’ claims that the singing technique comes down to the producer’s preference and mastering skills, so it might be hard to compare human vocalists and singing synthesisers in terms of singing technique. Both ‘Itsuwara’ (2019) and ‘Muse Queen’ (2019) recommend the software to music producers who are struggling to hire vocalists or who are looking for a special voice unlike those of human vocalists. ‘Itsuwara’ (2019) added that the actual purpose of developing VOCALOID is to help music producers make special songs. ‘Itsuwara’ (2019) was very happy that the software exists, said that he makes good use of it and thanked the engineers who contributed to the development of the singing synthesiser. However, he has complaints about the stability of the software.

Interviewee ‘Muse Queen’ (2019), a freelance vocalist who lives in South Korea, does not consider the voice synthesiser a competitor of human vocalists but rather another ‘genre’ of music within a subculture. ‘Muse Queen’ (2019) said that the quality of the ‘outcome’ made with the technology could be improved by adding a technique for smoothing the gaps between phonemes.

Interviewee Power P (2019) is an opera tenor who lives in New Zealand and performs internationally, a Senior Lecturer in both contemporary and classical voice at tertiary level, the professor of a Vocal Pedagogy course and a teacher of pupils of all ages and styles. Power P (2019) argues that using voice synthesis technology is not acceptable in any case of music production. Power P (2019) claims that there is no ‘soul’ or ‘spirit’ of singing in a voice synthesised by software, and gives the Korean traditional song “Arirang”, sung by So-hyang Kim, as an example of singing with ‘emotion’.

Whittington S (2019), head of Sonic Arts at the University of Adelaide, who teaches at a tertiary music institution, said that the preference of voice may vary with the situation, and that singing synthesis technology would need to be developed much further if what developers are aiming for is the ‘reality’ of voice. Whittington S (2019) suggests that the better path is not to develop voice synthesis as a ‘realistic’-sounding ‘replacement’ of human singers, but to develop it as another instrument in parallel with vocalists.

The actual case of ‘replacement’

The viability of replacing human vocalists with computer-synthesised voice can also be assessed through actual cases of ‘replacement’. There has been a case of the literal ‘resurrection’ of a dead vocalist with a voice synthesiser. In 2011, Yamaha Corporation made a private VOCALOID singer library based on the voice of Hitoshi Ueki, a famous Japanese vocalist who died in 2007. The VOCALOID team announced that it had succeeded in making a singer library from a singer who could not take part in recording; normally, separate voice recording sessions are required so that every possible syllable in the target language can be recorded (Kaufman R, 2011). Initial recordings made with the singer library were streamed on a Japanese video-streaming website, and the singer library came to be called ‘Ueki-loid’. This case shows how far the technology has developed and that the ‘replacement’ of a human singer is possible. It therefore directly supports the claim that replacing human singers with voice synthesis is viable, which makes it very important for the conclusion.

Conclusion

From the research so far, it is difficult to draw a conclusion from the comparison between the capability and recording price of VOCALOID and those of human singers. In an interview with Kaufman R (2011), Kenmochi H said that VOCALOID is regarded not as a surrogate for human singers but as a new type of musical instrument. In line with the interview responses from Dowling D (2019) and Whittington S (2019), it would be much better to consider and develop the voice synthesiser as a type of musical instrument parallel to human vocalists rather than as a replacement for human singers. Some people who think like Power P (2019) would not want to listen to songs made with voice synthesis, so the influence of singing synthesis may vary depending on what users prefer.

Improvements required

The main structure of my research was built on extensive reading of scientific journals and studies online, which allowed me to quickly establish a specific understanding of voice synthesis technology. I decided to obtain data about the structure of voice synthesisers because an understanding of voice synthesis was expected to be required for this research. The first step, researching what voice synthesis is and how it works, gave me the ‘statement’ that voice synthesis is a technology for reconstructing recorded speech samples into a speech phrase. This in turn gave me part of the research answer: ‘voice synthesisers should not be considered a factor that can replace human singers but another method of using the human voice’. The majority of the data available online were ‘unofficial’ resources that cannot be trusted, such as fan fiction and ‘personal opinions’. A lot of ‘fan artworks’ were filtered out as ‘unreliable’ resources, even though ‘public image’ is one of the most important factors for marketing.

The ‘preference’ of music genre and favourite artists is specific to each individual. Different generations in different regions have different perspectives, so it is hard to draw a generalised conclusion. The ‘trend’ of the music market changes constantly over time. The ‘skill’ of music production depends heavily on the artist, and voice synthesis is usually used by ‘beginners’. The ‘languages’ available for singing synthesisers are limited. These factors need to be considered for in-depth research. However, worldwide research would make the scope too broad and might cause bigger problems, so it was decided to limit the targets to ‘internet users’ only.

Only a limited range of questions, those about preferences and viewpoints on songs that have already been released, could be asked of a limited number of people working in the music industry (7 interview requests were sent and 5 people replied). As an improvement, audio quality (such as audio frequency and the amount of white noise) and speech quality (such as pronunciation and the transitions between syllables) should have been investigated, since ‘audio’ is one of the biggest parts of ‘music’. These investigations could be done in cooperation with audio engineers, if engineers willing to participate were available. Even though the email interviews were conducted with audio experts, expert opinions cannot exactly represent the opinions of others, so separate experiments on preference should also have been carried out with the public. More interviews with music producers who make or have made songs with voice synthesis could be conducted to give a brief description of what those people think about the use of voice synthesis in general.

We welcome feedback. Please do not hesitate to email producer.p@pseudoartist.com for discussion.

Bibliography

Websites:

AH-Software, “Information about AHS Store Discount Prices and Download”, n.d. Viewed on Feb 9, 2019.  < https://www.ah-soft.com/store/ahsuser_en.html >
Crypton Future Media, “The software that makes singing computer become real”, n.d. Viewed on Feb 9, 2019. Written in Japanese. < https://ec.crypton.co.jp/pages/prod/vocaloid >
Fiverr, “Singer-Songwriters”, n.d. Viewed on Feb 5, 2019. < https://www.fiverr.com/categories/music-audio/singers-songwriters/?filter=rating&ref=service_type%3Asinging%7Cpackage_includes%3Acommercial_use%7Corigin%3Aheader >
Rachel Kaufman, “Speech Synthesizer Could ‘Resurrect’ Dead Singers”, 2011. Viewed on Feb 20, 2019. < https://www.wired.com/2011/12/ueki-loid-speech-synthesizer/ >
SoundBetter, “Top Female Singers for Hire”, n.d. Viewed on Feb 5, 2019. < https://soundbetter.com/s/singer-female >
“Synthesizer V”, 2019. Dreamtonics. Viewed on May 5, 2019. < https://synthesizerv.com/en/ >
“歌声合成ツールUTAU” [Vocal Synthesis Tool UTAU] (in Japanese). UTAU. Viewed on Apr 30, 2019. < http://utau2008.web.fc2.com/index.html >
Vandalism Sounds, “Vocals”, n.d. Viewed on Feb 5, 2019.  < https://vandalism-sounds.com/vocalist.html >
Vocalizr, “Find Vocalists”, n.d. Viewed on Feb 5, 2019.  < https://vocalizr.com/vocalists >
Yamaha Corporation, “Learn”, n.d. Viewed on Feb 9, 2019. < https://www.vocaloid.com/en/learn/ >
Yamaha Corporation, “VOCALOID”, n.d. Viewed on Feb 9, 2019.  < https://www.vocaloid.com/en/ >

Documents:

Bonada, Loscos, Kenmochi, “Sample-based Singing-voice Synthesizer by Spectral Concatenation”, Proc. of SMAC 03, 439-442, 2003.
Brouwer H, “Evaluation of algorithms for automated phase correction of NMR spectra”, December 2009. Journal of Magnetic Resonance. < https://www.sciencedirect.com/science/article/abs/pii/S1090780709002730 >
Kenmochi, Hideki; Ohshima, Hayato. “VOCALOID – Commercial singing synthesizer based on sample concatenation”. Interspeech 2007. Archived from the original on June 6, 2012. < https://www.webcitation.org/68EBFxf5V?url=http://www.interspeech2007.org/Technical/ssc_files/Yamaha/VOCALOID_Interspeech.pdf >
“VOCALOID5 Reference Manual”, n.d. Yamaha Corporation. Viewed on Feb 9, 2019. < https://rsc-net.vocaloid.com/assets/pdf_files/VOCALOID5_Reference_Manual_ENG.pdf >

Email interviews:

Email interview with Dowling, D by the author. March 11, 2019.
Email interview with ‘Itsuwara’ by the author. February 13, 2019.
Email interview with ‘Muse Queen’ by the author. February 13, 2019.
Email interview with Power, P by the author. February 13, 2019.
Email interview with Whittington, S by the author. February 13, 2019.
