Build a Text-to-Speech Chat App with Amazon Polly and ChatEngine

Accessibility is often overlooked during application development. A user interface that works for one person might be unusable for a person with a disability. As designers and developers, the onus is on us to empathize with all users so that everything we create is inclusive.

In this blog post, we dig into how to make apps more accessible. Since chat-based applications continue to grow in adoption, why not bring accessibility to them? Here at PubNub, we already have an extensible chat framework called ChatEngine. Let’s see how we can quickly enable text-to-speech capabilities for chat apps built with the ChatEngine framework to assist a blind user.

Technology That Assists Blind Users

A blind user cannot read incoming chat messages on screen. To assist this user, the most obvious solution is a text-to-speech engine that synthesizes speech from the incoming messages.

The technology behind text-to-speech is already available. But instead of locally processing the text, it’s better to leverage the cloud so that the system can continuously learn and improve its speech rendering capabilities. Amazon Polly is one such service offered under the AWS Machine Learning umbrella.

One of Amazon Polly’s most valuable capabilities is the ability to stream natural speech, which is a big plus for a realtime application like chat. We are going to leverage this feature to build a speech-enabled chat client on top of PubNub ChatEngine.

Introducing ChatEngine

If you are familiar with PubNub, you probably know how easy it is to launch your own chat app based on ChatEngine. Just follow the ChatEngine Quickstart Tutorial and you will get a default chat app with all the source code.

Speech-enabled Chat App

To enable speech prompts for incoming chat messages, we can add an icon that activates this feature.

The icon sits at the top right of the chat UI. With it, we can enable or disable the feature with a single click.

Note: Clicking an icon is itself a barrier for a blind user. As an enhancement, UX designers can add a keyboard shortcut or some other, more accessible way to toggle the feature.

Tutorial Assets and Technology Overview

The source code of this speech-enabled chat app is available on GitHub. Most of it is taken from the default chat app source code provided by the ChatEngine framework. With a few modifications, you can easily build the speech synthesis feature on top of the default chat app.

You can clone the repository and follow along to get a sense of the code changes that are required.

Let’s look at the building blocks of this app by exploring the various technology components that work behind the scenes to deliver a functional chat experience. The README file accompanying the repository contains the steps for setting up all the components.

PubNub Functions

ChatEngine leverages PubNub’s serverless infrastructure, called Functions, to spin up the backend for this chat app. And since it is serverless, it can be brought to life within a few seconds.

When you create a chat app through the ChatEngine framework, PubNub Functions deploys the backend for you. It is instantly available and supports all the standard features required by a chat room application. It is completely hidden from the user, and no explicit setup or coding is needed.

ChatEngine Frontend

The default chat client code is already available from the ChatEngine quick start guide, and we can use that as a base for the modified, speech-enabled chat app. However, we need a few changes to make this happen.

Chat UI

First, there is a small change in the header portion of the chat UI to accommodate the speech activation icon.

<div class="chat-header clearfix">
    <img src="https://s3-us-west-2.amazonaws.com/s.cdpn.io/195612/chat_avatar_01_green.jpg" alt="avatar" />
    <div class="chat-about">
        <div class="chat-with">ChatEngine Demo Chat</div>
    </div>
    <div id="speechButton" class="speech-button"><img src="/speech-icon.png" alt="Toggle speech" /></div>
</div>

And here is the CSS class to style this icon.

.speech-button {
  float: right;
  margin-top: 6px;
  background-color: #d4e2dd;
}

Speech Activation for Chat App

HTML5 <audio> and <video> tags are the standard ways of embedding media controls on web apps. Since the app must be capable of playing the speech equivalent of chat messages, we have used the audio tag.

<audio id="player"></audio>

Now, let’s move to the JavaScript part of the chat app. We first need to check which of the available audio formats the browser can play.

The default ChatEngine initialization now runs only if this audio support check passes. After the ChatEngine initialization, we hook the click event for the icon to activate or deactivate the speech feature.

var AUDIO_FORMATS = {
  'ogg_vorbis': 'audio/ogg',
  'mp3': 'audio/mpeg',
  'pcm': 'audio/wave; codecs=1'
};

var supportedFormats;
var player;
var speechEnabled = false;

// this is our main function that starts our chat app
const init = () => {
  
  // First things first: check the browser's audio capabilities
  player = document.getElementById('player');
  supportedFormats = getSupportedAudioFormats(player);

  if (supportedFormats.length === 0) {
      submit.disabled = true;
      alert('The web browser in use does not support any of the' +
            ' available audio formats. Please try with a different' +
            ' one.');
  } else {

    // connect to ChatEngine with our generated user
    ChatEngine.connect(newPerson.uuid, newPerson);

    // when ChatEngine is booted, it returns your new User as `data.me`
    ChatEngine.on('$.ready', function(data) {

        // store my new user as `me`
        me = data.me;

        // create a new ChatEngine Chat
        myChat = new ChatEngine.Chat('chatengine-demo-chat');

        // when we receive messages in this chat, render them
        myChat.on('message', (message) => {
            renderMessage(message);
        });

        // when a user comes online, render them in the online list
        myChat.on('$.online.*', (data) => {   
          $('#people-list ul').append(peopleTemplate(data.user));
        });

        // when a user goes offline, remove them from the online list
        myChat.on('$.offline.*', (data) => {
          $('#people-list ul').find('#' + data.user.uuid).remove();
        });

        // wait for our chat to be connected to the internet
        myChat.on('$.connected', () => {

            // search for 50 old `message` events
            myChat.search({
              event: 'message',
              limit: 50
            }).on('message', (data) => {
              
              console.log(data)
              
              // when messages are returned, render them like normal messages
              renderMessage(data, true);
              
            });
          
        });

        // bind our "send" button and return key to send message
        $('#sendMessage').on('submit', sendMessage)

      });

      $("#speechButton").click(function(){

        if(speechEnabled){
          speechEnabled = false;
          $("#speechButton").css("background-color","#d4e2dd");

        } else {
          speechEnabled = true;
          $("#speechButton").css("background-color","#4ceab1");
        }

      })

  }

  
};

Speech Synthesis for Chat App

The HTML5 audio tag can stream audio from a URL that returns a chunked HTTP response with a media content type.

Before rendering each chat message, the app checks the speechEnabled flag. If it is set, the app makes a request to the streaming server and plays the speech received in response. Here is what the default renderMessage() function of the chat app looks like after speech enablement.

// render messages in the list
const renderMessage = (message, isHistory = false) => {

    // use the generic user template by default
    let template = userTemplate;

    // if I happened to send the message, use the special template for myself
    if (message.sender.uuid == me.uuid) {
        template = meTemplate;
    }

    let el = template({
        messageOutput: message.data.text,
        time: getCurrentTime(),
        user: message.sender.state
    });

    console.log(message.data);

    // render the message
    if(isHistory) {
      $('.chat-history ul').prepend(el); 
    } else {
      
      if(speechEnabled && message.sender.uuid != me.uuid){

        player.src = '/read?voiceId=' +
                        encodeURIComponent("Aditi") +
                        '&text=' + encodeURIComponent(message.data.text) +
                        '&outputFormat=' + supportedFormats[0];
        player.play();

      }

      $('.chat-history ul').append(el);

    }
  
    // scroll to the bottom of the chat
    scrollToBottom();

};

Streaming Server for Text-to-Speech Conversion

As mentioned earlier, we’re using Amazon Polly to enable text-to-speech conversion. You can refer to the sample Python server that demonstrates how the Polly service is called and how the binary audio stream is returned to the chat client.

The server code is derived from this sample server application. Here is a quick code walkthrough of the main functionalities of this server app.

App Hosting

The server hosts the chat app, and URL routes are defined for all the resources the app uses.

PROTOCOL = "http"
ROUTE_INDEX = "/index.html"
ROUTE_VOICES = "/voices"
ROUTE_READ = "/read"
ROUTE_CSS = "/chat.css"
ROUTE_JS = "/chat.js"
ROUTE_IMG = "/speech-icon.png"


def do_GET(self):
        """Handles GET requests"""

        # Extract values from the query string
        path, _, query_string = self.path.partition('?')
        query = parse_qs(query_string)

        response = None

        print(u"[START]: Received GET for %s with query: %s" % (path, query))

        try:
            # Handle the possible request paths
            if path == ROUTE_INDEX or path == ROUTE_CSS or path == ROUTE_JS or path == ROUTE_IMG:
                response = self.route_index(path, query)
            elif path == ROUTE_VOICES:
                response = self.route_voices(path, query)
            elif path == ROUTE_READ:
                response = self.route_read(path, query)
            else:
                response = self.route_not_found(path, query)

            self.send_headers(response.status, response.content_type)
            self.stream_data(response.data_stream)

        except HTTPStatusError as err:
            # Respond with an error and log debug
            # information
            if sys.version_info >= (3, 0):
                self.send_error(err.code, err.message, err.explain)
            else:
                self.send_error(err.code, err.message)

            self.log_error(u"%s %s %s - [%d] %s", self.client_address[0],
                           self.command, self.path, err.code, err.explain)

        print("[END]")
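
To make the query-string handling in do_GET more concrete, here is a small standalone sketch of the same partition-and-parse step using Python 3's urllib.parse. The helper names split_request and first_value, as well as the sample URL, are assumptions made for illustration and are not taken from the server code.

```python
from urllib.parse import parse_qs


def split_request(raw_path):
    """Split a request target into its path and a dict of query values.

    Hypothetical helper for illustration; do_GET does this inline.
    """
    path, _, query_string = raw_path.partition("?")
    # parse_qs percent-decodes values and returns a list per key
    return path, parse_qs(query_string)


def first_value(query, key, default=""):
    """Return the first value for key, or default if the key is absent."""
    return query.get(key, [default])[0]


if __name__ == "__main__":
    path, query = split_request(
        "/read?voiceId=Aditi&text=Hello%20there&outputFormat=mp3"
    )
    print(path, first_value(query, "text"))
```

By default parse_qs drops parameters with empty values, so a request such as /read?text= leaves first_value returning the empty default, which a length check on the text can then reject.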

Amazon Polly Initialization

When a chat client is served, the server also initializes boto3, the AWS SDK for Python, through which we can access the Amazon Polly service.

polly = boto3.Session(
    aws_access_key_id="<AWS USER ACCESS KEY>",
    aws_secret_access_key="<AWS USER SECRET KEY>",
    region_name='us-west-2').client('polly')

There are a few things that need to happen behind the scenes to access the Polly service via the <AWS USER ACCESS KEY> and <AWS USER SECRET KEY> parameters. Refer to this README file section for setting up your AWS account and the prerequisites for accessing the Polly service.
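
Keys written directly into source code are easy to leak. As an alternative sketch, boto3 can resolve credentials through its default chain (for example the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, or a shared credentials file), so only the region has to be passed. The function name make_polly_client and its session_factory parameter are assumptions for illustration; the parameter exists only so the sketch can be exercised without AWS access.

```python
import os


def make_polly_client(session_factory=None):
    """Create a Polly client without embedding keys in the source.

    session_factory is a hypothetical hook for testing; by default the
    real boto3.Session is used.
    """
    if session_factory is None:
        import boto3  # imported lazily so this module loads without boto3
        session_factory = boto3.Session

    # Credentials come from boto3's default chain (environment
    # variables, shared credentials file, and so on).
    region = os.environ.get("AWS_DEFAULT_REGION", "us-west-2")
    return session_factory(region_name=region).client("polly")
```

With credentials exported in the environment, polly = make_polly_client() could stand in for the hardcoded session.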

Streaming Request

This is where the real magic happens. The chat app invokes a specific URL endpoint, “/read”, to request text-to-speech conversion. This is an HTTP GET call, and the text to be converted is supplied as a query parameter. Amazon Polly then kicks in and returns a binary audio stream containing the synthesized speech.

def route_read(self, path, query):
        """Handles routing for reading text (speech synthesis)"""
        # Get the parameters from the query string
        text = self.query_get(query, "text")
        voiceId = self.query_get(query, "voiceId")
        outputFormat = self.query_get(query, "outputFormat")

        # Validate the parameters, set error flag in case of unexpected
        # values
        if len(text) == 0 or len(voiceId) == 0 or \
                outputFormat not in AUDIO_FORMATS:
            raise HTTPStatusError(HTTP_STATUS["BAD_REQUEST"],
                                  "Wrong parameters")
        else:
            try:
                # Request speech synthesis
                response = polly.synthesize_speech(Text=text,
                                                    VoiceId=voiceId,
                                                    OutputFormat=outputFormat)
            except (BotoCoreError, ClientError) as err:
                # The service returned an error
                raise HTTPStatusError(HTTP_STATUS["INTERNAL_SERVER_ERROR"],
                                      str(err))

            return ResponseData(status=HTTP_STATUS["OK"],
                                content_type=AUDIO_FORMATS[outputFormat],
                                # Access the audio stream in the response
                                data_stream=response.get("AudioStream"))

Every time the chat app sends a request to the streaming server, this code runs, and synthesized speech is generated on the fly and streamed back to the app. Awesome as it may seem, this is all we need to enable speech synthesis in this chat app.
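
The stream_data method called from do_GET is not shown in this post. As a rough sketch of what streaming back means at the HTTP level, the generator below frames a readable stream as an HTTP/1.1 chunked body and closes the stream when it is done. The name iter_http_chunks and the CHUNK_SIZE value are assumptions, and io.BytesIO stands in for the AudioStream returned by Polly.

```python
import io
from contextlib import closing

CHUNK_SIZE = 1024  # assumed value for illustration


def iter_http_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield the frames of an HTTP/1.1 chunked body read from stream."""
    # closing() makes sure stream.close() runs once the loop finishes
    # or an error is raised while reading
    with closing(stream) as managed:
        while True:
            data = managed.read(chunk_size)
            # each frame is: size in hex, CRLF, payload, CRLF
            yield b"%X\r\n%s\r\n" % (len(data), data)
            if not data:
                # the zero-length frame just emitted ends the body
                break


if __name__ == "__main__":
    fake_audio = io.BytesIO(b"abcdefghij")  # stands in for AudioStream
    for frame in iter_http_chunks(fake_audio, chunk_size=4):
        print(frame)
```

Inside a request handler, each frame would be written to self.wfile after sending a Transfer-Encoding: chunked header, which lets the audio tag start playback before the whole clip has been synthesized.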

Speech-enablement Beyond Accessibility

Beyond accessibility, this feature can also help in the many use cases that require voice prompts for background apps and specific events.

Other use cases could build on the Amazon Polly Block, which invokes the Amazon Polly service, to generate speech samples for specific text. This would be ideal for applications that need voice commands generated from a pre-defined list of text messages.
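
As a minimal sketch of that idea, the function below synthesizes one audio file per phrase from a pre-defined list. It calls Polly directly through a boto3 client like the one created earlier, rather than through the PubNub Block. The function name, the file-naming scheme, and the sample phrases are hypothetical.

```python
from contextlib import closing
from pathlib import Path


def generate_samples(polly, phrases, out_dir, voice_id="Aditi"):
    """Write one MP3 per phrase into out_dir and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(phrases):
        response = polly.synthesize_speech(
            Text=text, VoiceId=voice_id, OutputFormat="mp3"
        )
        target = out_dir / ("prompt_%02d.mp3" % index)
        # close the stream promptly so connections are not held open
        with closing(response["AudioStream"]) as stream:
            target.write_bytes(stream.read())
        paths.append(target)
    return paths
```

A call such as generate_samples(polly, ["Door open", "Low battery"], "prompts") would then leave prompt_00.mp3 and prompt_01.mp3 in the prompts directory, ready to be played without another round trip to the service.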
