Build a Text-to-Speech Chat App with Amazon Polly & PubNub

7 min read Shyam Purkayastha on Aug 23, 2019

Accessibility is often overlooked during application development. A user interface that works for one person can be severely limiting for a person with a disability. As designers and developers, the onus is on us to empathize with every user so that what we build is inclusive.

In this blog post, we dig into how to make apps more accessible. Chat-based applications continue to grow in adoption, so why not bring accessibility to them? Here at PubNub, we make chat easy for developers to build. Let’s see how we can quickly enable text-to-speech capabilities for chat apps built with PubNub to assist a blind user.

Technology That Assists Blind Users

A blind user cannot easily read incoming chat messages. The most obvious way to assist this user is a text-to-speech engine that synthesizes speech from the incoming chat messages.

The technology behind text-to-speech is already available. But instead of locally processing the text, it’s better to leverage the cloud so that the system can continuously learn and improve its speech rendering capabilities. Amazon Polly is one such service offered under the AWS Machine Learning umbrella.

One of Amazon Polly’s most valuable capabilities is the ability to stream natural speech, which is a big plus for a real-time application like chat. We are going to leverage this feature to build a speech-enabled chat client on top of PubNub.

PubNub and Text-To-Speech Chat Apps

If you are familiar with PubNub, you probably know how easy it is to launch your own chat app. To connect to the globally distributed, auto-scaling PubNub Data Stream Network, sign up for a free set of Pub/Sub keys. API keys work instantly, and they allow up to 1 million messages per month for free.

Be sure to enable Functions, Presence, and History for the key set in the PubNub Admin Dashboard before continuing!

Speech-enabled Chat App

To enable speech prompts for incoming chat messages, we can add an icon to the chat UI that activates this feature, something like this:

Amazon Polly Chat with PubNub

Notice the icon at the top right area of the chat UI. With this icon, we can enable or disable this feature with a simple click.

Note: Clicking an icon is itself a barrier for a blind user. As an enhancement, UX designers can add a keyboard shortcut or some other more accessible way to toggle the feature.

Tutorial Assets and Technology Overview

The source code of this speech-enabled chat app is available on GitHub. Most of the code is from a demo chat app with jQuery and PubNub. With a few modifications, you can easily build the speech synthesis feature on top of the default chat app.

You can clone the repository and follow along to get a sense of the code changes that are required.

Let’s look at the building blocks of this app by exploring the various technology components that work behind the scenes to deliver a functional chat experience. The README file accompanying the repository contains the steps for setting up all the components.

Functions

PubNub’s serverless infrastructure called Functions can be used to spawn off the backend for this chat app. And since it is serverless, it can be brought to life within a few seconds.

We can quickly deploy an infinitely scalable, globally distributed REST API in a few minutes using Functions. Let’s clone the AWS Polly microservice using Functions. Go to the PubNub AWS Polly Block page and import the Function to your account. Click the Try It Now button, then follow the instructions.

Import a Function screen

If you have not already, you need to make an AWS account and create API keys to access the AWS Polly API. Follow these steps to set up your AWS account for accessing the Amazon Polly service:

Step 1: Set up an IAM user to access the Amazon Polly service
Follow these steps to create an IAM user for Amazon Polly. Make sure that the IAM user has full permissions for accessing the Amazon Polly service.

Step 2: Download the IAM user credentials
Download the credentials file for the IAM user and save it. This file contains the AWS access key ID and the AWS secret access key. Make a note of these two parameters.

AWS Polly API Keys

Now go to your Functions editor, where your serverless JS event handler lives. Click the MY SECRETS button on the left. Insert your AWS API keys here so the Functions server can use AWS Polly. The keys to insert in the Functions Vault are AWS_access_key and AWS_secret_key, respectively.

Functions Vault Key Storage

PubNub Chat App Frontend

Let’s go over the front end code for making a browser chat app with PubNub. The UI here is made with HTML, CSS, JS, and jQuery. All of the code can be cloned from this GitHub Repository with the Text-To-Speech AWS Polly and PubNub chat app.

Chat UI

Let’s add a toggle button that turns the read-aloud feature on and off.

<div class="chat-header clearfix">
    <img src="https://s3-us-west-2.amazonaws.com/s.cdpn.io/195612/chat_avatar_01_green.jpg" alt="avatar" />
    <div class="chat-about">
        <div class="chat-with">PubNub with AWS Polly Demo Chat</div>
    </div>
    <div id="speechButton" class="speech-button"><img src="./speech-icon.png" alt="Toggle speech" /></div>
</div>

And here is the CSS class to style this icon.

.speech-button {
  float: right;
  margin-top: 6px;
  background-color: #d4e2dd;
}

Speech Activation for Chat App

HTML5 <audio> and <video> tags are the standard ways of embedding media controls on web apps. Since the app must be capable of playing the speech equivalent of chat messages, we have used the audio tag.

<audio id="player"></audio>

Now, let’s move to the JavaScript part of the chat app. We first need to check which of the available audio formats the browser can play.

The default initialization now runs only if this pre-condition check finds at least one supported audio format. Finally, after the PubNub initialization, we hook in the click event for the icon to activate or deactivate the speech feature.

var AUDIO_FORMATS = {
    'ogg_vorbis': 'audio/ogg',
    'mp3': 'audio/mpeg',
    'pcm': 'audio/wave; codecs=1'
};

var supportedFormats;
var player;
var speechEnabled = false;

// this is our main function that starts our chat app
const init = () => {
    // First things first, check the browser's audio capabilities
    player = document.getElementById('player');
    supportedFormats = getSupportedAudioFormats(player);

    if (supportedFormats.length === 0) {
        submit.disabled = true;
        alert('The web browser in use does not support any of the' +
            ' available audio formats. Please try with a different' +
            ' one.');
    } else {

        pubnub.addListener({
            message: function(message) {
                renderMessage(message);
            },
            presence: function(presenceEvent) {
                let type = presenceEvent.action;

                if (type === 'join') {
                    let person = generatePerson(true);
                    person.uuid = presenceEvent.uuid;
                    $('#people-list ul').append(peopleTemplate(person));
                } else if (type === 'leave' || type === 'timeout') {
                    $('#people-list ul').find('#' + presenceEvent.uuid).remove();
                }
            }
        });

        pubnub.subscribe({
            channels: [chatChannel],
            withPresence: true
        });

        //get old messages from history
        pubnub.history({
                channel: chatChannel,
                count: 3,
                stringifiedTimeToken: true
            },
            function(status, response) {
                if (response.messages && response.messages.length > 0) {
                    response.messages.forEach((historicalMessage) => {
                        renderMessage(historicalMessage, true);
                    })
                }
            }
        );

        $("#speechButton").click(function() {

            if (speechEnabled) {
                speechEnabled = false;
                $("#speechButton").css("background-color", "#d4e2dd");

            } else {
                speechEnabled = true;
                $("#speechButton").css("background-color", "#4ceab1");
            }

        })

        $('#sendMessage').on('submit', sendMessage)
    }
};
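
The init() code above calls getSupportedAudioFormats(player), which is not listed in this post. Here is a minimal sketch of what such a helper could look like, assuming it simply filters the AUDIO_FORMATS map through the standard canPlayType() method of the media element. The version in the repository may differ.

```javascript
// Map of audio format names to MIME types (copied from the listing above).
var AUDIO_FORMATS = {
    'ogg_vorbis': 'audio/ogg',
    'mp3': 'audio/mpeg',
    'pcm': 'audio/wave; codecs=1'
};

// Sketch only: returns the keys of AUDIO_FORMATS that the given media
// element reports it can play. canPlayType() answers with '', 'maybe'
// or 'probably', so anything other than the empty string counts as support.
function getSupportedAudioFormats(player) {
    return Object.keys(AUDIO_FORMATS).filter(function(format) {
        var answer = player.canPlayType(AUDIO_FORMATS[format]);
        return answer === 'probably' || answer === 'maybe';
    });
}
```

In the browser, this would be called as getSupportedAudioFormats(document.getElementById('player')). An empty result means none of the listed formats can be played, which is the case init() reports with an alert.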

Speech Synthesis for Chat App

The HTML5 audio tag has the ability to stream audio from a URL that returns a chunked HTTP response containing media content types.

Before rendering each chat message, the app checks the speechEnabled flag. If it is set, the app makes a request to the streaming server and plays the speech received in response. Here is what the default renderMessage() function of the chat app looks like after speech enablement.

// render messages in the list
const renderMessage = (message) => {

    // use the generic user template by default
    let template = userTemplate;

    // if I happened to send the message, use the special template for myself
    if (message.publisher === pubnub.getUUID()) {
        template = meTemplate;
    }

    let isHistory = false
    if (message && !message.message) {
        console.log(message)
        message = { message: message.entry, timetoken: message.timetoken }
        isHistory = true;
    }

    var messageJsTime = new Date(parseInt(message.timetoken.substring(0, 13)));

    let el = template({
        messageOutput: message.message.text,
        tt: messageJsTime.getTime(),
        time: parseTime(messageJsTime),
        user: message.publisher
    });

    console.log(message);

    if (speechEnabled && message.publisher != pubnub.getUUID()) {

        getPollyAudioForText(message.message.text).then((audio) => {
            player.src = audio
            player.play();
        })
    }

    chatHistoryUl.append(el);

    // chatHistoryUl.append(template(context));

    // Sort messages in chat log based on their timetoken (tt)
    chatHistoryUl
        .children()
        .sortDomElements(function(a, b) {
            // declare the keys locally so they do not leak into the global scope
            const akey = a.dataset.order;
            const bkey = b.dataset.order;
            if (akey == bkey) return 0;
            return akey < bkey ? -1 : 1;
        })

    // scroll to the bottom of the chat
    scrollToBottom();

};
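
The timestamp handling in renderMessage() relies on the fact that a PubNub timetoken is a 17-digit value counting 100-nanosecond ticks since the Unix epoch, so its first 13 digits are the milliseconds that a JavaScript Date expects. Here is a small sketch of that conversion in isolation; timetokenToDate is a hypothetical helper name used for illustration, not part of the app or the PubNub SDK.

```javascript
// Hypothetical helper: convert a 17-digit PubNub timetoken
// (100-nanosecond ticks since the Unix epoch) to a JavaScript Date.
function timetokenToDate(timetoken) {
    // Keep the first 13 digits, which are the milliseconds part.
    var millis = parseInt(String(timetoken).substring(0, 13), 10);
    return new Date(millis);
}

var example = timetokenToDate('15665184000000000');
console.log(example.getTime()); // 1566518400000
```

Keeping the timetoken as a string, as stringifiedTimeToken: true does for the history call, avoids precision loss, because a 17-digit integer is larger than Number.MAX_SAFE_INTEGER.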

Streaming Server for Text-to-Speech Conversion

As mentioned earlier, we’re using Amazon Polly to enable text-to-speech conversion. The API is accessed via Functions. You need to copy the Function endpoint URL from the Functions editor. Click the copy URL button on the left side. This is on the same screen that you used to insert your AWS API keys.

Paste the URL on the first line of chat.js:

const pollyAudioPubNubFunction = 'YOUR_PUBNUB_FUNCTIONS_ENDPOINT_URL_HERE';

Also insert your PubNub API keys, so the Chat Messaging functionality will work (also in the chat.js file).

let pubnub = new PubNub({
    publishKey: 'YOUR_FREE_PUBNUB_PUBLISH_KEY_HERE',
    subscribeKey: 'YOUR_FREE_PUBNUB_SUBSCRIBE_KEY_HERE',
    uuid: newPerson.uuid
});

Every time the chat app sends a request to the Function, the Function code is executed, and synthesized speech is generated on the fly and streamed back to the app. As simple as it may seem, this is all we need to add speech synthesis capabilities to this chat app.
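
renderMessage() calls getPollyAudioForText(), which is not listed in this post. The sketch below shows one way such a helper could work. It rests on two assumptions that are not confirmed by this post: that the Function reads the message from a text query parameter, and that it responds with raw audio bytes. The endpoint argument is also added here only to keep the sketch self-contained; in the app, the URL comes from pollyAudioPubNubFunction. Check the imported Function's code for the actual request and response format.

```javascript
// Assumption: the Function reads the message from a `text` query parameter.
function buildPollyRequestUrl(endpoint, text) {
    var url = new URL(endpoint);
    url.searchParams.set('text', text); // URL-encodes the message
    return url.toString();
}

// Assumption: the Function responds with raw audio bytes (for example MP3).
// Resolves to an object URL that can be assigned to player.src.
function getPollyAudioForText(text, endpoint) {
    return fetch(buildPollyRequestUrl(endpoint, text))
        .then(function(response) {
            if (!response.ok) {
                throw new Error('Speech request failed: ' + response.status);
            }
            return response.blob();
        })
        .then(function(blob) {
            return URL.createObjectURL(blob);
        });
}
```

Building the URL through searchParams rather than string concatenation keeps messages that contain characters such as & or ? from corrupting the request.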

Speech-enablement Beyond Accessibility

Even beyond accessibility, this feature can help in many use cases that require voice prompts for background apps and specific events.

The Amazon Polly Block can also be used to generate speech samples for specific text. This is ideal for applications that need voice commands generated from a pre-defined list of text messages.
