Real-time Text Analysis & Categorization with Diffbot

9 min read Michael Carroll on Jan 30, 2017

Diffbot is a powerful API that extracts web data and content from articles, products, discussions, images, and more. Using AI, computer vision, and natural language processing, the API understands objects from any webpage and retrieves clean, structured data.

A perfect fit for the PubNub BLOCKS Catalog, our new Diffbot block for analyzing and extracting web data lets you process incoming real-time messages with attached URLs and append contextual information about each website to that stream. For example, if you were to submit a URL to a New York Times article, Diffbot and PubNub would output a message that includes article type, language, operating system, and titles.

So, what exactly is this content analysis that Diffbot provides? In this case, content analysis refers to taking a piece of text and trying to extract features such as author and language, assign tags with numeric scores (indicating confidence), and even provide semantic categories and relationships. Content analysis has a number of challenges, including incomplete context, ambiguity, sarcasm, slang, international languages and issues with domain-specific text analysis.

Content Analysis Tutorial

In this tutorial, we’ll dive into a simple example of how to enable textual content analysis in a real-time AngularJS web application using 25 lines of the PubNub JavaScript BLOCK and 74 lines of HTML and JavaScript. In the end, you’ll have an app with this basic functionality:

pubnub_diffbot

As we prepare to explore our sample web application with content analysis features, let’s check out the underlying Diffbot Analysis API.

Diffbot Analysis API

Automated text and content analysis services are quite challenging to build and train on your own; they require substantial effort and engineering resources to maintain across a diverse array of application domains and user languages (not to mention immense compute resources and training sets!).

On the other hand, the Diffbot Analysis APIs make it easy to enable your applications with straightforward text content analysis.

Looking closer at the APIs, text content analysis is just the beginning. There are a lot of API methods available for things like image and video processing and categorization, discussion analysis and more. It really is a powerful tool for distilling meaning from text, images and video. In this article though, we’ll keep it simple and just implement a basic text content analysis for user-provided URLs.

Since you’re reading this at PubNub, we’ll presume you have a real-time application use case in mind. In the sections below, we’ll dive into the content analysis use case, saving other web service use cases for the future.

Obtaining your PubNub Developer Keys

To get started, you’ll need a PubNub account, which includes your unique publish and subscribe keys. The keys look like UUIDs and start with the “pub-c-” and “sub-c-” prefixes respectively. Keep those handy, since you’ll need to plug them in when initializing the PubNub object in your HTML5 app below.

PubNub JavaScript SDK

PubNub plays together really well with JavaScript because the PubNub JavaScript SDK is extremely robust and has been battle-tested over the years across a huge number of mobile and backend installations. The SDK is currently on its 4th major release, which features a number of improvements such as isomorphic JavaScript, new network components, unified message/presence/status notifiers, and much more.

NOTE: for compatibility with the PubNub AngularJS SDK, our UI code will use the PubNub JavaScript v3 API syntax. We expect the AngularJS API to be v4-compatible soon. In the meantime, please stay alert when jumping between different versions of JS code!

Getting Started with Diffbot Analysis API

Next you’ll need a Diffbot account. Head over to the Diffbot signup form and sign up for a free trial, and make note of the API credentials (client token) sent to the registration email address.

Setting up the BLOCK

Next is getting started with PubNub BLOCKS.

Create BLOCK
  • Step 2: create a new BLOCK.
Create Event Handler
  • Step 3: paste in the BLOCK code from the next section and update the credentials with the Diffbot credentials from the previous steps above.
Paste BLOCK code
  • Step 4: Start the BLOCK, and test it using the “publish message” button and payload on the left-hand side of the screen.

That’s all it takes to create your serverless code running in the cloud with BLOCKS!

Diving into the Code – the BLOCK

You’ll want to grab the 25 lines of BLOCK JavaScript and save them to a file called pubnub_diffbot_block.js. It’s available as a Gist on GitHub for your convenience.

First up, we declare our dependencies on the xhr and query string modules.

export default request => {
    let xhr = require('xhr');
    let query = require('codec/query_string');

Next, we set up variables for accessing the service (the client token from previous steps and API url).

  let clientToken = 'YOUR_CLIENT_TOKEN';
  let apiUrl = 'https://api.diffbot.com/v3/analyze';

Next, we set up the HTTP params for the analysis API request. We use a GET request to submit the data (by default). We use the client token to authenticate our request to the API. We pass the URL attribute from the message.

  let queryParams = {
      token: clientToken,
      url: request.message.url
  };

Next, we create the URL from the given parameters.

  let url = apiUrl + '?' + query.stringify(queryParams);
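To see what this step produces outside the BLOCKS runtime, here is a minimal sketch in plain JavaScript. The helpers toQueryString and buildAnalyzeUrl are hypothetical names of our own, not part of the BLOCKS codec/query_string module, and the sketch assumes the usual percent-encoding via encodeURIComponent. The page URL is a made-up example.

```javascript
// Hypothetical stand-in for query.stringify from the BLOCKS runtime.
// Assumption: keys and values are percent-encoded with encodeURIComponent.
function toQueryString(params) {
  return Object.keys(params)
    .map((key) => encodeURIComponent(key) + '=' + encodeURIComponent(params[key]))
    .join('&');
}

// Builds the request URL the same way the BLOCK does: base URL + '?' + query string.
function buildAnalyzeUrl(apiUrl, token, pageUrl) {
  return apiUrl + '?' + toQueryString({ token: token, url: pageUrl });
}

// Made-up page URL with characters that must be escaped (space, ?, =, &).
const exampleUrl = buildAnalyzeUrl(
  'https://api.diffbot.com/v3/analyze',
  'YOUR_CLIENT_TOKEN',
  'https://example.com/a page?x=1&y=2'
);
console.log(exampleUrl);
```

The point of the encoding is that the submitted page URL travels as a single url parameter, even when it contains its own query string.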

Finally, we call the analysis endpoint with the given data, decorate the message with a diffbotResponse value containing the parsed JSON analysis data, and catch any errors and log to the BLOCKS console.

    return xhr.fetch(url)
        .then((response) => {
            let json = JSON.parse(response.body);
            request.message.diffbotResponse = json;
            return request.ok();
        })
        .catch((err) => {
            console.log('error happened for XHR.fetch', err);
            return request.ok();
        });
};
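If you’d like to try the same flow outside of BLOCKS, here is a rough sketch with the HTTP call injected as a parameter so it runs in plain Node.js. The names enrichMessage and fakeFetch are hypothetical, the early return for a missing URL is our own addition rather than something the BLOCK above does, and the canned response body is made up for illustration.

```javascript
// Sketch of the BLOCK's flow, runnable without the BLOCKS runtime.
// Assumption: fetchFn behaves like xhr.fetch, taking a URL and resolving
// to an object whose `body` is a JSON string.
const API_URL = 'https://api.diffbot.com/v3/analyze';
const CLIENT_TOKEN = 'YOUR_CLIENT_TOKEN';

function enrichMessage(message, fetchFn) {
  // Our own addition: pass messages without a usable URL through untouched.
  if (!message || typeof message.url !== 'string' || message.url === '') {
    return Promise.resolve(message);
  }
  const url = API_URL +
    '?token=' + encodeURIComponent(CLIENT_TOKEN) +
    '&url=' + encodeURIComponent(message.url);
  return fetchFn(url)
    .then((response) => {
      // Decorate the message with the parsed analysis, as the BLOCK does.
      message.diffbotResponse = JSON.parse(response.body);
      return message;
    })
    .catch((err) => {
      // On any failure (network or bad JSON), log and pass the message along.
      console.log('analysis failed', err);
      return message;
    });
}

// Made-up canned response standing in for the real API.
const fakeFetch = () => Promise.resolve({ body: JSON.stringify({ type: 'article' }) });

enrichMessage({ url: 'https://example.com/story' }, fakeFetch)
  .then((message) => console.log(message));
```

Injecting the fetch function keeps the decoration logic testable without a Diffbot token or a network connection.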

All in all, it doesn’t take a lot of code to add text content analysis to our application.

OK, let’s move on to the UI!

Diving into the Code – the User Interface

You’ll want to grab these 74 lines of HTML & JavaScript and save them to a file called pubnub_diffbot_ui.html.

The first thing you should do after saving the code is to replace two values in the JavaScript:

  • YOUR_PUB_KEY: with the PubNub publish key mentioned above.
  • YOUR_SUB_KEY: with the PubNub subscribe key mentioned above.

If you don’t, the UI will not be able to communicate with anything and will probably clutter your console log with entirely too many errors.
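Since forgetting to swap in the keys is an easy mistake, a small pre-flight check can catch it early. This is only a sketch: findKeyProblems is a hypothetical helper of our own, and it relies solely on the “pub-c-” and “sub-c-” prefixes described earlier. The keys in the second call are made up.

```javascript
// Hypothetical pre-flight check based on the key prefixes described above.
function findKeyProblems(publishKey, subscribeKey) {
  const problems = [];
  if (typeof publishKey !== 'string' || publishKey.indexOf('pub-c-') !== 0) {
    problems.push('publish key should start with "pub-c-"');
  }
  if (typeof subscribeKey !== 'string' || subscribeKey.indexOf('sub-c-') !== 0) {
    problems.push('subscribe key should start with "sub-c-"');
  }
  return problems;
}

// The untouched placeholders are reported as problems.
console.log(findKeyProblems('YOUR_PUB_KEY', 'YOUR_SUB_KEY'));
// Made-up keys with the expected prefixes pass the check.
console.log(findKeyProblems('pub-c-example', 'sub-c-example'));
```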

For your convenience, this code is also available as a Gist on GitHub, and a Codepen as well.

Dependencies

First up, we have the JavaScript code & CSS dependencies of our application.

<!doctype html>
<html>
<head>
  <script src="https://cdn.pubnub.com/pubnub-3.15.1.min.js"></script>
  <script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.5.6/angular.min.js"></script>
  <script src="https://cdn.pubnub.com/sdk/pubnub-angular/pubnub-angular-3.2.1.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/underscore.js/1.8.3/underscore-min.js"></script>
  <link rel="stylesheet" href="https://netdna.bootstrapcdn.com/bootstrap/3.0.2/css/bootstrap.min.css" />
  <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.3/css/font-awesome.min.css" />
</head>
<body>

For folks who have done front-end implementation with AngularJS before, these should be the usual suspects:

  • PubNub JavaScript client: to connect to our data stream integration channel.
  • AngularJS: were you expecting a niftier front-end framework? Impossible!
  • PubNub Angular JavaScript client: provides PubNub services in AngularJS quite nicely indeed.
  • Underscore.js: we could avoid using Underscore.JS, but then our code would be less awesome.

In addition, we bring in 2 CSS features:

  • Bootstrap: in this app, we use it just for vanilla UI presentation.
  • Font-Awesome: we love Font Awesome because it lets us use truetype font characters instead of image-based icons. Pretty sweet!

Overall, we were pretty pleased that we could build a nifty UI with so few dependencies. And with that… on to the UI!

The User Interface

Here’s what we intend the UI to look like:

Diffbot App UI

The UI is pretty straightforward – everything is inside a div tag that is managed by a single controller that we’ll set up in the AngularJS code.

<div class="container" ng-app="PubNubAngularApp" ng-controller="MyTextCtrl">
<pre>
NOTE: make sure to update the PubNub keys below with your keys,
and ensure that the text analysis BLOCK is configured properly!
</pre>
<h3>MyText Content Analysis</h3>

We provide a simple text input for a URL to send to the PubNub channel as well as a button to perform the publish() action.

<input ng-model="toSend" placeholder="type URL here" />
<input type="button" ng-click="publish()" value="Send!" />

Our UI consists of a simple list of messages. We iterate over the messages in the controller scope using a trusty ng-repeat. Each message includes the original URL as well as the text analysis including tags, content type, language, and title. For simplicity, we just display the first object detected (hence objects[0]).

<ul>
  <li ng-repeat="message in messages track by $index">
    <a ng-href="{{message.url}}">{{message.diffbotResponse.title}}</a> by {{message.diffbotResponse.objects[0].author}}
    <br />
    url: <a ng-href="{{message.url}}">{{message.url}}</a>
    <br />
    type: {{message.diffbotResponse.type}}; language: {{message.diffbotResponse.humanLanguage}}
    <br />
    <span style="color:gray" ng-repeat="tag in message.diffbotResponse.objects[0].tags track by $index">{{tag.label}}={{tag.score}}; </span>
  </li>
</ul>
</div>

And that’s it: a functioning real-time UI in just a handful of lines of code (thanks, AngularJS)!
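The tag list for a page can get long, so you may want to show only the strongest few. Here is a minimal sketch of that idea: topTags is a hypothetical helper, it assumes each tag carries the label and score fields that the template above reads, and the sample response is made up rather than real Diffbot output.

```javascript
// Hypothetical helper: the `limit` highest-scoring tags of the first detected
// object, formatted as "label=score" like the template above.
function topTags(diffbotResponse, limit) {
  const first = (diffbotResponse && diffbotResponse.objects && diffbotResponse.objects[0]) || {};
  // Copy before sorting so the original response is left untouched.
  const tags = Array.isArray(first.tags) ? first.tags.slice() : [];
  return tags
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((tag) => tag.label + '=' + tag.score);
}

// Made-up sample data for illustration only.
const sampleResponse = {
  objects: [{
    tags: [
      { label: 'Politics', score: 0.42 },
      { label: 'Economy', score: 0.87 },
      { label: 'Weather', score: 0.15 }
    ]
  }]
};

console.log(topTags(sampleResponse, 2)); // [ 'Economy=0.87', 'Politics=0.42' ]
```

The helper returns an empty list when no objects or tags are present, which keeps the template from tripping over failed analyses.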

The AngularJS Code

Now we’re ready to dive into the AngularJS code. It’s not a ton of JavaScript, so this should hopefully be pretty straightforward.

The first lines we encounter set up our application (with a necessary dependency on the PubNub AngularJS service) and a single controller (which we dub MyTextCtrl). Both of these values correspond to the ng-app and ng-controller attributes from the preceding UI code.

<script>
angular.module('PubNubAngularApp', ["pubnub.angular.service"])
.controller('MyTextCtrl', function($rootScope, $scope, Pubnub) {

Next up, we initialize a bunch of values. First is an array of message objects which starts out empty. After that, we set up the channel as the channel name where we will send and receive real-time structured data messages.

NOTE: make sure this matches the channel specified by your BLOCK configuration and the BLOCK itself!

$scope.messages  = [];
$scope.channel   = 'diffbot-channel';

We initialize the Pubnub object with our PubNub publish and subscribe keys mentioned above, and set a flag on $rootScope to make sure the initialization only occurs once.

NOTE: this uses the v3 API syntax.

  if (!$rootScope.initialized) {
    Pubnub.init({
      publish_key: 'YOUR_PUB_KEY',
      subscribe_key: 'YOUR_SUB_KEY',
      ssl:true
    });
    $rootScope.initialized = true;
  }

The next thing we’ll need is a real-time message callback called msgCallback; it takes care of all the real-time messages we need to handle from PubNub. In our case, we have only one scenario: an incoming message containing a URL decorated with the Diffbot content analysis. The unshift() operation, which puts the newest message at the top of the list, should be in a $scope.$apply() call so that AngularJS gets the idea that a change came in asynchronously.

  var msgCallback = function(payload) {
    $scope.$apply(function() {
      $scope.messages.unshift(payload);
    });
  };
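One thing to keep in mind is that the callback above lets the list grow for as long as the page stays open. A possible variation, sketched here as a standalone function, caps the list at a fixed size. The name addMessage and the limit of 3 are our own choices for illustration, and the payloads are made up.

```javascript
// Hypothetical variation on the callback body: newest message first,
// and the list never grows past maxMessages.
function addMessage(messages, payload, maxMessages) {
  messages.unshift(payload);
  if (messages.length > maxMessages) {
    // Remove everything from index maxMessages onward (the oldest entries).
    messages.splice(maxMessages);
  }
  return messages;
}

// Made-up payloads for illustration.
const messages = [];
['u1', 'u2', 'u3', 'u4'].forEach((url) => addMessage(messages, { url: url }, 3));
console.log(messages.map((m) => m.url)); // [ 'u4', 'u3', 'u2' ]
```

Because the array is modified in place, the same approach should slot into the $scope.$apply() call without changing what ng-repeat is bound to.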

The publish() function takes the contents of the text input, publishes it as a structured data object to the PubNub channel, and resets the text box to empty.

  $scope.publish = function() {
    Pubnub.publish({
      channel: $scope.channel,
      message: {url:$scope.toSend}
    });
    $scope.toSend = "";
  };
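The publish() function above sends whatever is in the text box, including empty input. If you’d rather screen the input first, a check along these lines could work. It is a sketch only: isPublishableUrl is a hypothetical helper, it relies on the standard URL constructor available in modern browsers and Node.js, and limiting input to http and https is our own assumption about what is worth sending for analysis.

```javascript
// Hypothetical guard: accept only absolute http(s) URLs.
function isPublishableUrl(text) {
  if (typeof text !== 'string') {
    return false;
  }
  try {
    // The URL constructor throws on input it cannot parse as an absolute URL.
    const parsed = new URL(text.trim());
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch (err) {
    return false;
  }
}

console.log(isPublishableUrl('https://www.example.com/story')); // true
console.log(isPublishableUrl('not a url'));                     // false
console.log(isPublishableUrl(''));                              // false
```

In the controller, publish() could return early when this check fails, which avoids spending a Diffbot request on input that cannot be analyzed.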

Finally, in the main body of the controller, we subscribe() to the message channel (using the JavaScript v3 API syntax) and bind the events to the callback function we just created.

  Pubnub.subscribe({ channel: [$scope.channel], message: msgCallback });

We mustn’t forget to close out the HTML tags accordingly.

});
</script>
</body>
</html>

Not too shabby for about 74 lines of HTML & JavaScript!

Additional Features

There are several endpoints worth mentioning in the Diffbot API, including the Analyze endpoint used above. You can find detailed documentation for each of them on the Diffbot website.

  • Analyze: analysis of web pages.
  • Article: detailed analysis of web articles.
  • Discussion: detailed analysis of forums and discussion pages.
  • Image: detailed analysis of web-based images.
  • Product: detailed analysis of web-based e-commerce product pages.
  • Video (Beta): analysis of web-based video files.
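If you want to point the BLOCK at one of these endpoints, the change is likely just the apiUrl value. The sketch below assumes each endpoint lives at /v3/ followed by its lowercase name, mirroring the /v3/analyze URL used in the BLOCK; please confirm the exact paths and parameters against the Diffbot documentation. The helper name diffbotEndpointUrl is hypothetical.

```javascript
// Assumption: every endpoint follows the same pattern as /v3/analyze.
const DIFFBOT_ENDPOINTS = ['analyze', 'article', 'discussion', 'image', 'product', 'video'];

// Hypothetical helper that builds the base URL for a named endpoint.
function diffbotEndpointUrl(name) {
  if (DIFFBOT_ENDPOINTS.indexOf(name) === -1) {
    throw new Error('unknown Diffbot endpoint: ' + name);
  }
  return 'https://api.diffbot.com/v3/' + name;
}

console.log(diffbotEndpointUrl('analyze')); // https://api.diffbot.com/v3/analyze
console.log(diffbotEndpointUrl('article'));
```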

All in all, we found it pretty easy to get started with content analysis using the API, and we look forward to using more of the deeper analysis features!

Conclusion

Thank you so much for joining us in the Content Analysis article of our BLOCKS and web services series! Hopefully it’s been a useful experience learning about content-enabled technologies. In future articles, we’ll dive deeper into additional web service APIs and use cases for other nifty services in real-time web applications.

Stay tuned, and please reach out anytime if you feel especially inspired or need any help!
