Diffbot is a powerful API that extracts web data and content from articles, products, discussions, images, and more. Using AI, computer vision, and natural language processing, the API understands objects from any webpage and retrieves clean, structured data.
A perfect fit for the PubNub BLOCKS Catalog, our new Diffbot block for analyzing and extracting web data allows you to process incoming real-time messages with attached URLs, and append contextual information about each website to that stream. For example, if you were to submit a URL to a New York Times article, Diffbot and PubNub would output a message that includes article type, language, operating system, and titles.
So, what exactly is this content analysis that Diffbot provides? In this case, content analysis refers to taking a piece of text and trying to extract features such as author and language, assign tags with numeric scores (indicating confidence), and even provide semantic categories and relationships. Content analysis has a number of challenges, including incomplete context, ambiguity, sarcasm, slang, international languages and issues with domain-specific text analysis.
In this tutorial, we’ll dive into a simple example of how to enable textual content analysis in a real-time AngularJS web application using 25 lines of the PubNub JavaScript BLOCK and 74 lines of HTML and JavaScript. In the end, you’ll have an app with this basic functionality:
As we prepare to explore our sample web application with content analysis features, let’s check out the underlying Diffbot Analysis API.
Automated text and content analysis services are quite challenging to build and train on your own; they require substantial effort and engineering resources to maintain across a diverse array of application domains and user languages (not to mention immense compute resources and training sets!).
On the other hand, the Diffbot Analysis APIs make it easy to enable your applications with straightforward text content analysis.
Looking closer at the APIs, text content analysis is just the beginning. There are a lot of API methods available for things like image and video processing and categorization, discussion analysis and more. It really is a powerful tool for distilling meaning from text, images and video. In this article though, we’ll keep it simple and just implement a basic text content analysis for user-provided URLs.
Since you’re reading this at PubNub, we’ll presume you have a real-time application use case in mind. In the sections below, we’ll dive into the content analysis use case, saving other web service use cases for the future.
To get started, you’ll need a PubNub account, which includes your unique publish and subscribe keys. Once you’ve signed up, you’ll find that the publish and subscribe keys look like UUIDs and start with “pub-c-” and “sub-c-” prefixes respectively. Keep those handy: you’ll need to plug them in when initializing the PubNub object in your HTML5 app below.
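As a quick sanity check before wiring the keys into the app, a small helper can confirm that each key carries the expected prefix. This is only a sketch: `checkKeyPrefixes` is a hypothetical helper, not part of the PubNub SDK, and it checks nothing beyond the two prefixes mentioned above.

```javascript
// Hypothetical helper (not part of any PubNub SDK): checks only the
// "pub-c-" / "sub-c-" prefixes described above, not whether the keys are valid.
function checkKeyPrefixes(publishKey, subscribeKey) {
  const problems = [];
  if (typeof publishKey !== 'string' || !publishKey.startsWith('pub-c-')) {
    problems.push('publish key should start with "pub-c-"');
  }
  if (typeof subscribeKey !== 'string' || !subscribeKey.startsWith('sub-c-')) {
    problems.push('subscribe key should start with "sub-c-"');
  }
  return problems;
}

// Unreplaced placeholders are reported as two problems.
console.log(checkKeyPrefixes('YOUR_PUB_KEY', 'YOUR_SUB_KEY'));
```

An empty array means both keys at least look right; it says nothing about whether PubNub will accept them.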
PubNub plays together really well with JavaScript because the PubNub JavaScript SDK is extremely robust and has been battle-tested over the years across a huge number of mobile and backend installations. The SDK is currently on its 4th major release, which features a number of improvements such as isomorphic JavaScript, new network components, unified message/presence/status notifiers, and much more.
NOTE: for compatibility with the PubNub AngularJS SDK, our UI code will use the PubNub JavaScript v3 API syntax. We expect the AngularJS API to be v4-compatible soon. In the meantime, please stay alert when jumping between different versions of JS code!
Next you’ll need a Diffbot account. Head over to the Diffbot signup form and sign up for a free trial, and make note of the API credentials (client token) sent to the registration email address.
Next is getting started with PubNub BLOCKS.
That’s all it takes to create your serverless code running in the cloud with BLOCKS!
You’ll want to grab the 25 lines of BLOCK JavaScript and save them to a file called pubnub_diffbot_block.js. It’s available as a Gist on GitHub for your convenience.
First up, we declare our dependency on xhr (for making HTTP requests) and query (for building URL query strings).
export default request => {
  let xhr = require('xhr');
  let query = require('codec/query_string');
Next, we set up variables for accessing the service (the client token from previous steps and API url).
  let clientToken = 'YOUR_CLIENT_TOKEN';
  let apiUrl = 'https://api.diffbot.com/v3/analyze';
Next, we set up the HTTP params for the analysis API request. We use a GET request to submit the data (by default). We use the client token to authenticate our request to the API. We pass the URL attribute from the message.
let queryParams = { token: clientToken, url: request.message.url };
Next, we create the URL from the given parameters.
let url = apiUrl + '?' + query.stringify(queryParams);
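To see what the resulting request URL looks like outside of BLOCKS, here is a minimal sketch that can be run in Node.js. It assumes the standard `URLSearchParams` global as a stand-in for the BLOCKS `codec/query_string` module (the two may differ in encoding details), and `buildAnalyzeUrl` is a hypothetical helper name.

```javascript
// Stand-in for the BLOCKS query-string module: URLSearchParams is a standard
// global in Node.js and browsers (assumed here to encode comparably).
function buildAnalyzeUrl(clientToken, pageUrl) {
  const apiUrl = 'https://api.diffbot.com/v3/analyze';
  const params = new URLSearchParams({ token: clientToken, url: pageUrl });
  return apiUrl + '?' + params.toString();
}

console.log(buildAnalyzeUrl('YOUR_CLIENT_TOKEN', 'https://example.com/story?id=1'));
```

The point to notice is that the submitted page URL is percent-encoded, so its own `:`, `/`, `?` and `=` characters cannot be confused with those of the API request.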
Finally, we call the analysis endpoint with the given data, decorate the message with a diffbotResponse value containing the parsed JSON analysis data, and catch any errors and log to the BLOCKS console.
  return xhr.fetch(url)
    .then((response) => {
      let json = JSON.parse(response.body);
      request.message.diffbotResponse = json;
      return request.ok();
    })
    .catch((err) => {
      console.log('error happened for XHR.fetch', err);
      return request.ok();
    });
};
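The decoration step can also be tried in isolation, without BLOCKS or a network call. The sketch below is an illustration only: `decorateMessage` is a hypothetical helper (the block does this inline in its `.then()` handler), and the sample response body is made up rather than real Diffbot output.

```javascript
// Sketch of the decoration step on its own. `decorateMessage` is a
// hypothetical helper, not part of BLOCKS or the Diffbot API.
function decorateMessage(message, responseBody) {
  try {
    message.diffbotResponse = JSON.parse(responseBody);
  } catch (err) {
    // Malformed body: leave the message as it was, mirroring the block's
    // choice to let the original message through when something fails.
    console.log('could not parse analysis response', err.message);
  }
  return message;
}

// Made-up response body for illustration only (not real Diffbot output).
const sampleBody = JSON.stringify({ type: 'article', humanLanguage: 'en' });
const decorated = decorateMessage({ url: 'https://example.com/story' }, sampleBody);
console.log(decorated.diffbotResponse.type);
```

Either way, subscribers still receive the original `url`; the `diffbotResponse` property is simply absent when the analysis could not be attached.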
All in all, it doesn’t take a lot of code to add text content analysis to our application.
OK, let’s move on to the UI!
You’ll want to grab these 74 lines of HTML & JavaScript and save them to a file called pubnub_diffbot_ui.html.
The first thing you should do after saving the code is to replace two values in the JavaScript: YOUR_PUB_KEY (with your PubNub publish key) and YOUR_SUB_KEY (with your PubNub subscribe key).
If you don’t, the UI will not be able to communicate with anything and probably clutter your console log with entirely too many errors.
For your convenience, this code is also available as a Gist on GitHub, and a Codepen as well.
First up, we have the JavaScript code & CSS dependencies of our application.
<!doctype html>
<html>
<head>
  <script src="https://cdn.pubnub.com/pubnub-3.15.1.min.js"></script>
  <script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.5.6/angular.min.js"></script>
  <script src="https://cdn.pubnub.com/sdk/pubnub-angular/pubnub-angular-3.2.1.min.js"></script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/underscore.js/1.8.3/underscore-min.js"></script>
  <link rel="stylesheet" href="//netdna.bootstrapcdn.com/bootstrap/3.0.2/css/bootstrap.min.css" />
  <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.6.3/css/font-awesome.min.css" />
</head>
<body>
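For folks who have done front-end implementation with AngularJS before, these should be the usual suspects: the PubNub JavaScript client library, AngularJS, the PubNub AngularJS SDK, and Underscore.js.
In addition, we bring in 2 CSS features: Bootstrap and Font Awesome.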
Overall, we were pretty pleased that we could build a nifty UI with so few dependencies. And with that… on to the UI!
Here’s what we intend the UI to look like:
The UI is pretty straightforward: everything is inside a div tag that is managed by a single controller that we’ll set up in the AngularJS code.
<div class="container" ng-app="PubNubAngularApp" ng-controller="MyTextCtrl">
<pre> NOTE: make sure to update the PubNub keys below with your keys, and ensure that the text analysis BLOCK is configured properly! </pre>
<h3>MyText Content Analysis</h3>
We provide a simple text input for a URL to send to the PubNub channel as well as a button to perform the publish() action.
<input ng-model="toSend" placeholder="type URL here" />
<input type="button" ng-click="publish()" value="Send!" />
Our UI consists of a simple list of messages. We iterate over the messages in the controller scope using a trusty ng-repeat. Each message includes the original URL as well as the text analysis including tags, content type, language, and title. For simplicity, we just display the first object detected (hence objects[0]).
<ul>
  <li ng-repeat="message in messages track by $index">
    <a href="{{message.url}}">{{message.diffbotResponse.title}}</a> by {{message.diffbotResponse.objects[0].author}}
    <br />
    url: <a href="{{message.url}}">{{message.url}}</a>
    <br />
    type: {{message.diffbotResponse.type}}; language: {{message.diffbotResponse.humanLanguage}}
    <br />
    <span style="color:gray" ng-repeat="tag in message.diffbotResponse.objects[0].tags track by $index">{{tag.label}}={{tag.score}}; </span>
  </li>
</ul>
</div>
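The template reads several nested fields from each message, so it can help to see the same lookups as plain JavaScript. The following is a sketch under stated assumptions: `summarizeMessage` is a hypothetical helper, the fallback strings are arbitrary, and the sample payload is made up to match the field names used in the template rather than copied from real Diffbot output.

```javascript
// Hypothetical helper: gathers the fields the template displays, with
// fallbacks for when the analysis is missing or contains no objects.
function summarizeMessage(message) {
  const analysis = message.diffbotResponse || {};
  const first = (analysis.objects || [])[0] || {};
  const tags = (first.tags || []).map(tag => tag.label + '=' + tag.score);
  return {
    url: message.url,
    title: analysis.title || '(no title)',
    author: first.author || '(unknown)',
    type: analysis.type || '(unknown)',
    language: analysis.humanLanguage || '(unknown)',
    tags: tags.join('; ')
  };
}

// Made-up payload for illustration only (not real Diffbot output).
const sampleMessage = {
  url: 'https://example.com/story',
  diffbotResponse: {
    title: 'Example Story',
    type: 'article',
    humanLanguage: 'en',
    objects: [{
      author: 'A. Writer',
      tags: [{ label: 'Technology', score: 0.9 }, { label: 'Web', score: 0.5 }]
    }]
  }
};

const summary = summarizeMessage(sampleMessage);
console.log(summary.title, 'by', summary.author, '|', summary.tags);
```

AngularJS expressions already tolerate missing properties by rendering nothing, so a helper like this is mostly useful when the same data is consumed outside a template.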
And that’s it: a functioning real-time UI in just a handful of lines of code (thanks, AngularJS)!
Now we’re ready to dive into the AngularJS code. It’s not a ton of JavaScript, so this should hopefully be pretty straightforward.
The first lines we encounter set up our application (with a necessary dependency on the PubNub AngularJS service) and a single controller (which we dub MyTextCtrl). Both of these values correspond to the ng-app and ng-controller attributes from the preceding UI code.
<script>
angular.module('PubNubAngularApp', ["pubnub.angular.service"])
.controller('MyTextCtrl', function($rootScope, $scope, Pubnub) {
Next up, we initialize a bunch of values. First is an array of message objects which starts out empty. After that, we set up the channel as the channel name where we will send and receive real-time structured data messages.
NOTE: make sure this matches the channel specified by your BLOCK configuration and the BLOCK itself!
  $scope.messages = [];
  $scope.channel = 'diffbot-channel';
We initialize the Pubnub object with our PubNub publish and subscribe keys mentioned above, and set a flag on $rootScope to make sure the initialization only occurs once.
NOTE: this uses the v3 API syntax.
  if (!$rootScope.initialized) {
    Pubnub.init({
      publish_key: 'YOUR_PUB_KEY',
      subscribe_key: 'YOUR_SUB_KEY',
      ssl: true
    });
    $rootScope.initialized = true;
  }
The next thing we’ll need is a real-time message callback called msgCallback; it takes care of all the real-time messages we need to handle from PubNub. In our case, we have only one scenario: an incoming message containing a URL decorated with its content analysis. The unshift() operation should be in a $scope.$apply() call so that AngularJS gets the idea that a change came in asynchronously.
  var msgCallback = function(payload) {
    $scope.$apply(function() {
      $scope.messages.unshift(payload);
    });
  };
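Because msgCallback uses unshift(), the newest message always lands at the top of the list. The sketch below shows that ordering outside of AngularJS, and adds one thing the original app does not have: an assumed cap on the list length. `addMessage` and `MAX_MESSAGES` are hypothetical names chosen for illustration.

```javascript
// Hypothetical variation on the callback's list handling: newest message
// first, with an assumed cap so the list does not grow without bound.
const MAX_MESSAGES = 3; // illustrative limit, not from the original app

function addMessage(messages, payload) {
  messages.unshift(payload);          // newest first, as in msgCallback
  if (messages.length > MAX_MESSAGES) {
    messages.length = MAX_MESSAGES;   // drop the oldest entries
  }
  return messages;
}

const received = [];
['a', 'b', 'c', 'd'].forEach(url => addMessage(received, { url: url }));
console.log(received.map(m => m.url));
```

In the real controller, a call like this would still need to happen inside `$scope.$apply()` so the view picks up the change.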
The publish() function takes the contents of the text input, publishes it as a structured data object to the PubNub channel, and resets the text box to empty.
  $scope.publish = function() {
    Pubnub.publish({
      channel: $scope.channel,
      message: {url:$scope.toSend}
    });
    $scope.toSend = "";
  };
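The publish() function sends whatever is in the text box, including empty or malformed input. If you would rather not spend an analysis call on such input, one option is a small check before publishing. This is a sketch, not part of the original app: `toUrlMessage` is a hypothetical helper built on the standard `URL` constructor, and limiting input to http(s) is an assumption.

```javascript
// Hypothetical pre-publish check: returns a message payload for a usable
// http(s) URL, or null when the input should not be published.
function toUrlMessage(input) {
  const text = (input || '').trim();
  let parsed;
  try {
    parsed = new URL(text);   // throws a TypeError on invalid input
  } catch (err) {
    return null;
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return null;
  }
  return { url: text };
}

console.log(toUrlMessage('  https://example.com/story  '));
console.log(toUrlMessage('not a url'));
```

Inside the controller, publish() could call a helper like this on `$scope.toSend` and skip `Pubnub.publish()` when it returns null.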
Finally, in the main body of the controller, we subscribe() to the message channel (using the JavaScript v3 API syntax) and bind the events to the callback function we just created.
  Pubnub.subscribe({
    channel: [$scope.channel],
    message: msgCallback
  });
We mustn’t forget to close out the HTML tags accordingly.
});
</script>
</body>
</html>
Not too shabby for about 74 lines of HTML & JavaScript!
There are a couple other endpoints worth mentioning in the Diffbot API.
You can find detailed API documentation here.
All in all, we found it pretty easy to get started with content analysis using the API, and we look forward to using more of the deeper analysis features!
Thank you so much for joining us in the Content Analysis article of our BLOCKS and web services series! Hopefully it’s been a useful experience learning about content-enabled technologies. In future articles, we’ll dive deeper into additional web service APIs and use cases for other nifty services in real time web applications.
Stay tuned, and please reach out anytime if you feel especially inspired or need any help!