Testing MCP Servers
So you've built an MCP (Model Context Protocol) server. Great! Now comes the fun part: making sure it actually works. After building the PubNub MCP server and putting it through its paces, I've learned a thing or two about testing these beasts. Let me walk you through what I've discovered.
Two Flavors of Testing: Technical and Behavioral
When testing MCP servers, you're really testing two different things. First, there's the technical correctness: does your server expose the right tools and return valid responses? Second, there's the behavioral aspect: can AI models actually use your tools effectively to solve real problems?
Technical Testing: The Foundation
The bread and butter of MCP testing looks a lot like traditional API testing. In our test.js file, we're doing exactly what you'd expect:
```js
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

const client = new Client({ name: 'test-client', version: '1.0.0' });
const transport = new StdioClientTransport({ command: 'node', args: ['index.js'] });
await client.connect(transport);
```
We spin up a client, connect to our server, and start hammering it with requests. The beauty of MCP is that it's just JSON-RPC under the hood, so your testing patterns should feel familiar.
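That also makes it easy to see what actually crosses the wire: a callTool() invocation is just a tools/call request per the MCP spec. The id and arguments below are illustrative:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "read_pubnub_sdk_docs",
    "arguments": { "language": "javascript" }
  }
}
```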
What's not so obvious is the sheer volume of edge cases you need to cover. Our test suite checks:
- Tool discovery: Does listTools() return all expected tools? (see the sketch after this list)
- Parameter validation: What happens with missing required params?
- Error handling: Invalid enum values, empty arrays, unknown documents
- Content validation: Not just "did it return something," but "did it return the right something"
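The tool-discovery check, for example, can be as short as this (a minimal sketch; the expected names are the two tools discussed in this post):

```js
import assert from 'node:assert';

// Ask the server what it exposes, then check nothing is missing.
const { tools } = await client.listTools();
const names = tools.map((t) => t.name);
for (const expected of ['read_pubnub_sdk_docs', 'read_pubnub_resources']) {
  assert(names.includes(expected), `Missing tool: ${expected}`);
}
```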
A pattern I found particularly useful is testing default behaviors against their explicit equivalents:
```js
// Test with explicit parameter
const explicitResult = await client.callTool({
  name: 'read_pubnub_sdk_docs',
  arguments: { language: 'javascript', apiReference: 'configuration' }
});

// Test with default parameter
const defaultResult = await client.callTool({
  name: 'read_pubnub_sdk_docs',
  arguments: { language: 'javascript' }
});
```
This catches those subtle bugs where your defaults aren't actually working as intended.
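One way to make that comparison concrete, assuming 'configuration' is the intended default for apiReference, is to assert both calls return the same content:

```js
// Assumes 'configuration' is the documented default for apiReference.
assert.strictEqual(
  defaultResult.content[0].text,
  explicitResult.content[0].text,
  'Omitting apiReference should behave like an explicit "configuration"'
);
```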
Behavioral Testing: The Wild West
Here's where things get interesting. Technical tests tell you if your server works, but they don't tell you if AI models can actually use it effectively. Enter test_tools_model.sh: our attempt to validate that real AI models can invoke our tools correctly.
The approach is delightfully straightforward: give an AI model access to your tools and see if it knows what to do with them. We fire a bunch of prompts at GPT-4 and check whether it chooses to call our tools:
```bash
PROMPTS=(
  "Write a PubNub App that tracks user presence..."
  "Create a PubNub-powered social mapping app..."
  "Please retrieve the API reference for JavaScript SDK..."
)
```
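Conceptually, each prompt check boils down to the sketch below. The real harness is a shell script; the OpenAI client usage, the PROMPTS array as a JavaScript value, and the toOpenAITools() conversion helper are all assumptions for illustration:

```js
import OpenAI from 'openai';
import assert from 'node:assert';

const openai = new OpenAI();

// Hypothetical helper: converts the MCP tool list into the
// function-calling schema that the chat completions API expects.
const tools = toOpenAITools((await client.listTools()).tools);

for (const prompt of PROMPTS) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: prompt }],
    tools,
  });
  // Did the model decide to call one of our tools at all?
  const calls = completion.choices[0].message.tool_calls ?? [];
  assert(calls.length > 0, `Model never called a tool for: ${prompt}`);
}
```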
What I love about this approach is that it surfaces problems you'd never catch in unit tests. For instance, maybe your tool description is technically correct but confusing to AI models. Or perhaps your parameter names don't match what models expect.
The Devil Is in the Details
Error Handling That Actually Helps
One thing I learned the hard way: your error messages need to be precise. When we test invalid parameters, we're not just checking that an error occurs: we're validating the error message contains specific text:
```js
assert(
  err.message.includes('Invalid arguments for tool read_pubnub_resources'),
  `Unexpected error for invalid document: ${err.message}`
);
```
This might seem pedantic, but trust me: when a model gets a vague error message, it'll just keep retrying with equally wrong parameters.
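For context, that assertion lives inside the usual negative-test wrapper. Here's the shape of it; the document argument name and its invalid value are examples, not the server's exact schema:

```js
let err;
try {
  await client.callTool({
    name: 'read_pubnub_resources',
    arguments: { document: 'no_such_document' }, // deliberately invalid
  });
} catch (e) {
  err = e;
}
assert(err, 'Expected an error for an invalid document');
assert(
  err.message.includes('Invalid arguments for tool read_pubnub_resources'),
  `Unexpected error for invalid document: ${err.message}`
);
```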
Content Validation Beyond "Something Returned"
A rookie mistake is just checking result.content.length > 0. You need to validate the shape and substance of your responses:
```js
assert(
  sdkDefaultResult.content[0].text.includes('# PubNub Presence Best Practices'),
  'Expected presence best practices header in default output.'
);
```
This catches cases where your tool returns something, but not the right something.
The Multiple Channels Problem
Real-world usage is messier than your happy path tests. Our MCP server needs to handle requests for multiple channels, empty channel lists, and channel groups. Each of these scenarios needs its own test case because each one can fail in unique ways.
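A table-driven loop keeps those cases from sprawling. In this sketch, the tool name and argument shape are hypothetical placeholders rather than the server's actual API:

```js
const channelCases = [
  { label: 'single channel', args: { channels: ['chat'] } },
  { label: 'multiple channels', args: { channels: ['chat', 'alerts'] } },
  { label: 'empty channel list', args: { channels: [] }, expectError: true },
  { label: 'channel group', args: { channelGroups: ['friends'] } },
];

for (const { label, args, expectError } of channelCases) {
  try {
    // 'pubnub_channel_tool' is a placeholder name for illustration.
    const result = await client.callTool({ name: 'pubnub_channel_tool', arguments: args });
    assert(!expectError, `${label}: expected an error, got a result`);
    assert(result.content.length > 0, `${label}: empty response`);
  } catch (err) {
    assert(expectError, `${label}: unexpected error: ${err.message}`);
  }
}
```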
Terminal Colors: Because Humans
One small quality-of-life improvement that made a huge difference:
```js
const FG_GREEN = '\x1b[32m';
const FG_RED = '\x1b[31m';
const BOLD = '\x1b[1m';
const RESET = '\x1b[0m';

// Keep a reference to the real console.log before monkey-patching it.
const originalLog = console.log;
console.log = (...args) => {
  const msg = args.join(' ');
  if (/successfully|passed/.test(msg)) {
    originalLog(`${FG_GREEN}${BOLD}${msg}${RESET}`);
  } else if (/failed|error/i.test(msg)) {
    originalLog(`${FG_RED}${BOLD}${msg}${RESET}`);
  } else {
    originalLog(...args);
  }
};
```
When you're running dozens of tests, being able to quickly spot failures saves your sanity. Green for success, red for failure: it's the little things.
Testing Strategy That Scales
Here's the testing strategy that's worked for us:
- Start with the happy path: basic functionality for each tool
- Add parameter validation: required params, optional params, defaults
- Test edge cases: empty arrays, invalid enums, missing documents
- Validate error handling: both that errors occur and that they're helpful
- Add behavioral tests: can models actually use your tools?
The key insight is that MCP testing is really two different disciplines. Your technical tests should be comprehensive and fast: they're your safety net during development. Your behavioral tests should be realistic and representative: they tell you if your server is actually useful in practice.
More Value from Testing MCP Servers
Testing MCP servers isn't rocket science, but it's also not trivial. The protocol itself is straightforward, but the interaction between AI models and your tools creates a whole new class of testing challenges.
Start with solid technical tests to catch the obvious bugs, then layer on behavioral tests to catch the subtle ones. And remember: if a model can't figure out how to use your tool, neither will your users.
The goal isn't just a working MCP server: it's a server that AI models can wield effectively. That's a much higher bar, but it's also what makes the difference between a toy and a tool that people actually want to use.
Ready to test your own MCP server like a pro?
Sign up for a free PubNub account and start building with our MCP server today!