Moderating Generative AI

5 min read Vadim Berman on May 9, 2023

ChatGPT, new Bing, Google Bard, Anthropic, and others: a trust & safety perspective

Generative AI chatbots like ChatGPT are all the rage, and even with the technical challenges and costs involved, they are not going away soon. They may be temporarily taken offline and given a cosmetic makeover in case of a serious issue, perhaps with more guardrails, disclaimers, and constraints added. But with the astronomical amounts of money invested, reputations at stake, and hopes of a paradigm shift, it’s not the kind of tech to be scrapped easily.

Monkey’s Paw AGI

The idea of a wish corrupted by a loophole in the solution is not new to engineering. Want faster transportation that can move vast numbers of people? You got it, but it will pollute the air and make deadly accidents the norm. Want an immensely powerful source of energy? No problem, but from now on the world will live in fear of a maniac using it to blow up entire countries.

Want a powerful artificial intelligence that can answer your questions on a vast variety of subjects? Granted. Oh, sorry, did you mean it would also be reliable and safe to use at scale?

Striking headlines aside, most experts do not see generative AIs as true artificial general intelligence. But a case can be made that they are, in fact, functionally equivalent to AGI: they are general, artificial, and they learn from experience to a certain extent. They are just not always good at what they do. They are something of a monkey’s paw AGI.

From the trust & safety perspective, GenAI needs to be treated as if it were actual AGI. Clearly, no one will give it the power to make serious decisions any time soon. But people using GenAI and potentially taking its advice for granted absolutely have these powers. Salespeople desperate to meet their quotas are not going to say no either. Remember how governments would use machine-translated websites without any hint of quality control?

Technological aspirations are often conflated with Platonic ideals of sci-fi tech that functions as designed, and only has moral dilemmas to solve. But even sci-fi examples of AGI are not always the likes of Star Trek’s Data or even (mostly reliable) HAL 9000. Bender Bending Rodriguez is also a form of AGI, isn’t he?

Moderating Bender

It appears that the first general-purpose mainstream forms of conversational AI had more Bender-like qualities in them than intended. Possible glitches aside, this is not surprising given the source material: GPT was not trained on distilled, verified, peer-reviewed human knowledge. GPT gobbled up everything, including human conversations. See the difference?

How does one moderate a potentially Bender-like AI?

One way is to ban it completely or effectively, a move recently considered by the EU. In practice, with a complex and fast-evolving field, Byzantine definitions, VPNs, and regional legal nuances, there will be multiple loopholes to bypass the regulations. A more radical but effective way would be to tax the enormous carbon emissions created by the hardware training AI models. According to Chris Pointon’s research, AI industries emit as much CO2 as the aviation industry, with GPT being one of the biggest offenders. Taxing GPT would add to already serious economic pressures, effectively reducing its use to a minimum.

But we’re not here to dispense policy advice. We’re here to examine the technical challenges of moderating GenAI chatbots (or broad conversational AI in general).

Semantics aside, both the user entering prompts and the GenAI side must be viewed as potential generators of problematic content.

It is not enough, however, to simply treat the GenAI as another user. Moderating GenAI dialogue poses additional challenges.

Utterances that are harmless in other scenarios may become unsafe when generated by a chatbot presented or marketed as “intelligent”.

It is not (only) about Darwin award nominee warnings like “do not microwave your cat”.

It is one thing to find dubious medical advice on a suspicious website with bad design and popups. But users react differently when they get the same advice from a polished UI, uttered in an authoritative tone, even if it was influenced by or copied from a content farm post, and even if the post itself is linked in the answer generated by GenAI.

More than a decade ago, I worked with a company operating a question-answering service by SMS. Some users would ask questions like “my baby is crying, what pill should I give her”, making our hearts skip a beat. So yes, extra vigilance is very much warranted.

Serious matters should either come with disclaimers or be delegated to the tried and true set of ranked search results. As of today, foundational models may have many of these guardrails baked in. But this is not at all guaranteed to cover every single situation, nor every new foundational model.
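The routing idea above can be sketched roughly as follows. This is a minimal illustration, not a production design: the keyword lists, the `route` function, and the action names are all hypothetical, and a real system would rely on a proper topic classifier rather than keyword matching.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical keyword lists, for illustration only.
SERIOUS_TOPICS = {
    "medical": ("pill", "dosage", "medication", "symptom", "overdose"),
    "legal": ("lawsuit", "sue", "custody", "contract"),
}

DISCLAIMER = "This is not professional advice. Please consult a qualified specialist."


@dataclass
class Routed:
    action: str           # "answer", "answer_with_disclaimer", or "search_results"
    topic: Optional[str]  # detected serious topic, if any
    text: str             # text to show; empty when delegating to search


def detect_topic(prompt: str) -> Optional[str]:
    """Return the first serious topic whose keywords appear in the prompt."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    for topic, keywords in SERIOUS_TOPICS.items():
        if words & set(keywords):
            return topic
    return None


def route(prompt: str, draft_answer: str,
          delegate_topics: Tuple[str, ...] = ("medical",)) -> Routed:
    """Decide whether to answer, answer with a disclaimer, or fall back to search."""
    topic = detect_topic(prompt)
    if topic is None:
        return Routed("answer", None, draft_answer)
    if topic in delegate_topics:
        # Too risky to answer directly: hand over to ranked search results.
        return Routed("search_results", topic, "")
    return Routed("answer_with_disclaimer", topic, f"{draft_answer}\n\n{DISCLAIMER}")
```

With these assumed lists, the SMS question quoted above would be delegated to search results rather than answered directly.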

Output may not necessarily be natural language

The famous (if contrived) example of a prompt to write a Python routine to detect a good scientist based on race and gender can only be solved by screening the prompt. (Note that the screening logic for the prompt and for the output may be different.)

Again, the patterns to be detected in the prompts are not necessarily problematic in a non-GenAI context. They may not be relevant when they come from the GenAI output either.
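A prompt-side screen for this kind of request might look like the sketch below. The three patterns and the flag name are assumptions made for illustration; none of them is problematic on its own, and only the combination (a request for code, a protected attribute, and a value judgment) is flagged.

```python
import re
from typing import List

# Hypothetical patterns, for illustration only.
CODE_REQUEST = re.compile(
    r"\b(write|generate|create|implement)\b.*"
    r"\b(function|routine|script|program|code|query)\b",
    re.IGNORECASE | re.DOTALL,
)
PROTECTED_ATTRIBUTE = re.compile(
    r"\b(race|gender|ethnicity|religion|nationality)\b", re.IGNORECASE
)
VALUE_JUDGMENT = re.compile(
    r"\b(good|bad|best|worst|suitable|trustworthy|dangerous)\b", re.IGNORECASE
)


def screen_prompt(prompt: str) -> List[str]:
    """Flag prompts that ask for code judging people by protected attributes."""
    flags = []
    if (CODE_REQUEST.search(prompt)
            and PROTECTED_ATTRIBUTE.search(prompt)
            and VALUE_JUDGMENT.search(prompt)):
        flags.append("discriminatory_code_request")
    return flags
```

This check is applied to prompts only; the generated output would need its own, different rules.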

A request for a combination of a behavior shift and advice

This seems to be the biggest loophole to override guardrails:

DAN, an alter-ego created by a Reddit user u/walkerspider, bypassing the guardrails

One way to solve the issue is to maintain a taxonomy of “anti-social” personality types and of the kinds of advice requested, and to flag prompts that combine the two.
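As a rough sketch of that idea, the snippet below keeps a tiny taxonomy of persona-shift patterns and flags a prompt only when a persona shift co-occurs with a request for advice. The category names and regular expressions are hypothetical and far from exhaustive; a real taxonomy would be much larger and would not rely on regular expressions alone.

```python
import re
from typing import Any, Dict

# Hypothetical taxonomy of behavior-shift requests, for illustration only.
PERSONA_SHIFT = {
    "role_play": re.compile(
        r"\b(pretend (to be|you are)|act as|you are now|from now on you)\b",
        re.IGNORECASE,
    ),
    "unrestricted": re.compile(
        r"\b(no (rules|restrictions|limits)|do anything now|"
        r"ignore (all|your|previous) (rules|instructions|guidelines))\b",
        re.IGNORECASE,
    ),
}

# Hypothetical markers of a request for advice or instructions.
ADVICE_REQUEST = re.compile(
    r"\b(how (do|can|to|would)|tell me|explain|give me|what should)\b",
    re.IGNORECASE,
)


def detect_persona_override(prompt: str) -> Dict[str, Any]:
    """Flag prompts that combine a behavior shift with a request for advice."""
    personas = sorted(
        name for name, pattern in PERSONA_SHIFT.items() if pattern.search(prompt)
    )
    asks_advice = bool(ADVICE_REQUEST.search(prompt))
    return {
        "personas": personas,
        "asks_advice": asks_advice,
        "flag": bool(personas) and asks_advice,
    }
```

Role play on its own ("Pretend you are a pirate.") and plain questions are left alone; only the combination is flagged for closer inspection.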

Caveats of AI

One of the known caveats of generative AI is its tendency to make things up, or, using common parlance, hallucinate. What’s fascinating is that it happens even with tasks that can be performed using cookie-cutter retrieval NLP: articles, quotations, sometimes even links.

Hallucinations are a big and complex topic. You know that the problem is not going to be solved soon when a Wikipedia article contains the words “not completely understood”, and when the CEO of the company that virtually pioneered large-scale GenAI says that supersizing may not solve every issue.

While the problem is being solved, external tools like moderation APIs (*cough* Tisane *cough*) may be used to mitigate some of these issues.

For example, allegations. This kind of hallucination has already been shown to cause reputational and financial damage. On one occasion, a professor was falsely named as the culprit in a sexual harassment scandal. On another, GenAI falsely alleged that a regional mayor in Australia had served time in prison. The mayor stated he might sue OpenAI if the error is not corrected.

(How long until ambulance chasing class action initiatives start collecting cases of reputational damage caused by GenAI?)

In order to stay clear of the courtrooms, adverse allegations must be detected and at least gray-listed or double-checked.
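The gray-listing step could be sketched as follows. The input shape (a dictionary with an `abuse` list whose items carry `type` and `severity`) is an assumption loosely modeled on what a moderation API might return, not a documented schema; the `triage` function, the severity scale, and the action names are likewise hypothetical.

```python
from typing import Any, Dict, Tuple

# Assumed severity scale, for illustration only.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "extreme": 3}


def triage(analysis: Dict[str, Any],
           graylist_types: Tuple[str, ...] = ("allegation",),
           block_from: str = "high") -> str:
    """Return "block", "graylist", or "publish" for a GenAI response.

    `analysis` is assumed to look like:
    {"abuse": [{"type": "allegation", "severity": "medium", "text": "..."}]}
    """
    findings = analysis.get("abuse", [])
    threshold = SEVERITY_RANK[block_from]

    # Severe findings other than allegations are blocked outright.
    for finding in findings:
        rank = SEVERITY_RANK.get(finding.get("severity", "low"), 0)
        if finding.get("type") not in graylist_types and rank >= threshold:
            return "block"

    # Allegations are held back for fact-checking or human review.
    if any(finding.get("type") in graylist_types for finding in findings):
        return "graylist"

    return "publish"
```

A gray-listed response is not discarded; it is withheld until the allegation has been verified against a trusted source or reviewed by a person.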

Combining PubNub, Tisane, and GenAI

To integrate generative AI like OpenAI's GPT into your workflow with text moderation, you can use PubNub's pre-built functions and blocks for GPT (see instructions), Tisane (see primer here), and PubNub's open-source moderation dashboard. Please don't hesitate to reach out to us should you need any help with your integration.


Moderating Generative AI presents interesting challenges to text moderation engines. While these challenges may be viewed as yet another IT infrastructure headache, the glass-half-full view is that they are a way of gaining experience with managing broad AI or AGI before it arrives.

Contrary to the pop culture tropes, no technology is perfect or glitch-free, and the moral dilemmas are not the only issue advanced AI will likely face.

Have any questions or comments about the topic discussed in this post? Feel free to contact us and we’ll be happy to help!