Identification Methods for Generative Texts

This article covers methods for identifying generative text created with Gen AI tools. It is becoming common practice for employees, students, and the public at large to hand their writing to ChatGPT and ask it to produce everything from a letter to a farewell speech to an apology. There are ways to detect whether a text was written by a human or a machine. Some of these are explained below.

What is generative text?

AI-generated text is text created with artificial intelligence systems. The most common type of such system is the large language model (LLM). These systems are trained on very large datasets, which lets them learn how we compose sentences. They can mimic the structure, style, and methods of human communication to produce new text that is virtually indistinguishable from human-written text.

Human Identification

While AI has become increasingly sophisticated, humans can still spot certain patterns and inconsistencies that indicate AI-generated text. Here are some key indicators:

  1. Lack of Nuance and Emotion:
     • Overly formal or stiff language: AI often struggles to replicate the nuances of human emotion and expression.
     • Absence of personal anecdotes or experiences: Humans often share personal stories to connect with their audience.
  2. Repetitive Patterns and Lack of Creativity:
     • Repeated phrases or sentence structures: AI might overuse certain patterns due to its reliance on learned data.
     • Lack of originality or creativity: AI can struggle to come up with truly unique ideas or perspectives.
  3. Inconsistent Tone and Style:
     • Sudden shifts in writing style or tone: Humans tend to maintain a consistent voice, while AI might struggle with transitions.
     • Unnatural or awkward phrasing: AI may generate sentences that sound unnatural or forced.
  4. Factual Errors and Inconsistencies:
     • Incorrect information or contradictions: AI can sometimes make mistakes or contradict itself due to errors in its training data.
  5. Lack of Contextual Understanding:
     • Difficulty understanding nuances or cultural references: AI may struggle to grasp subtle meanings or cultural context.
  6. Specific Indicators:
     • Overuse of certain words or phrases: AI might have a tendency to overuse specific words or phrases that it has learned from its training data.
     • Unnatural sentence structure: AI may generate sentences that are grammatically correct but sound awkward or unnatural.
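
The "repeated phrases" indicator can be checked mechanically as a rough first pass. The following is a minimal sketch, assuming that counting repeated three-word sequences is a reasonable stand-in for "repetitive patterns"; the function name `repeated_ngrams`, its parameters, and the sample sentence are illustrative and not taken from any particular detection tool.

```python
import re
from collections import Counter


def repeated_ngrams(text, n=3, min_count=2):
    """Return the word n-grams that occur at least min_count times."""
    # Lowercase and keep only alphabetic words (apostrophes allowed).
    words = re.findall(r"[a-z']+", text.lower())
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return {" ".join(g): c for g, c in grams.items() if c >= min_count}


# Illustrative sample text, written for this example only.
sample = (
    "It is important to note that tone matters. "
    "It is important to note that style matters too."
)

for phrase, count in repeated_ngrams(sample).items():
    print(f"{count}x  {phrase}")
```

A high count alone does not prove machine authorship, since human writers repeat stock phrases too. It is best treated as one signal among several.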

Fingerprinting Methods

Another popular method relies on fingerprinting: a watermark is embedded in the AI-generated text at the time it is produced. The intention is to leave a trace in the text that humans do not notice but tools can detect. It is similar to the watermarking used in images and documents.

Certain irregular features are introduced into the text in advance. For instance, OpenAI introduced a fingerprinting method to identify AI text, in which the generated text has a slightly higher proportion of words starting with a certain letter than is found in natural text. Such techniques could eventually find use in fields such as open banking and finance.
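
To make the letter-proportion idea concrete, here is a minimal sketch that measures the share of words beginning with a given letter and compares it to a baseline. The baseline share and the margin are hypothetical placeholders chosen for illustration; this is not OpenAI's actual method, and a real detector would need a measured baseline and a proper statistical test over a long text.

```python
import re


def initial_letter_share(text, letter):
    """Fraction of words in text that start with the given letter."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w[0].lower() == letter.lower())
    return hits / len(words)


def looks_marked(text, letter, baseline_share, margin=0.05):
    """Flag text whose share exceeds the baseline by more than margin.

    baseline_share and margin are assumptions supplied by the caller;
    they are not published values.
    """
    return initial_letter_share(text, letter) > baseline_share + margin


# Illustrative sentence, deliberately overloaded with words starting with "s".
sample = "silver ships sail south while seven sailors sing"
share = initial_letter_share(sample, "s")
print(f"share of s-words: {share:.3f}")
print("flagged:", looks_marked(sample, "s", baseline_share=0.10))
```

On short passages the share fluctuates heavily by chance, so a check like this is only meaningful for longer texts.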

In a series of recent updates, major LLMs are reported to have introduced hidden markers in AI-generated text. The most commonly cited markers are invisible space characters and the em dash. This type of fingerprinting and detection is discussed in section 4 below.

AI-based tools to identify AI-generated text

Publicly available tools for identifying AI-generated text include ZeroGPT and Turnitin. ZeroGPT requires users to register for its service, after which they can upload a document or paste text to check whether it was generated by AI. Similarly, Turnitin, an extremely popular plagiarism detection tool, can also be used. However, not all Turnitin subscription plans include this feature.

Fingerprint-based detection methods for AI-generated text aim to identify unique patterns or characteristics that distinguish human-written content from AI-generated content. These approaches typically involve analyzing the text at a granular level, looking for specific features that are indicative of AI involvement. Here are some common approaches:

1. Statistical Analysis:

  • Stylometric Analysis: This method examines the statistical properties of the text, such as word frequency, sentence length, and syntactic complexity. AI-generated text often exhibits different statistical patterns compared to human-written text.
  • Burstiness Analysis: This technique measures the distribution of words and phrases in the text. Human-written text tends to have more bursts of repeated words or phrases, while AI-generated text may exhibit a more uniform distribution.
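
The two statistical ideas above can be sketched with the Python standard library alone. This is a rough illustration, not a production detector: the choice of features is an assumption, and the burstiness score uses one common definition, B = (sigma - mu) / (sigma + mu) over the gaps between occurrences of a word, as a stand-in for whatever measure a given tool actually uses.

```python
import re
import statistics


def tokenize(text):
    """Lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def stylometric_features(text):
    """A few simple stylometric measurements, or None for empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not sentences or not words:
        return None
    lengths = [len(tokenize(s)) for s in sentences]
    return {
        "mean_sentence_length": statistics.mean(lengths),
        "sentence_length_stdev": statistics.pstdev(lengths),
        "type_token_ratio": len(set(words)) / len(words),
    }


def word_burstiness(text, word):
    """Burstiness of one word, from the gaps between its occurrences.

    Returns a value in [-1, 1): -1 means perfectly regular spacing,
    higher values mean the word arrives in clusters. Returns None
    when the word occurs fewer than three times.
    """
    positions = [i for i, w in enumerate(tokenize(text)) if w == word]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None
    mu = statistics.mean(gaps)
    sigma = statistics.pstdev(gaps)
    return (sigma - mu) / (sigma + mu)


print(stylometric_features("The cat sat. The cat ran far away. Go."))
print(word_burstiness("a b a b a b a b", "a"))
```

Comparing these numbers against the same measurements on known human-written text from the same genre is what gives them meaning; the raw values on their own say little.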

2. Machine Learning and Deep Learning Models:

  • Neural Networks: Deep learning models, such as recurrent neural networks (RNNs) or transformers, can be trained to distinguish between human-written and AI-generated text. These models learn complex patterns and features in the data to make accurate predictions.
  • Support Vector Machines (SVMs): SVMs are machine learning algorithms that can be used to classify text based on its features. By extracting relevant features from the text, SVMs can effectively differentiate between human and AI-generated content.
  • Generative Adversarial Networks (GANs): GANs can be used to generate realistic AI-generated text, which can then be compared to the original text to identify discrepancies.
  • Autoencoders: Autoencoders can be trained to learn the underlying structure of human-written text. By comparing the reconstructed text from an autoencoder to the original text, it’s possible to detect anomalies that indicate AI-generated content.
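
As a small illustration of the feature-based classification idea, the sketch below trains a linear SVM from scratch using sub-gradient descent on the hinge loss. Everything here is an assumption made for the example: the two features are imagined to be standardized stylometric scores, the six training points are made-up toy values rather than real measurements, and the hyperparameters are arbitrary. In practice one would use an established machine learning library and a large labelled corpus.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Fit weights w and bias b by per-sample sub-gradient steps on
    the L2-regularized hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:
                # Inside the margin or misclassified: hinge term is active.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: only the regularizer acts.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 (human) or -1 (AI) for one feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical toy data: [sentence-length variation, vocabulary richness],
# already standardized. +1 = human-written, -1 = AI-generated.
X = [[1.0, 0.5], [0.8, 0.9], [1.2, 0.7],
     [-1.0, -0.6], [-0.9, -0.8], [-1.1, -0.4]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print("prediction for a new sample:", predict(w, b, [0.9, 0.6]))
```

The toy data is cleanly separable by design, which real text features rarely are, so the accuracy of a sketch like this says nothing about accuracy on real documents.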

3. Natural Language Processing (NLP) Techniques:

  • Part-of-Speech Tagging: This technique identifies the grammatical function of words in a sentence. AI-generated text may exhibit different patterns of part-of-speech tags compared to human-written text, which can be detected, though not easily.
  • Named Entity Recognition (NER): NER identifies named entities in the text, such as people, organizations, or locations. AI-generated text may contain errors or inconsistencies in the named entities it mentions.

4. Hidden markers in text

Plagiarism and the submission of AI-generated text have been highlighted as major concerns since the early days of generative AI. One technique commonly attributed to LLM chatbots is the use of hidden markers in the text. In a public statement, Hendrik Kirchner of OpenAI mentioned that they had been working on a prototype for these types of markers as early as November 2022. Google DeepMind’s SynthID-Text was one of the first attempts at inserting these types of markers in text. They started deploying it in their public Gemini and Gemini Advanced chatbots around October 2024.

One type of marker is the zero-width character: the zero width space (U+200B), zero width non-joiner (U+200C), or zero width joiner (U+200D). These take up no visible space but are still valid Unicode characters that LLMs process. They can be seen in advanced text editors like Sublime. Another commonly cited marker is the em dash (U+2014) that ChatGPT often uses. It looks like this: — . Now you will recognize it the next time you see it. There are about 87 different invisible space characters that could potentially be used. Apart from these hidden characters, text may also unintentionally include the special tokens that generative AI models use. These are summarized below:
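
A scan for the characters named above is straightforward with the standard library. The sketch below is a minimal example that checks only the four code points mentioned in this section (U+200B, U+200C, U+200D, and U+2014); the function names are illustrative, and the list of suspect characters would need extending to cover other invisible characters.

```python
import unicodedata

# Only the code points named in this section; extend as needed.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d"}
SUSPECT_CHARS = ZERO_WIDTH | {"\u2014"}  # U+2014 is the em dash


def find_markers(text):
    """Return (index, code point, Unicode name) for each suspect character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch in SUSPECT_CHARS
    ]


def strip_zero_width(text):
    """Remove the zero-width characters, leaving visible text untouched."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


# Illustrative string with a hidden zero width space and an em dash.
sample = "Hello\u200bworld \u2014 ok"
for index, code_point, name in find_markers(sample):
    print(index, code_point, name)
```

Keep in mind that these characters also have legitimate uses. Zero width joiners appear in emoji sequences and in scripts such as Arabic and Devanagari, and many human writers use em dashes, so a hit is a hint rather than proof.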

Marker                LLM Model       Purpose
### or <|im_start|>   OpenAI models   Delimits the text
[INST] and [/INST]    Llama 2         Encapsulates the instruction prompt
‘<                    GPT-2, GPT-3    End of text
[SEP]                 BERT            Separator between text segments

Table: Commonly used markers in generative text. Note that most of these markers are stripped by the chat UI when users copy the text.
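
Leftover template markers of this kind can be found with a simple pattern search. The sketch below covers only the markers that are legible in the table above; the list and the function name are illustrative, and "###" in particular will also match ordinary Markdown headings, so its hits need a manual look.

```python
import re

# Markers taken from the table above; "###" is prone to false positives.
TEMPLATE_MARKERS = ["<|im_start|>", "[INST]", "[/INST]", "[SEP]", "###"]

# Escape each marker so brackets and pipes are matched literally.
MARKER_RE = re.compile("|".join(re.escape(m) for m in TEMPLATE_MARKERS))


def find_template_markers(text):
    """Return (position, marker) pairs in order of appearance."""
    return [(m.start(), m.group()) for m in MARKER_RE.finditer(text)]


# Illustrative string, written for this example only.
sample = "[INST] Write a poem [/INST] Roses are red [SEP]"
for position, marker in find_template_markers(sample):
    print(position, marker)
```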

It’s important to note that these methods are not foolproof, and as AI technology continues to advance, new techniques may be developed to improve detection accuracy. Additionally, combining multiple approaches often gives more reliable results, depending on your goal.

