Identification Methods for Generative Texts

This article covers methods for identifying text generated by generative AI (Gen AI) tools. It is becoming common practice for employees, students, and the public at large to hand their writing over to ChatGPT and ask it to produce everything from a letter to a farewell speech to an apology. There are ways to detect whether a text was written by a human or a machine. Some of these are explained below.

What is generative text?

AI-generated text is text created with artificial intelligence systems. The most common type of such a system is the large language model (LLM). These systems are trained on very large datasets, which lets them learn how people compose sentences. They can mimic the structure, style, and conventions of human communication to produce new text that is virtually indistinguishable from human-written text.

At the very core of such models is a large corpus of training data. A model picks up grammar, context, and subject knowledge from this corpus by adjusting its parameters during training, and then uses those parameters to generate new text in response to prompts.
▶ For instance, GPT-4, a popular LLM, is widely reported to use about 1.76 trillion parameters, although OpenAI has not officially confirmed the figure.

Human Identification

While AI has become increasingly sophisticated, humans can still spot certain patterns and inconsistencies that indicate AI-generated text. Here are some key indicators:

Lack of Nuance and Emotion:

Overly formal or stiff language: AI often struggles to replicate the nuances of human emotion and expression.
Absence of personal anecdotes or experiences: Humans often share personal stories to connect with their audience.

Repetitive Patterns and Lack of Creativity:

Repeated phrases or sentence structures: AI might overuse certain patterns due to its reliance on learned data.
Lack of originality or creativity: AI can struggle to come up with truly unique ideas or perspectives.

Inconsistent Tone and Style:

Sudden shifts in writing style or tone: Humans tend to maintain a consistent voice, while AI might struggle with transitions.
Unnatural or awkward phrasing: AI may generate sentences that sound unnatural or forced.

Factual Errors and Inconsistencies:

Incorrect information or contradictions: AI can sometimes make mistakes or contradict itself due to errors in its training data.

Lack of Contextual Understanding:

Difficulty understanding nuances or cultural references: AI may struggle to grasp subtle meanings or cultural context.

Specific Indicators:

Overuse of certain words or phrases: AI might have a tendency to overuse specific words or phrases that it has learned from its training data.
Unnatural sentence structure: AI may generate sentences that are grammatically correct but sound awkward or unnatural.
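
A reader can back up the "repeated phrases" indicator with a quick count. The sketch below is a minimal illustration using only the Python standard library. The function name, the sample text, and the choice of five-word phrases are assumptions made for this example, and a high count is a hint to look closer, not proof of AI authorship.

```python
import re
from collections import Counter


def repeated_phrases(text, n=3, min_count=2):
    """Return word n-grams that occur at least min_count times, most frequent first."""
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words over the text and count each distinct phrase.
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return [(" ".join(g), c) for g, c in ngrams.most_common() if c >= min_count]


# Made-up sample text, written to contain an obviously repeated phrase.
sample = (
    "It is important to note that the results vary. "
    "It is important to note that context matters. "
    "Overall, it is important to note that no single signal is conclusive."
)

for phrase, count in repeated_phrases(sample, n=5):
    print(count, phrase)
```

Shorter phrases (n=2 or n=3) flag many ordinary expressions, so longer windows tend to be more informative on longer documents.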

Fingerprinting Methods

Another popular method is based on fingerprinting techniques that rely on a watermark embedded in AI-generated text in advance. The intention is to leave a trace in the text that human readers do not notice but that tools can detect. It is similar to the watermarking used in images and documents.

Certain irregular features are introduced into the text preemptively. For instance, OpenAI has reportedly worked on a fingerprinting method under which its generated text would contain a slightly higher proportion of words starting with a certain letter than is found in natural text. Such methods could eventually find use in fields such as open banking and finance.
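
The letter-frequency idea can be checked with a simple statistical test. The sketch below is a minimal illustration and not OpenAI's actual method: the function name, the chosen letter, the toy texts, and the baseline rate of 0.07 are all assumptions made up for the example. It computes a z-score for how far the share of words starting with a given letter exceeds the share expected in ordinary text.

```python
import math
import re


def first_letter_z_score(text, letter, baseline_rate):
    """z-score of how far the share of words starting with `letter`
    exceeds `baseline_rate`, the share expected in ordinary text."""
    words = re.findall(r"[A-Za-z]+", text)
    n = len(words)
    if n == 0:
        raise ValueError("text contains no words")
    hits = sum(1 for w in words if w[0].lower() == letter.lower())
    # Under the "no watermark" hypothesis the hit count is binomial(n, baseline_rate).
    expected = n * baseline_rate
    std_dev = math.sqrt(n * baseline_rate * (1 - baseline_rate))
    return (hits - expected) / std_dev


# Toy inputs: 30% vs. 7% of words start with "s". The 0.07 baseline is hypothetical.
watermarked = "sun " * 30 + "moon " * 70
ordinary = "sun " * 7 + "moon " * 93

z_marked = first_letter_z_score(watermarked, "s", baseline_rate=0.07)
z_plain = first_letter_z_score(ordinary, "s", baseline_rate=0.07)
print(f"watermarked: {z_marked:.2f}, ordinary: {z_plain:.2f}")
```

A large positive z-score (say, above 4) would suggest the text carries the hypothetical watermark. Short texts contain too few words for the test to be meaningful.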

AI-based tools to identify AI-generated text

Some of the publicly available tools for identifying AI-generated text are ZeroGPT and Turnitin. ZeroGPT needs users to register for its service, after which they can upload a document or paste text to check whether it was generated by AI. Similarly, Turnitin, an extremely popular plagiarism detection tool, can also be used. However, not all Turnitin subscription plans include this feature.

Fingerprint-based detection methods for AI-generated text aim to identify unique patterns or characteristics that distinguish human-written content from AI-generated content. These approaches typically involve analyzing the text at a granular level, looking for specific features that are indicative of AI involvement. Here are some common approaches:

1. Statistical Analysis:

Stylometric Analysis: This method examines the statistical properties of the text, such as word frequency, sentence length, and syntactic complexity. AI-generated text often exhibits different statistical patterns compared to human-written text.
Burstiness Analysis: This technique measures the distribution of words and phrases in the text. Human-written text tends to have more bursts of repeated words or phrases, while AI-generated text may exhibit a more uniform distribution.
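
Both kinds of statistics are easy to compute. The sketch below is a rough illustration using only the Python standard library. The function names and the particular features chosen are assumptions, and the burstiness score uses one common definition, B = (sigma - mu) / (sigma + mu), applied to the gaps between successive occurrences of a word. Real detectors may define burstiness differently.

```python
import re
import statistics


def stylometric_features(text):
    """Compute a few simple stylometric statistics for a non-empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    lengths = [len(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    return {
        "mean_sentence_length": statistics.mean(lengths),
        "sentence_length_stdev": statistics.pstdev(lengths),
        "type_token_ratio": len(set(words)) / len(words),  # vocabulary richness
        "mean_word_length": statistics.mean(len(w) for w in words),
    }


def burstiness(text, word):
    """Burstiness of `word`: -1 means evenly spaced occurrences, values
    near 0 look random, and values towards 1 mean clustered bursts.
    Returns None if the word occurs fewer than three times."""
    words = re.findall(r"[a-z']+", text.lower())
    positions = [i for i, w in enumerate(words) if w == word.lower()]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None
    mu = statistics.mean(gaps)
    sigma = statistics.pstdev(gaps)
    return (sigma - mu) / (sigma + mu)


features = stylometric_features("The cat sat. The dog ran away quickly!")
even = burstiness("a b a b a b a b", "a")
clustered = burstiness("a a a x x x x x x x x a", "a")
print(features)
print(even, clustered)
```

These numbers only become useful when compared against reference values measured on known human-written and known AI-generated text of the same genre.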

2. Machine Learning and Deep Learning Models:

Neural Networks: Deep learning models, such as recurrent neural networks (RNNs) or transformers, can be trained to distinguish between human-written and AI-generated text. These models learn complex patterns and features in the data to make accurate predictions.
Support Vector Machines (SVMs): SVMs are machine learning algorithms that can be used to classify text based on its features. By extracting relevant features from the text, SVMs can effectively differentiate between human and AI-generated content.
Generative Adversarial Networks (GANs): A GAN pairs a generator, which produces realistic synthetic text, with a discriminator, which is trained to tell that generated text apart from real samples. The trained discriminator can then be used to flag text that looks machine-generated.
Autoencoders: Autoencoders can be trained to learn the underlying structure of human-written text. By comparing the reconstructed text from an autoencoder to the original text, it’s possible to detect anomalies that indicate AI-generated content.
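
To make the classifier idea concrete, here is a minimal sketch of a linear SVM trained by sub-gradient descent on the hinge loss, written in plain Python instead of with a library such as scikit-learn. Everything specific in it is an assumption for illustration: the two features (imagined as standardized sentence-length variation and vocabulary richness), the six toy training points, the label convention, and the hyperparameters. A real detector would need far more data and far richer features.

```python
def train_linear_svm(samples, labels, epochs=200, learning_rate=0.01, reg=0.01):
    """Fit weights w and bias b by sub-gradient descent on the
    L2-regularised hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularisation term shrinks the weights on every step.
            w = [wi - learning_rate * reg * wi for wi in w]
            if margin < 1:
                # Sample is misclassified or inside the margin: move the boundary.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


def predict(w, b, x):
    """Return +1 (AI-generated) or -1 (human-written) for feature vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Invented, already-standardized toy features; not measured from real text.
samples = [
    [1.0, 0.8], [0.8, 1.1], [1.2, 0.6],        # labelled human-written (-1)
    [-1.0, -0.9], [-0.7, -1.2], [-1.1, -0.5],  # labelled AI-generated (+1)
]
labels = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(samples, labels)
print(predict(w, b, [0.9, 0.9]), predict(w, b, [-0.9, -0.8]))
```

The toy data is linearly separable by construction, so the classifier looks perfect here. On real text the two classes overlap heavily, which is one reason detectors produce false positives.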

3. Natural Language Processing (NLP) Techniques:

Part-of-Speech Tagging: This technique identifies the grammatical function of words in a sentence. AI-generated text may exhibit different patterns of part-of-speech tags compared to human-written text. These differences can be detected, though not easily.
Named Entity Recognition (NER): NER identifies named entities in the text, such as people, organizations, or locations. AI-generated text may contain errors or inconsistencies in the named entities it mentions, which NER can help surface.

It’s important to note that these methods are not foolproof, and as AI technology continues to advance, new techniques may be developed to improve detection accuracy. Additionally, combining multiple approaches can often provide more reliable results, depending on your goal.