Adversarial AI Detection Research

End-to-end pipeline for adversarial fine-tuning against AI text detectors

2025-07 — Present
Python · PyTorch · HuggingFace Transformers · OpenRouter · Parquet

A multi-phase research project exploring vulnerabilities in AI text detection systems. Phase 0 builds a large-scale data pipeline that synthesizes diverse generation, editing, and humanization training examples, using dual AI detectors (Argus + ZeroGPT) to retain only samples both score as confidently machine-like. Phase 1 uses this dataset for RL fine-tuning with PPO, training an LLM to produce text that evades detection while preserving writing quality.
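The detector gate in Phase 0 can be sketched as a weighted blend of the two detector confidences with per-detector and composite floors. This is illustrative only: the function names, weights, and thresholds here are hypothetical, not the project's actual configuration, and the real pipeline queries Argus and ZeroGPT via their respective APIs.

```python
def composite_ai_score(argus_score: float, zerogpt_score: float,
                       argus_weight: float = 0.5) -> float:
    """Weighted blend of two detector confidences, each in [0, 1]."""
    return argus_weight * argus_score + (1 - argus_weight) * zerogpt_score

def passes_gate(argus_score: float, zerogpt_score: float,
                composite_min: float = 0.9,
                per_detector_min: float = 0.8) -> bool:
    """Keep a sample for humanization training only if each detector is
    individually confident AND the weighted blend clears the bar.
    Thresholds are illustrative placeholders."""
    if min(argus_score, zerogpt_score) < per_detector_min:
        return False
    return composite_ai_score(argus_score, zerogpt_score) >= composite_min
```

Requiring both a per-detector floor and a composite minimum guards against one detector's overconfidence dragging an ambiguous sample past the gate.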

Key Highlights

  • Phase 0: automated data pipeline producing ~8K generation prompts, ~2.5K synthetic editing sources, and ~1.5K detector-curated humanization examples
  • Dual detector gating: composite Argus + ZeroGPT scoring with configurable thresholds to retain only high-confidence AI text for humanization training
  • Deterministic diversity planner across 16 text types with controlled axes: topic, tone, complexity, length, persona, sentence structure
  • Strict batch QA with enforced length tiers, minimum spread, banned patterns, and safety rules — failed batches trigger automatic retries
  • Hybrid editing corpus: real revision pairs from Grammarly CoEdIT + IteraTeR, plus LLM-generated synthetic sources
  • Instruction design with controlled specificity: granularities from sentence to full rewrite, mapped to precise/medium/generic buckets via stable hashing
  • Canonical Parquet assembly with deterministic 80/20 train/eval splits and baseline evaluation harness
  • Resumable API runner with JSONL persistence for long-running OpenRouter batch jobs
  • Phase 1: PPO-based RL training achieving a 1.2% detection rate at a 20% confidence threshold
  • Distributed training across 8 NVIDIA GPUs with 7.5x throughput scaling
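The "stable hashing" behind the specificity buckets (precise/medium/generic) can be sketched as below. This is an assumed implementation, not the project's code: the point is that a cryptographic hash of the example ID gives a reproducible assignment, whereas Python's built-in `hash()` is salted per process and would reshuffle buckets on every run.

```python
import hashlib

BUCKETS = ("precise", "medium", "generic")

def specificity_bucket(example_id: str) -> str:
    """Map an example ID to an instruction-specificity bucket via a
    stable hash, so the mapping is identical across runs and machines."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    return BUCKETS[digest[0] % len(BUCKETS)]
```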
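A deterministic 80/20 train/eval split like the one in the Parquet assembly step is commonly done by hashing each row's ID into [0, 1) and comparing against the eval fraction; a minimal sketch under that assumption (the function name and ID scheme are hypothetical):

```python
import hashlib

def split_of(example_id: str, eval_fraction: float = 0.2) -> str:
    """Deterministic train/eval assignment: hash the ID to a uniform
    value in [0, 1) and compare to the eval fraction. The split never
    changes as rows are added, removed, or reordered."""
    h = int.from_bytes(hashlib.sha256(example_id.encode("utf-8")).digest()[:8], "big")
    return "eval" if (h / 2**64) < eval_fraction else "train"
```

Unlike a shuffled index split, this keeps every example's assignment fixed even when the dataset grows between pipeline runs.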
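The resumable JSONL runner pattern can be sketched as follows. This is a simplified illustration, not the project's runner: `call_api` stands in for the real OpenRouter request, and the record schema is assumed. The idea is that each result is appended to a JSONL log the moment it completes, so an interrupted batch resumes by skipping IDs already on disk.

```python
import json
import os

def load_done_ids(path: str) -> set:
    """IDs already completed in a previous run, read from the JSONL log."""
    done = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                done.add(json.loads(line)["id"])
    return done

def run_batch(jobs, call_api, log_path="results.jsonl"):
    """Run each job once: skip jobs whose IDs are already logged, and
    flush every new result immediately so progress survives a crash."""
    done = load_done_ids(log_path)
    with open(log_path, "a") as log:
        for job in jobs:
            if job["id"] in done:
                continue
            result = call_api(job)
            log.write(json.dumps({"id": job["id"], "result": result}) + "\n")
            log.flush()
```

Re-running the same command after a failure is then idempotent: completed jobs cost nothing, and only the remainder hits the API.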