Adversarial AI Detection Research

End-to-end pipeline for adversarial fine-tuning against AI text detectors

2025-07 — Present
Python · PyTorch · HuggingFace Transformers · OpenRouter · Parquet

A multi-phase research project exploring vulnerabilities in AI text detection systems. Phase 0 builds a large-scale data pipeline that synthesizes diverse generation, editing, and humanization training examples, using dual AI detectors (Argus + ZeroGPT) to retain only samples both score as confidently machine-like. Phase 1 uses this dataset for RL fine-tuning with PPO, training an LLM to produce text that evades detection while preserving writing quality.
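The detector gate in Phase 0 can be sketched as a weighted blend of the two detector confidences with per-detector and composite floors. This is illustrative only: the function names, weights, and thresholds here are hypothetical, not the project's actual configuration, and the real pipeline queries Argus and ZeroGPT via their respective APIs.

```python
def composite_ai_score(argus_score: float, zerogpt_score: float,
                       argus_weight: float = 0.5) -> float:
    """Weighted blend of two detector confidences, each in [0, 1]."""
    return argus_weight * argus_score + (1 - argus_weight) * zerogpt_score

def passes_gate(argus_score: float, zerogpt_score: float,
                composite_min: float = 0.9,
                per_detector_min: float = 0.8) -> bool:
    """Keep a sample for humanization training only if each detector is
    individually confident AND the weighted blend clears the bar.
    Thresholds are illustrative placeholders."""
    if min(argus_score, zerogpt_score) < per_detector_min:
        return False
    return composite_ai_score(argus_score, zerogpt_score) >= composite_min
```

Requiring both a per-detector floor and a composite minimum guards against one detector's overconfidence dragging an ambiguous sample past the gate.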

Key Highlights

  • Phase 0: automated data pipeline producing ~8K generation prompts, ~2.5K synthetic editing sources, and ~1.5K detector-curated humanization examples
  • Dual detector gating: composite Argus + ZeroGPT scoring with configurable thresholds to retain only high-confidence AI text for humanization training
  • Deterministic diversity planner across 16 text types with controlled axes: topic, tone, complexity, length, persona, sentence structure
  • Strict batch QA with enforced length tiers, minimum spread, banned patterns, and safety rules — failed batches trigger automatic retries
  • Hybrid editing corpus: real revision pairs from Grammarly CoEdIT + IteraTeR, plus LLM-generated synthetic sources
  • Instruction design with controlled specificity: granularities from sentence to full rewrite, mapped to precise/medium/generic buckets via stable hashing
  • Canonical Parquet assembly with deterministic 80/20 train/eval splits and baseline evaluation harness
  • Resumable API runner with JSONL persistence for long-running OpenRouter batch jobs
  • Phase 1: PPO-based RL training achieving a 1.2% detection rate at a 20% confidence threshold
  • Distributed training across 8 NVIDIA GPUs with 7.5x throughput scaling
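The "stable hashing" behind the specificity buckets (precise/medium/generic) can be sketched as below. This is an assumed implementation, not the project's code: the point is that a cryptographic hash of the example ID gives a reproducible assignment, whereas Python's built-in `hash()` is salted per process and would reshuffle buckets on every run.

```python
import hashlib

BUCKETS = ("precise", "medium", "generic")

def specificity_bucket(example_id: str) -> str:
    """Map an example ID to an instruction-specificity bucket via a
    stable hash, so the mapping is identical across runs and machines."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    return BUCKETS[digest[0] % len(BUCKETS)]
```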
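A deterministic 80/20 train/eval split like the one in the Parquet assembly step is commonly done by hashing each row's ID into [0, 1) and comparing against the eval fraction; a minimal sketch under that assumption (the function name and ID scheme are hypothetical):

```python
import hashlib

def split_of(example_id: str, eval_fraction: float = 0.2) -> str:
    """Deterministic train/eval assignment: hash the ID to a uniform
    value in [0, 1) and compare to the eval fraction. The split never
    changes as rows are added, removed, or reordered."""
    h = int.from_bytes(hashlib.sha256(example_id.encode("utf-8")).digest()[:8], "big")
    return "eval" if (h / 2**64) < eval_fraction else "train"
```

Unlike a shuffled index split, this keeps every example's assignment fixed even when the dataset grows between pipeline runs.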
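The resumable JSONL runner pattern can be sketched as follows. This is a simplified illustration, not the project's runner: `call_api` stands in for the real OpenRouter request, and the record schema is assumed. The idea is that each result is appended to a JSONL log the moment it completes, so an interrupted batch resumes by skipping IDs already on disk.

```python
import json
import os

def load_done_ids(path: str) -> set:
    """IDs already completed in a previous run, read from the JSONL log."""
    done = set()
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                done.add(json.loads(line)["id"])
    return done

def run_batch(jobs, call_api, log_path="results.jsonl"):
    """Run each job once: skip jobs whose IDs are already logged, and
    flush every new result immediately so progress survives a crash."""
    done = load_done_ids(log_path)
    with open(log_path, "a") as log:
        for job in jobs:
            if job["id"] in done:
                continue
            result = call_api(job)
            log.write(json.dumps({"id": job["id"], "result": result}) + "\n")
            log.flush()
```

Re-running the same command after a failure is then idempotent: completed jobs cost nothing, and only the remainder hits the API.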