Adversarial AI Detection Research
End-to-end pipeline for adversarial fine-tuning against AI text detectors
2025-07 — Present
Python · PyTorch · HuggingFace Transformers · OpenRouter · Parquet
A multi-phase research project exploring the vulnerabilities of AI text detection systems. Phase 0 builds a large-scale data pipeline that synthesizes diverse generation, editing, and humanization training examples — using dual AI detectors (Argus + ZeroGPT) to curate samples that are confidently machine-like. Phase 1 uses this dataset for RL fine-tuning with PPO, training an LLM to produce text that evades detection while preserving writing quality.
Key Highlights
- Phase 0: automated data pipeline producing ~8K generation prompts, ~2.5K synthetic editing sources, and ~1.5K detector-curated humanization examples
- Dual detector gating: composite Argus + ZeroGPT scoring with configurable thresholds to retain only high-confidence AI text for humanization training
- Deterministic diversity planner across 16 text types with controlled axes: topic, tone, complexity, length, persona, sentence structure
- Strict batch QA with enforced length tiers, minimum spread, banned patterns, and safety rules — failed batches trigger automatic retries
- Hybrid editing corpus: real revision pairs from Grammarly CoEdIT + IteraTeR, plus LLM-generated synthetic sources
- Instruction design with controlled specificity: granularities from sentence-level edits to full rewrites, mapped to precise/medium/generic buckets via stable hashing
- Canonical Parquet assembly with deterministic 80/20 train/eval splits and baseline evaluation harness
- Resumable API runner with JSONL persistence for long-running OpenRouter batch jobs
- Phase 1: PPO-based RL training achieving a 1.2% detection rate at a 20% confidence threshold
- Distributed training across 8 NVIDIA GPUs with 7.5x throughput scaling
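The dual-detector gate above can be sketched as a weighted blend of the two detector scores with a cutoff. This is a minimal illustration, not the project's code: the function names, the 0.5/0.5 weighting, and the 0.9 threshold are all assumptions standing in for the configurable values mentioned in the highlights.

```python
# Hypothetical sketch of the Argus + ZeroGPT gate; weights and
# threshold are illustrative assumptions, not the real config.
def composite_ai_score(argus_score: float, zerogpt_score: float,
                       argus_weight: float = 0.5) -> float:
    """Weighted blend of two detector probabilities, each in [0, 1]."""
    return argus_weight * argus_score + (1.0 - argus_weight) * zerogpt_score

def passes_gate(argus_score: float, zerogpt_score: float,
                threshold: float = 0.9) -> bool:
    """Keep a sample for humanization training only when the composite
    score says it is confidently machine-like."""
    return composite_ai_score(argus_score, zerogpt_score) >= threshold

# Both detectors flag the text with high confidence -> retained.
print(passes_gate(0.97, 0.94))  # True
```

Blending two detectors rather than trusting either alone reduces the chance of training the humanizer on text only one model considers AI-like.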
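A deterministic diversity planner like the one in the highlights can be approximated by enumerating the cross-product of the controlled axes and sampling with a fixed seed. The axis names and values below are placeholder assumptions, far smaller than the project's 16 text types and six axes.

```python
import itertools
import random

# Illustrative axes only; the real planner spans topic, tone,
# complexity, length, persona, and sentence structure.
AXES = {
    "topic": ["science", "travel", "finance"],
    "tone": ["formal", "casual"],
    "complexity": ["simple", "advanced"],
    "length": ["short", "medium", "long"],
}

def plan(n: int, seed: int = 0):
    """Enumerate the full cross-product of axes, then draw n combos
    with a fixed seed so the plan is identical run-to-run."""
    combos = [dict(zip(AXES, values))
              for values in itertools.product(*AXES.values())]
    rng = random.Random(seed)  # fixed seed -> deterministic plan
    return rng.sample(combos, k=min(n, len(combos)))

assert plan(5) == plan(5)  # reproducible across invocations
```

Seeding a local `random.Random` (rather than the module-level RNG) keeps the plan independent of any other randomness in the pipeline.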
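The stable-hashing assignment of instructions to specificity buckets can be sketched as follows. The helper name and equal bucket ratios are assumptions; the key idea is using a cryptographic hash instead of Python's built-in `hash()`, which is salted per process and therefore not stable.

```python
import hashlib

BUCKETS = ["precise", "medium", "generic"]  # assumed equal weighting

def specificity_bucket(instruction_id: str) -> str:
    """SHA-256 of the id yields the same bucket on every run and every
    machine, so the precise/medium/generic mix never drifts."""
    digest = hashlib.sha256(instruction_id.encode("utf-8")).hexdigest()
    return BUCKETS[int(digest, 16) % len(BUCKETS)]
```

Because the bucket is a pure function of the id, regenerating part of the corpus cannot silently reshuffle specificity labels.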
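Deterministic train/eval splitting for the Parquet assembly can use the same hash-the-id trick: map each example id to a number in [0, 1) and compare it to the eval fraction. This is a generic sketch under assumed names, not the project's implementation.

```python
import hashlib

def split_of(example_id: str, eval_fraction: float = 0.2) -> str:
    """Hash the id into [0, 1); ids below the eval fraction go to eval.
    The assignment is stable across runs and independent of row order."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "eval" if u < eval_fraction else "train"
```

Unlike a shuffled index split, this keeps every example on the same side of the 80/20 boundary even as rows are added or reordered, so the eval set never leaks into training between pipeline runs.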
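The resumable API runner with JSONL persistence can be sketched like this. Everything here is an assumption about shape, not the real runner: `call_api` stands in for the actual OpenRouter request, and the record schema is illustrative.

```python
import json
import pathlib

def run_batch(prompts, call_api, out_path="results.jsonl"):
    """Resume-safe batch runner: each completed call is appended to a
    JSONL file, and already-persisted ids are skipped on restart."""
    path = pathlib.Path(out_path)
    done = set()
    if path.exists():  # resume: collect ids finished in earlier runs
        with path.open() as f:
            done = {json.loads(line)["id"] for line in f}
    with path.open("a") as f:
        for pid, prompt in prompts:
            if pid in done:
                continue  # already answered in a previous run
            result = call_api(prompt)  # hypothetical stand-in for OpenRouter
            f.write(json.dumps({"id": pid, "result": result}) + "\n")
            f.flush()  # persist immediately so a crash loses at most one call
```

Append-only JSONL makes the persistence trivially crash-tolerant: restarting the job re-reads the file, skips finished ids, and continues where it left off.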