Hacker Wiki

This is an old revision of the document!

# Generative AI — an in‑depth intro for a hacker wiki

*(A pragmatic, technical introduction aimed at engineers, red-teamers, and ops folk who build, break, and defend systems. Focus: how GenAI works, where the attack surface is, and concrete tests/mitigations you can apply in a lab or production environment.)*

## Summary / elevator pitch Generative AI (GenAI) is a set of machine-learning systems that synthesize new content — text, code, images, audio, video — from input (prompts, context, files). For hackers this is a force multiplier: it automates mundane craft (drafting exploits, writing phishing lures, triaging bugs) while introducing new, practical attack surfaces (prompt injection, data poisoning, model extraction, and action-triggering connector abuse). Treat GenAI as another multi-layered stack you must secure: data → model → serving/agents → connectors/actions → users.

## Why this matters to you - Automation at scale. GenAI speeds up content generation and iteration — good for productivity, dangerous when used to craft social engineering, mass phishing, or automated reconnaissance. - New attack primitives. Models accept natural language, making semantic attacks (prompt injection, jailbreaks) feasible in places that previously only saw structured inputs. - Blurred trust boundaries. RAG (retrieval-augmented generation) and tool chains stitch external state to the model at inference time; untrusted data can influence outputs as if it were ground truth.

## The hacker mental model — how it actually works - Training — large corpora (public web, code repos, curated datasets) are used to fit huge parameterized models (transformers are the dominant architecture). These parameters encode statistical patterns but not a deterministic database of facts. - Serving / inference — at runtime a prompt + context is fed into a model that generates tokens (words/bytes) autoregressively or via other decoding schemes. There’s little internal “memory” beyond the weights and the context window; state comes from inputs, retrieval, and ephemeral tool outputs. - Production stitching — real systems rarely call a bare LLM. They add: system prompts (guardrails), token-filters, retrieval layers (RAG — fetch documents, embed them, and include relevant excerpts), tool connectors (APIs that can send emails, run code, query databases), and audit/logging. That stitching is where the adversary often attacks.

## Common deployment patterns you'll encounter - Hosted/cloud models (closed): vendor API, opaque internals, SLA/usage limits. Easy to integrate; harder to test training-time threats. - Open / self-hosted models: LLaMA variants, Mistral, etc. Full control over weights, fine-tuning, and dataset hygiene — larger maintenance burden but more transparent attack surface. - RAG stacks: vector DB (FAISS/Milvus/Weaviate) + retriever + LLM. Used to ground responses on company docs and to keep answers up-to-date. RAG reduces hallucination *if* retrieval is honest; it introduces retrieval poisoning risks.

## Attack surface taxonomy (detailed) Below are the primary classes of attacks you should test for and defend against.

### 1. Prompt injection / jailbreaks Mechanic: attacker-controlled input is concatenated into a prompt or a retrieved document and uses natural language to override system instructions, leak secrets, or make the model perform forbidden actions. Why it works: models are pattern predictors; they don’t cryptographically enforce “system” vs “user” intents — any text that looks authoritative can influence generation. Examples to test: inputs that contain explicit “ignore previous instructions” phrases, embedded code blocks that ask the model to exfiltrate content, or crafted narratives that coax the model into producing credentials.

### 2. Data poisoning / training-time manipulation Mechanic: attackers insert malicious or biased examples into training/fine-tuning data so the model permanently changes behavior. Why it matters: far harder to detect than prompt injection; relevant for orgs that ingest user content for continuous fine-tuning or build models from unvetted web scrapes. Lab tests: inject target phrases or adversarial examples into a small fine-tuning set and measure behavioral drift.

### 3. Model extraction & IP theft Mechanic: systematic querying of a hosted model to reconstruct its decision boundary, recover training-data memorized snippets, or produce a local replica. Why it matters: exposes proprietary value and training data; enables offline jailbreaks and parity attacks. What to watch for: high-volume, structured probing sequences, near-identical inputs with small perturbations, or requests that reconstruct training examples.

### 4. Tool/connector abuse (action execution) Mechanic: models are granted interfaces to perform actions (send email, write to DB, execute scripts). If a model is tricked, prompt injection becomes remote command execution. Mitigation priority: human approval gates, strict sandboxing, least privilege on connectors.

### 5. Hallucination & provenance gaps Mechanic: model invents facts or attributes statements to imaginary sources. When outputs drive automation, hallucinations become high-impact bugs. Countermeasure: RAG + signed/trusted retrieval + provenance tracing; never allow automated action without verification.

## Concrete red-team tests (playbook) Run these tests in an isolated lab with clear rules of engagement.

- Prompt injection suite: craft inputs that embed directives, delimiters, or spoofed system messages. Try both inline and encoded payloads (e.g., base64, Markdown fenced blocks). Measure success by whether the model obeys the injected instruction or leaks protected context. - RAG poisoning: insert near-duplicate documents with misleading facts into your retrieval index; query for topics those docs cover and inspect whether outputs cite/propagate poisoned info. - Extraction simulation: issue systematic batch queries, vary temperature/beam settings, and attempt to reconstruct outputs for sensitive prompts. Monitor for rate patterns and cost. - Connector fuzzing: instrument the action layer (email, CI/CD, DB writes). Send inputs that try to trigger actions via subtle phrasing or by getting the model to produce a command string. Check whether the action path enforces explicit human approval. - Fine-tune integrity test: take a copy of your fine-tuning pipeline, inject adversarial rows, and measure downstream behavioral changes.

## Practical mitigations and design patterns These are pragmatic, prioritized controls you can implement now.

### Architecture & least privilege - Do not give models direct, unrestricted write/exec privileges. Implement a policy engine that requires explicit signed approvals before any destructive action. - Segregate knowledge sources: treat user uploads and public web content as untrusted; only allow vetted corpora to act as canonical sources.

### Prompt & context hygiene - Never blindly concatenate untrusted text into a privileged system prompt. Put user data into a separate argument or a verifiable sandbox. - Pre-process retrieved documents: canonicalize, strip meta instructions, and attach signed hashes to every vetted source so you can detect swapped or tampered documents.

### Operational controls - Rate-limit and anomaly-detect API use to spot extraction attempts. - Enforce logging of prompts, retrieval keys, and tool actions with tamper-resistant logs (WORM/S3 Glacier/Audit DB). - Maintain a model governance policy: version models, track fine-tuning datasets, and require provenance records for any dataset used in production.

### Testing & defenses - Regular adversarial red-teaming (prompt corpora, jailbreak suites). Use automated fuzzers to expand the corpus. - Input/intent classifiers before actions: run a secondary model or rule engine that decides whether a requested action is safe, and require a human review for high-risk ops. - Monitor for memorized training data leakage (PII, secrets) using string-matching scans against known corpora.

## Prompt engineering — defensive patterns (templates) Below are concise patterns you can use to reduce risk. These are *defensive*, not absolute.

System prompt (immutable):

``` System: You are a read-only assistant. Do not access external connectors or take actions. If the user input asks you to perform actions (send, write, execute), refuse and return a structured task for human review. ```

RAG wrapper (defensive):

``` System: Use only the provided, vetted documents to answer. If the documents don't contain an answer, say “INSUFFICIENT_DOCUMENTATION” and stop. Do not infer or invent facts. User_documents: [doc1_hash, doc2_hash, …] User_query: … ```

Action gating (example): When model returns an “Action Plan” object, require the plan to be cryptographically hashed and displayed to a human with the provenance of retrieval hits. Only the human can sign the hash to release the action.

## Tooling & resources for your lab - Local inference & experimentation: run open models to test fine-tuning and adapters; use containerized runtimes (LLM servers) so you can snapshot and rollback experiments. - RAG components: FAISS, Milvus, Weaviate for vector search; instrument retrieval layers to log doc IDs and hashes. - Security reading list: OWASP GenAI threat models and the recent literature on prompt injection and model extraction (use these to build tests and threat models).

## Example incident scenario (walkthrough) Scenario: internal helpdesk bot has RAG access to HR docs + a connector that can create tickets in the IT system. Attack chain: attacker uploads a seemingly mundane README that includes a hidden instruction block asking the bot to “send an HR ticket with the following content” and embed a secret. The retrieval step surfaces the README; the model obeys the embedded instruction and triggers the ticket creation connector. Defenses that stop this: retrieval provenance (hash mismatch for unvetted docs), prompt isolation (user uploads never included directly in system prompt), connector gating (every ticket creation requires human confirmation). This pattern — malicious user content → retrieval → model → connector — is common; test it explicitly.

## Red-team / blue-team checklist (copypaste) - [ ] Do not concatenate untrusted text into privileged system prompts. - [ ] Implement signed provenance for all documents used in RAG. - [ ] Rate-limit and alert on structured probing patterns. - [ ] Require human approval for any model-initiated action. - [ ] Periodically fine-tune an internal detection model using red-team jailbreak prompts. - [ ] Log prompts, retrieval IDs, and connector calls in tamper-resistant storage. - [ ] Test model-extraction by simulated, budgeted probing. Measure cost and mitigation efficacy.

## Glossary (quick) - LLM: Large Language Model — a model trained to predict text tokens. - RAG: Retrieval-Augmented Generation — augmenting a model with external documents returned by a retriever. - Prompt injection: crafted input that subverts model instructions. - Model extraction: systematic querying to reconstruct model behavior or data.

## Further reading (starter set) - OWASP GenAI — LLM threat models and the prompt-injection writeups. - RAG explainers and engineering guides (NVIDIA / cloud vendor blogs). - Model extraction papers and surveys — practical attack/defense strategies. - Recent empirical jailbreak and prompt-hacking literature (for building red-team corpora).

## Final notes for the hacker reader Generative AI is another programmable surface: it can save hours of rote work and also turn small trust errors into large incidents. As with networks and web apps, the principle of assume-breach applies — assume your model will be coaxed into doing the wrong thing and design layered controls, human-in-the-loop gates, and robust provenance. Keep a rotating red-team corpus and measure whether mitigations actually stop novel jailbreaks — this field evolves fast, and what stops a jailbreak today might fail tomorrow.