=== Large Language Models (LLMs) — an in‑depth guide for hackers === **Summary:** This page breaks down what modern Large Language Models (LLMs) are, how they’re built and run, what they *actually* do under the hood, common attack/surfacing vectors, practical tooling and prompt‑engineering tips, and ethical / defensive considerations. Target audience: engineers, security researchers, and curious hackers who want a technical — practical — usable explanation. ---- == What is an LLM? == A **Large Language Model** is a statistical model trained to predict and generate human language. Concretely, an LLM maps a sequence of tokens (text pieces) to a probability distribution over the next token and can be sampled repeatedly to produce sentences, code, or other text. Modern LLMs are almost always deep neural networks (largely Transformer architectures) trained on very large text corpora. Why hackers care: * They automate code generation, information extraction, fuzzing helpers, and triage. * They expose new attack surfaces (prompt injection, model hallucination, data leakage). * They can be fine‑tuned or adapted to specific tasks — useful, but also risky. == Basic components == * **Tokenizer** — splits raw text into tokens. Tokens are subword units (BPE / SentencePiece / Unigram). Tokenization determines model input length and how text is represented. * **Embedding layer** — maps tokens to vectors. * **Transformer blocks** — stacks of multi‑head self‑attention + feedforward layers. Attention lets the model weigh different positions in the sequence. * **Output head (softmax)** — converts final hidden state back into probabilities over the vocabulary. * **Loss & training loop** — usually cross‑entropy on next‑token prediction, sometimes augmented with auxiliary losses (masking, contrastive learning). * **Decoding** — sampling strategies: greedy, beam, top‑k, top‑p (nucleus), temperature control. Decoding strategy strongly shapes output behavior. == Tokenization, practically == Tokens shape cost, throughput, and behavior. * Example: "Hacker" may be one token, but "hacking" could be tokenized as "hack" + "ing". Non‑ASCII or domain‑specific text often tokenizes into many tokens (expensive). * Always measure token counts: many APIs charge by token. == How they learn — training at a glance == 1. Collect huge text corpus (web scrape, books, code). 2. Preprocess and tokenize. 3. Train with gradient descent on GPUs/TPUs. Typical training mixes long context windows, weight decay, learning‑rate schedules, and huge batch sizes. 4. Optionally fine‑tune on labeled data (supervised or RLHF — reinforcement learning from human feedback). /// RLHF (short) /// RLHF is a supervised fine‑tuning pass where humans rank model outputs; a reward model is trained and used to optimize model outputs with policy optimization (PPO variants). RLHF changes behavior to appear more helpful and safe, but it’s brittle and can be gamed. == What LLMs "know" and hallucinate == LLMs are **pattern learners**, not symbolic reasoners. That yields: * **Memorization** — frequent or unique phrases can be regurgitated verbatim (privacy risk). * **Interpolation** — plausible but invented facts (hallucinations). * **Surface reasoning** — they can often emulate reasoning by chaining learned patterns, but deep, systematic reasoning or multi‑step arithmetic is unreliable unless explicitly engineered with scaffolding. **For hackers:** never treat an LLM output as authoritative. Validate outputs, especially for code, credentials, or instructions. == Prompts, few‑shot, and chain‑of‑thought == * **Zero‑shot** — plain instruction. * **Few‑shot** — include examples in the prompt to steer format and style. * **Chain‑of‑Thought (CoT)** — requesting stepwise reasoning can improve multi‑step tasks but may increase hallucination risk. For sensitive tasks, prefer verifiable intermediate checks. Tactics: * Be explicit about output format (JSON, YAML, code blocks). * Provide constraints (token limits, allowed libraries). * Use role prompts (e.g., **You are a senior reverse engineer...**). * Force verification: ask the model to run basic sanity checks on its answer (for example, ask it to explain why each step is safe or to provide test cases). == Fine‑tuning and adapters == Fine‑tuning adapts a base LLM to a domain: * **Full fine‑tune** — retrain some or all weights. Powerful but expensive and can overfit. * **Parameter‑efficient methods** — LoRA, adapters — inject small, trainable modules for cheap adaptation. * **Embedding + retrieval** (RAG) — keep the LLM frozen and combine it with a vector DB and retrieval to ground outputs in documents. This reduces hallucinations and helps add up‑to‑date facts. For hackers: LoRA makes local adaptation feasible on modest GPU hardware; RAG is great for building knowledge‑grounded assistants without retraining. == Embeddings and retrieval == * **Embeddings** map text to vectors for similarity search. * Typical pipeline: index document chunks → embed queries → nearest neighbor search → feed retrieved chunks into prompt (RAG). Tricks: * Chunk size affects recall and context window usage. * Use metadata (source, timestamp) to increase traceability. * Score and include provenance when returning results. == Security and attack vectors (practical) == 1. **Prompt injection** — attacker‑controlled content in the model context tries to override system instructions. *Mitigations:* Sanitize or isolate user data; never include raw user text in system‑level prompts. Use retrieval filters and treat all retrieved text as *untrusted*. 2. **Data leakage / memorization** — highly unique secrets in training data can be regurgitated. *Mitigations:* Don’t feed secrets into prompts; redact; use differential privacy in training where possible. 3. **Model extraction** — an attacker queries many times to reconstruct model behavior or weights. *Mitigations:* Rate limit, monitor unusual query patterns, add noise to outputs if appropriate. 4. **Jailbreaking** — attempts to coerce the model to violate safety or policy. *Mitigations:* Multi‑level defenses (model‑side filters + prompt‑side constraints). Use a separate pessimistic safety‑checker model on outputs. 5. **Poisoning** — sabotage training or fine‑tuning data. *Mitigations:* Data provenance, curated sources, anomaly detection. == Practical tooling & workflows for hackers == * **Local inference:** small/medium LLMs can run locally with Hugging Face transformers, GGML runtimes, and quantized weights. Good for experimentation and offline attacks. * **APIs:** hosted models give scale and performance but expand attack surface and privacy concerns. * **Vector DBs:** FAISS, Milvus, Weaviate for retrieval setups. * **Prompt testing harnesses:** write unit tests for prompts — deterministic expected outputs and regression tests. * **Red‑team setups:** automated fuzzers that mutate prompts and payloads to find jailbreaks or hallucination triggers. == Examples (useful patterns) == === Structured output === Use an instruction + strict output schema to reduce ambiguity:
You are a JSON generator. Extract the following fields from the text and return valid JSON only:

{
  "title": string,
  "date": "YYYY-MM-DD",
  "emails": [string],
  "summary": string
}
=== Retrieval‑augmented prompt pattern === *Step 1:* Retrieve top 3 document chunks for query. *Step 2:* Prompt:
Context:
[DOC 1]
[DOC 2]
[DOC 3]

Task: Using only the information above, answer the question and cite the doc id(s) that support each claim.
== Evaluation and metrics == Common signals: * **Perplexity** — model fit (lower is better) but not human usefulness. * **BLEU / ROUGE** — compare to references (limited). * **Human eval / A/B testing** — still gold standard for quality and safety. * **Safety checks** — automated classifiers to flag toxicity, hallucination, or PII leakage. == Ethics & responsible use == * Don’t use LLMs to create disinformation, targeted harassment, or to automate cyberattacks. * When publishing outputs based on LLMs, disclose usage and provenance. * Treat LLM outputs as *assistive*, not authoritative — always verify. == Quick checklist for deploying an LLM‑powered tool == * Does the tool leak training data or user input? Redact where needed. * Are there rate limits, monitoring, and alerting on anomalous usage? * Do you have a retrieval pipeline with provenance for factual claims? * Can you rollback or disable the model quickly if abused? * Do you have legal / compliance coverage for data retention and privacy? == Further experiments & learning projects == Ideas hackers enjoy: * Create a prompt fuzzing harness that mutates system prompts and looks for jailbreaks. * Build a small LoRA adapter for a code‑completion task and compare to base model. * Implement a RAG pipeline with FAISS and measure hallucination rate vs. a plain LLM. * Try token‑level attacks: craft inputs that split tokens oddly to influence decoding. == Closing notes == LLMs are powerful pattern machines — extremely useful as assistants, coders, and research tools — but their weaknesses (hallucination, memorization, injection vulnerabilities) are real and exploitable. Treat outputs with suspicion, design with defense in depth, and instrument everything.