The Problem Nobody in AI Is Building For
In a mass casualty event — an earthquake, a flood, a building collapse — first responders face decisions that kill or save children in seconds. JumpSTART triage asks: is the child breathing? What is the respiratory rate? Is the radial pulse present? Depending on the answers, the child gets tagged RED, YELLOW, GREEN, or BLACK — each color dictating the immediacy of intervention.
The protocols themselves are well established: JumpSTART (developed by Lou Romig, MD), PALS from the American Heart Association, SALT mass casualty triage, and the Broselow Pediatric Emergency Tape. The problem is not the protocols. The problem is the gap between protocol knowledge and field availability.
In disaster zones, connectivity collapses. There is no 4G. There is no cloud API to call. The volunteer sitting next to a 6-year-old pulled from rubble does not have a pediatric emergency physician on the phone. They have whatever knowledge they carried into the field — and whatever tools will run on their device without internet.
The field device may be a ruggedized Android tablet, a mid-range smartphone, or an older laptop. RAM is constrained. Battery is limited. Connectivity is zero. The model must run locally, instantly, and with enough medical accuracy to meaningfully guide triage decisions without expert oversight on-site.
General-purpose LLMs like GPT-4 or Claude are not the answer here. They require internet. They are too large for on-device inference. And they are trained on broad medical text, not on the specific decision trees and age-weight-dose relationships that emergency responders need in the field. This is a case where a small, specialized, offline model beats a large, general, cloud-hosted one — by a wide margin.
Why This Is an Edge AI Problem, Not a RAG Problem
The first instinct of most AI engineers facing a "domain knowledge" problem is to reach for Retrieval-Augmented Generation: embed the protocols, store them in a vector database, retrieve relevant chunks at query time. RAG is the right tool for many problems. This is not one of them.
RAG requires a retrieval infrastructure. In the field, there is no vector database server. There is no embedding API. There is no network to reach either. More fundamentally, triage decisions under stress need to be instantaneous — the cognitive overhead of a responder reading retrieved chunks and synthesizing them into an action is friction that costs lives. The model needs to know the protocols the way a trained paramedic knows them: internalized, retrievable under pressure, structured as actionable outputs.
That means fine-tuning. Knowledge baked into weights. Zero retrieval overhead. Zero connectivity dependency.
Choosing the Right Base Model: Why LFM2.5-1.2B
The base model choice is the most consequential decision in an edge AI project. Choose wrong and the model is either too large to run on-device or too weak to produce medically coherent outputs. We evaluated the constraint space before touching a single line of training code.
We chose LiquidAI/LFM2.5-1.2B-Instruct, a Liquid Foundation Model. LFM2.5 is not a standard transformer. It is a hybrid architecture combining Liquid Neural Network (LNN) principles with structured state-space components — specifically, LIV (linear input-varying) convolution operators alongside grouped-query attention (GQA). This architecture makes two things true simultaneously that are usually in tension: strong reasoning coherence and extreme parameter efficiency.
What LFM2.5 Gets Right for Edge Deployment
Standard transformer models at 1B parameters degrade quickly on multi-step clinical reasoning. The attention patterns that enable coherent chain-of-thought in larger models simply do not form reliably at this scale in a vanilla architecture. LFM2.5 addresses this through its hybrid LNN-attention design: the LIV convolution layers provide sequential state tracking that augments attention, giving the model a form of recurrence that preserves logical chain across steps even at small scale.
For our use case — where responses must walk through triage category, immediate actions, and explicit contraindications in a structured format — this sequential coherence matters enormously. The model needs to reason: child is not walking → check breathing → breathing present → check rate → 36/min is elevated for age 5 → RED triage. That is a five-step conditional chain that 1B-parameter vanilla transformers often break.
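Because the chain is deterministic, it can be sanity-checked in code. The sketch below is illustrative only, not a medical implementation: it collapses JumpSTART's apnea branch (airway repositioning, pulse check, rescue breaths) into a single step, and the one age-specific respiratory range shown comes from the dataset example quoted later in this article.

```python
def triage_sketch(walking, breathing, resp_rate, age_years,
                  radial_pulse, alert):
    """Illustrative sketch of the conditional chain described above.

    NOT the full JumpSTART algorithm: the apnea branch is collapsed
    into one step, and only one age band is encoded.
    """
    normal_rr = {5: (20, 30)}  # normal range for age 5, per the dataset
    if walking:
        return "GREEN"
    if not breathing:
        return "RED"  # full protocol: airway maneuvers first, else BLACK
    lo, hi = normal_rr.get(age_years, (15, 45))  # JumpSTART fallback range
    if resp_rate < lo or resp_rate > hi:
        return "RED"
    if not radial_pulse:
        return "RED"
    if not alert:
        return "RED"
    return "YELLOW"

# The five-step chain from the text:
# not walking -> breathing -> RR 36/min elevated for age 5 -> RED
assert triage_sketch(False, True, 36, 5, True, True) == "RED"
```

Framed this way, the protocol doubles as an automated evaluator: for any scenario with structured observations there is exactly one correct tag to compare the model's output against.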
The second advantage is quantization tolerance. LFM2.5-1.2B quantizes well to Q4_K_M GGUF format, maintaining output quality at under 1GB RAM footprint — a hard requirement for sub-2GB RAM field devices.
| Model | Params | GGUF (Q4_K_M) | Edge Viable | Clinical Coherence |
|---|---|---|---|---|
| LFM2.5-1.2B | 1.2B | ~750MB | Yes | Strong (hybrid arch) |
| Llama 3.2-1B | 1B | ~650MB | Yes | Moderate |
| Qwen2.5-1.5B | 1.5B | ~950MB | Marginal | Moderate |
| Phi-3.5-mini | 3.8B | ~2.4GB | No | Strong |
| Llama 3.1-8B | 8B | ~5.0GB | No | Very Strong |
The Dataset: Encoding Protocols as Conversations
Fine-tuning quality is determined by dataset quality more than any hyperparameter choice. For a domain as narrow and high-stakes as pediatric disaster triage, this meant constructing the dataset by hand — not scraping medical text, not using LLM-synthesized pairs without expert review.
The dataset is structured in the ChatML multi-turn conversation format that LFM2.5-Instruct was instruction-tuned on. Each example has three roles: a system message encoding the assistant's persona and constraints, a user message presenting a field scenario or protocol question, and an assistant response structured as: triage category, rationale, immediate actions, explicit contraindications (Do NOT), and reassessment triggers.
```jsonc
// pediatric_disaster_poc.jsonl — example structure
{
  "messages": [
    {
      "role": "system",
      "content": "You are a pediatric disaster response assistant trained on JumpSTART triage, PALS protocols, and field emergency medicine. You help first responders make fast, accurate decisions for injured children in offline, resource-limited environments. Always state triage category, immediate action, and what NOT to do."
    },
    {
      "role": "user",
      "content": "5-year-old pulled from earthquake rubble. Right leg crushed, significant bleeding. Screaming, alert, breathing fast at 36/min, radial pulse weak but present."
    },
    {
      "role": "assistant",
      "content": "TRIAGE: RED (Immediate)\nRationale: Crush injury with significant bleeding and compensated shock. RR 36/min elevated for age (normal 5yr: 20-30/min) ...\nImmediate Actions:\n1. Apply tourniquet above crush injury — note time on skin\n2. Direct pressure to additional bleeding points ...\nDo NOT: Remove tourniquet once applied. No food or water. Do not straighten crushed limb.\nCrush Syndrome Warning: reperfusion arrhythmia risk at extrication."
    }
  ]
}
```
The ~60 examples in the PoC dataset cover the full scope of field scenarios: JumpSTART algorithm decision paths, vital sign reference tables by age group (newborn through adolescent), specific disaster types (earthquake crush injury, flood near-drowning, fire burns and inhalation, chemical exposure with decontamination protocol), CPR technique differences between infants and children, weight-based dosing via the Broselow tape, shock recognition and progression, pediatric spinal precautions, and psychological first aid for uninjured but distressed children.
Token lengths across the dataset range from 282 to 490 tokens per example — all well within the 1,024 token training window. This is intentional: responses are structured to be field-readable under stress, not exhaustive clinical references.
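A small validation pass before training catches format drift early. The sketch below is illustrative: it checks role order and approximates token counts with a characters-divided-by-four heuristic rather than the model's real tokenizer, which is enough to flag records that risk blowing the 1,024-token window.

```python
import json

def validate_record(line, max_tokens=1024):
    """Sanity-check one JSONL record against the structure described above.

    Token count is approximated as total characters / 4 -- a rough
    heuristic, not the model's actual tokenizer.
    """
    rec = json.loads(line)
    roles = [m["role"] for m in rec["messages"]]
    assert roles == ["system", "user", "assistant"], f"bad role order: {roles}"
    approx_tokens = sum(len(m["content"]) for m in rec["messages"]) // 4
    assert approx_tokens <= max_tokens, f"record too long: ~{approx_tokens} tokens"
    return approx_tokens
```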
Every assistant response follows a strict output schema: TRIAGE category first, rationale second, numbered immediate actions, explicit Do NOT list, reassessment triggers. This is not stylistic — it is a trained behavioral constraint. A model that buries the triage category in paragraph four of its response fails in the field even if the medical content is accurate.
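The schema can also be enforced mechanically after generation. A minimal checker, assuming the section labels shown in the dataset example (TRIAGE, Rationale, Immediate Actions, Do NOT), might look like:

```python
REQUIRED_SECTIONS = ["TRIAGE:", "Rationale:", "Immediate Actions:", "Do NOT:"]

def check_schema(response: str) -> bool:
    """Verify a response follows the trained output schema: TRIAGE first,
    then rationale, numbered actions, and an explicit Do NOT list."""
    positions = [response.find(s) for s in REQUIRED_SECTIONS]
    if -1 in positions:
        return False  # a required section is missing entirely
    # all sections present, in order, with TRIAGE at the very start
    return positions[0] == 0 and positions == sorted(positions)
```

A check like this is useful both for dataset linting before training and as a behavioral regression test after it.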
Why Unsloth: The Engineering Case for 60% Less VRAM
Once the dataset is ready, the training question becomes: how do you fine-tune a model efficiently on a single commodity GPU — or on a cloud GPU instance you are billed by the minute for? This is where Unsloth's engineering makes a material difference.
What Unsloth Actually Does
Unsloth is not just a wrapper around Hugging Face Transformers. It is a set of hand-written, fused Triton GPU kernels and memory management techniques that significantly reduce the VRAM footprint and wall-clock time of LoRA fine-tuning. The headline numbers — 60% less VRAM, 2× faster training — hold in practice on small-to-medium PoC datasets like ours.
The key optimizations are:
- Fused cross-entropy loss: Standard loss computation materializes the full logit matrix (vocab_size × sequence_length) in GPU memory before computing loss. Unsloth computes the loss in chunks, never materializing the full matrix — dramatically reducing activation memory at the backward pass.
- Custom gradient checkpointing: The `use_gradient_checkpointing="unsloth"` flag enables a tuned selective-recomputation strategy. Rather than checkpointing every layer (slow) or no layers (high VRAM), it recomputes selectively and offloads intermediate activations, cutting peak VRAM on long chat-format sequences with variable-length turns.
- FastLanguageModel loading: Unsloth's `FastLanguageModel.from_pretrained()` patches attention layers at load time with Triton-optimized kernels, replacing the standard Hugging Face attention implementation with fused forward/backward passes.
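The memory effect of chunked loss computation can be illustrated in plain NumPy. The sketch takes a precomputed logit matrix for clarity (Unsloth's production kernels avoid ever materializing it), but the numerical idea is the same: identical results with peak activation memory proportional to the chunk size rather than the sequence length.

```python
import numpy as np

def full_ce(logits, targets):
    # naive: the whole (seq_len, vocab) log-softmax lives in memory at once
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return (lse - logits[np.arange(len(targets)), targets]).mean()

def chunked_ce(logits, targets, chunk=64):
    # identical result, but temporaries only ever cover `chunk` positions
    total = 0.0
    for s in range(0, len(targets), chunk):
        lg, tg = logits[s:s + chunk], targets[s:s + chunk]
        m = lg.max(axis=1, keepdims=True)
        lse = m[:, 0] + np.log(np.exp(lg - m).sum(axis=1))
        total += (lse - lg[np.arange(len(tg)), tg]).sum()
    return total / len(targets)
```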
For LFM2.5's hybrid architecture, there is an additional consideration. The LIV convolution layers and GQA attention require explicit LoRA target module selection — the default "all-linear" targeting used for standard transformers misses the architecture-specific projection layers. We target eight specific modules:
```python
# LFM2.5-specific LoRA target modules
# Covers: GQA attention projections + LIV convolution gates + FFN
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=[
        "q_proj", "k_proj", "v_proj",  # GQA attention
        "out_proj",                    # attention output
        "in_proj",                     # LIV input gate
        "w1", "w2", "w3",              # FFN SwiGLU projections
    ],
    lora_alpha=16,   # alpha = r: no scaling, per Unsloth recommendation
    lora_dropout=0,  # no dropout — small dataset, regularize elsewhere
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
```
Train on Responses Only
One of the most important and often skipped SFT practices is masking the loss on non-assistant tokens. When training on chat-formatted data, the model sees system prompts and user messages in every forward pass. If we compute cross-entropy loss across all tokens, we are training the model to predict the user's question given the system prompt — exactly the wrong objective. We want the model to learn to produce high-quality assistant responses given the context.
The train_on_responses_only() helper (from Unsloth's chat-template utilities, applied on top of the TRL trainer) performs this masking by setting the labels of all non-assistant tokens to -100 (ignored in cross-entropy). With our dataset's specific ChatML delimiters, this is configured as:
```python
# Mask system + user tokens; only compute loss on assistant responses
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
```
On a dataset of 60 examples with ~400 tokens each, this effectively reduces the number of gradient-informative tokens per step by roughly 60% — the system and user portions are masked. This is not a problem; it means the gradient signal is more concentrated on the behavior we actually care about.
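What the masking does to the labels can be shown in a few lines. This is a toy sketch: real implementations locate assistant spans by scanning for the chat-template delimiters rather than carrying explicit per-token role tags.

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy skips targets with this value

def mask_non_assistant(token_roles, input_ids):
    """Build labels so that only assistant tokens contribute to the loss.

    Toy sketch: `token_roles` tags each position's origin explicitly;
    production code derives the spans from the ChatML delimiters.
    """
    return [tok if role == "assistant" else IGNORE_INDEX
            for role, tok in zip(token_roles, input_ids)]
```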
Why Hugging Face: The Platform Decision
Choosing Hugging Face as the training and deployment platform is not just a convenience decision — it is a complete MLOps architecture choice. For a project like this, HF solves three separate problems: compute access, model versioning, and deployment artifact management.
HF Jobs: Serverless GPU Training Without Infrastructure
Running fine-tuning locally requires owning or renting a GPU. For a PoC dataset of 60 examples training for 3 epochs on a 1.2B model, the training completes in approximately 15 minutes on an NVIDIA A10G (24GB VRAM). HF Jobs provides exactly this GPU class as a managed serverless resource:
```bash
# Launch training on HF's managed a10g-small GPU
# No infrastructure setup. Billed per minute.
hf jobs uv run scripts/sft_pediatric_lfm.py \
  --flavor a10g-small \
  --secrets HF_TOKEN \
  --timeout 1h \
  -- \
  --dataset data/pediatric_disaster_poc.jsonl \
  --num-epochs 3 \
  --output-repo username/pediatric-disaster-lfm-1.2b
```
The uv run command uses the inline dependency specification at the top of the script — no requirements.txt, no separate environment setup. The dependencies are declared directly in the script header and resolved by uv at runtime:
```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "unsloth",
#     "datasets",
#     "trl==0.22.2",
#     "huggingface_hub[hf_transfer]",
#     "trackio",
#     "tensorboard",
#     "transformers==4.57.3",
# ]
# ///
```
HF Hub: Model Versioning and Artifact Management
After training, the pipeline produces two output artifacts depending on the flags passed:
- LoRA adapter only: a ~60MB adapter checkpoint pushed to HF Hub. Requires the base model to be loaded separately at inference time. Appropriate for development and evaluation environments where the base model is already available.
- Merged 16-bit model + GGUF: the LoRA weights are merged into the base model via `push_to_hub_merged(save_method="merged_16bit")`, then quantized to GGUF Q4_K_M format via `push_to_hub_gguf(quantization_method="q4_k_m")`. The resulting GGUF artifact is ~750MB and directly consumable by llama.cpp for offline inference.
HF Hub also handles metadata tagging automatically — the training script applies tags for pediatric, disaster-response, triage, jumpstart, medical, edge-ai, and lfm — making the model discoverable and communicating its intended use context to anyone who encounters it.
Trackio: Live Training Monitoring
For PoC runs where you want to observe loss curves in real time without spinning up a full MLflow or W&B infrastructure, Trackio provides a lightweight HF Space-based dashboard. When a --trackio-space argument is provided, training metrics stream live to the configured HF Space, giving full observability into training and eval loss curves during the job run.
Training Configuration: The Numbers That Matter
For a 60-example PoC dataset, the training configuration choices have outsized impact on whether the model learns the target behavior or overfits and collapses. Here are the specific decisions and the reasoning behind each.
| Hyperparameter | Value | Rationale |
|---|---|---|
| LoRA rank (r) | 16 | Sufficient expressivity for domain-specific protocols without risk of parameter explosion. Higher rank on tiny datasets accelerates overfitting. |
| LoRA alpha | 16 | alpha = r → scaling factor of 1.0. Unsloth recommends this for stability. Avoids the alpha/r ratio tuning problem entirely. |
| Learning rate | 2e-4 | Standard for LoRA fine-tuning. Higher rates cause loss spikes on small datasets; lower rates under-fit in 3 epochs. |
| Effective batch size | 8 (2 × 4 accum) | 2 per-device batch × 4 gradient accumulation = 8 effective. Stabilizes gradient estimates across mini-batches from a 60-example dataset. |
| Epochs | 3 | The 60-example dataset fits in ~8 gradient steps per epoch. 3 epochs = ~24 updates — enough to shift behavior without memorizing verbatim. |
| Optimizer | AdamW 8-bit | bitsandbytes 8-bit AdamW stores the two Adam moment tensors in 8-bit rather than fp32, cutting optimizer-state VRAM by roughly 75% for the trainable parameters. |
| LR scheduler | Linear decay | Simple and stable for short runs. Cosine decay is marginally better at larger scale but adds negligible value for 24-step training. |
| Max seq length | 1,024 | Dataset token range is 282–490. 1,024 provides comfortable headroom without wasting KV-cache allocation on empty positions. |
| Eval split | Disabled | 60 examples is too small for a meaningful eval split. Withholding even 10% (6 examples) produces an unreliable eval signal. Evaluate behaviorally post-training instead. |
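The step arithmetic behind the batch-size and epoch rows can be checked directly:

```python
import math

dataset_size = 60
per_device_batch, grad_accum, epochs = 2, 4, 3

effective_batch = per_device_batch * grad_accum              # 2 x 4 = 8
steps_per_epoch = math.ceil(dataset_size / effective_batch)  # ceil(60/8) = 8
total_updates = steps_per_epoch * epochs                     # 8 x 3 = 24

print(effective_batch, steps_per_epoch, total_updates)
```

Twenty-four optimizer updates is the regime the table describes: enough to shift behavior, few enough that verbatim memorization of 60 examples is unlikely.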
Cloud Architecture: From Training Pipeline to Field Deployment
Understanding the end-to-end system architecture is essential for anyone considering deploying this pattern in production. There are three distinct zones: the development and training zone, the model registry, and the field deployment zone. The architecture is designed so that once the model artifact lands on a device, it requires no further cloud connectivity.
```text
╔══════════════════════════════════════════════════════════════════╗
║        PEDIATRIC DISASTER RESPONSE — EDGE AI ARCHITECTURE        ║
╚══════════════════════════════════════════════════════════════════╝

┌──────────────────────────────────────────────────────────────────┐
│  ZONE 1 — DEVELOPMENT & TRAINING (Cloud / Dev Machine)           │
└──────────────────────────────────────────────────────────────────┘

  Developer Workstation
  ┌─────────────────────┐
  │ Dataset Authoring   │ ← JumpSTART, PALS, SALT, Broselow protocols
  │ JSONL (ChatML fmt)  │   hand-crafted instruction-response pairs
  │ prepare_dataset.py  │ ← validate format, check token distribution
  └────────┬────────────┘
           │ git push / hf upload
           ▼
  ┌──────────────────────────────────────────────────────────────┐
  │ Hugging Face Hub (Dataset Repository)                        │
  │   pediatric_disaster_poc.jsonl                               │
  └────────┬─────────────────────────────────────────────────────┘
           │ hf jobs uv run sft_pediatric_lfm.py
           │   --flavor a10g-small --timeout 1h
           ▼
  ┌──────────────────────────────────────────────────────────────┐
  │ HF Jobs — Managed GPU Compute (NVIDIA A10G, 24GB VRAM)       │
  │                                                              │
  │  FastLanguageModel.from_pretrained()                         │
  │   └─ LFM2.5-1.2B-Instruct (16-bit, ~2.4GB VRAM)              │
  │  get_peft_model() → LoRA rank=16                             │
  │   └─ trainable params: ~20M of 1.2B (~1.7%)                  │
  │  SFTTrainer (TRL)                                            │
  │   ├─ train_on_responses_only() (loss masked on user/sys)     │
  │   ├─ adamw_8bit, lr=2e-4, effective_batch=8                  │
  │   ├─ 3 epochs (~15 min)                                      │
  │   └─ Trackio live metrics → HF Space dashboard               │
  │  Post-training export:                                       │
  │   ├─ push_to_hub_merged() → merged 16-bit weights            │
  │   └─ push_to_hub_gguf(q4_k_m) → ~750MB GGUF artifact         │
  └────────┬─────────────────────────────────────────────────────┘
           │ model.push_to_hub() / push_to_hub_gguf()
           ▼
┌──────────────────────────────────────────────────────────────────┐
│  ZONE 2 — MODEL REGISTRY (Hugging Face Hub)                      │
└──────────────────────────────────────────────────────────────────┘

  username/pediatric-disaster-lfm-1.2b
  ┌───────────────────────────────────────────────────┐
  │ Artifact 1: LoRA adapter (~60MB)                  │
  │  ├─ adapter_config.json                           │
  │  └─ adapter_model.safetensors                     │
  │ Artifact 2: Merged 16-bit (~2.4GB)                │
  │  ├─ model.safetensors                             │
  │  └─ tokenizer files                               │
  │ Artifact 3: GGUF Q4_K_M (~750MB)  ← PRIMARY       │
  │  └─ pediatric-disaster-lfm-1.2b-Q4_K_M.gguf       │
  │                                                   │
  │ Tags: pediatric · triage · jumpstart · edge-ai    │
  └────────┬──────────────────────────────────────────┘
           │ one-time device sync (Wi-Fi / USB)
           ▼
┌──────────────────────────────────────────────────────────────────┐
│  ZONE 3 — FIELD DEPLOYMENT (Fully Offline)                       │
└──────────────────────────────────────────────────────────────────┘

  On-Device (ruggedized tablet / laptop / Android)
  ┌───────────────────────────────────────────────────┐
  │ llama.cpp inference engine (CPU/GPU agnostic)     │
  │                                                   │
  │  llama-cli                                        │
  │    -m pediatric-lfm-Q4_K_M.gguf                   │
  │    --chat-template chatml                         │
  │    -p "You are a pediatric disaster assistant..." │
  │                                                   │
  │  Runtime footprint:                               │
  │    RAM: ~750MB                                    │
  │    CPU: 4-core ARM / x86 sufficient               │
  │    Connectivity: none required                    │
  │    Latency: 2–5 seconds / response (CPU)          │
  └────────┬──────────────────────────────────────────┘
           │ responder interface (CLI / mobile app)
           ▼
  ┌───────────────────────────────────────────────────┐
  │ Field Responder Query                             │
  │  "8yo, pulled from flood, unresponsive, RR 8/min" │
  │                                                   │
  │ Model Response                                    │
  │  TRIAGE: RED (Immediate)                          │
  │  1. Open airway — head-tilt chin-lift             │
  │  2. Assess breathing after repositioning ...      │
  │  Do NOT: Remove airway support. Do not leave...   │
  └───────────────────────────────────────────────────┘
```
The Offline-First Guarantee
The architecture has a deliberate hard boundary at Zone 3. Once the GGUF artifact is on the device, the system has zero external dependencies. No API keys expire. No cloud endpoints go down. No rate limits apply. The responder's device becomes a fully self-contained medical decision support system. The tradeoff is that model updates require a device sync — but in a disaster response context, protocol updates are infrequent and can be pushed during debrief phases when connectivity is restored.
The GGUF Export Pipeline: From LoRA to llama.cpp
The path from a LoRA adapter to a llama.cpp-compatible GGUF file involves three sequential operations. Understanding each is important if you need to customize the quantization or debug a failed export.
Step 1: LoRA Merge
During training, the LoRA adapter lives as a set of low-rank matrices (A, B per targeted layer) that are added to the frozen base model weights during the forward pass: W_adapted = W_base + (B × A) × (alpha/r). The adapter is small and efficient for training. But for inference, especially on-device inference via llama.cpp, you want the merged weights — a single weight matrix per layer, no runtime addition. Unsloth's push_to_hub_merged(save_method="merged_16bit") performs this merge in fp16 and pushes the full model to HF Hub.
Step 2: GGUF Quantization (Q4_K_M)
Q4_K_M is a 4-bit quantization scheme from the GGUF specification. The "K" denotes k-quantization (a per-block scaling technique that improves accuracy versus simple 4-bit), and "M" denotes medium quality within the K family. For LFM2.5-1.2B at fp16 (~2.4GB), Q4_K_M produces a ~750MB artifact — a 3.2× compression — with perplexity degradation that remains well within acceptable bounds for our factual recall use case.
Q4_K_M is the recommended default for edge deployments that need maximum accuracy within a 1GB RAM budget. Q5_K_M provides marginally better quality at ~950MB. Q8_0 retains near-fp16 quality at ~1.3GB — feasible on devices with 4GB RAM but too large for constrained field hardware. For pediatric triage specifically, Q4_K_M's accuracy loss on numerical recall (vital sign ranges, drug doses) is acceptably low given the structured nature of the training data.
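The per-block-scaling idea behind k-quantization can be shown in miniature. This sketch is deliberately simplified (real Q4_K_M packs blocks into super-blocks with quantized scales and mins), but it demonstrates why a per-block scale bounds the per-element error:

```python
import numpy as np

def quantize_blocks(x, block=32):
    """Symmetric per-block 4-bit quantization (simplified illustration;
    the real Q4_K_M layout is more elaborate)."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / 7.0  # int4 range -8..7
    q = np.clip(np.round(x / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1024).astype(np.float32)
q, s = quantize_blocks(x)
err = np.abs(dequantize_blocks(q, s) - x).max()
assert err <= s.max() / 2 + 1e-6  # error bounded by half the block scale
```

Per-block scales are what separate k-quants from naive 4-bit: one outlier only degrades its own 32-value block instead of the whole tensor.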
Step 3: llama.cpp Inference
On the field device, llama.cpp provides inference without requiring CUDA, PyTorch, or any Python runtime. It compiles to a standalone binary that reads the GGUF file directly. On a 4-core ARM device, response latency is 2–5 seconds for a 200-token assistant response — fast enough for field use.
```bash
# On-device offline inference — no connectivity required
llama-cli \
  -m pediatric-disaster-lfm-1.2b-Q4_K_M.gguf \
  --chat-template chatml \
  -p "You are a pediatric disaster response assistant trained on JumpSTART triage and PALS protocols. You help first responders in offline, resource-limited environments." \
  --color \
  -i  # interactive mode for field queries
```
What's Next: From PoC to Production-Grade
The current state of the project is a validated PoC: dataset created, training pipeline functional, export pipeline working. The gap between this PoC and a deployment-ready system is well-defined and achievable. Here is the path forward.
Evaluation Against Ground Truth
The most critical next step is protocol-adherence evaluation. JumpSTART is a deterministic decision tree — given a specific set of patient observations, there is exactly one correct triage category. This makes automated evaluation tractable: generate a held-out test set of 50+ scenarios with known correct triage classifications, run the model, and measure exact-match accuracy on the triage category. Secondary metrics include Do NOT compliance (does the model hallucinate contraindicated actions?) and vital sign numerical accuracy (are the stated normal ranges correct for the specified age group?).
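Because the output schema puts the triage category on the first line, the exact-match metric is a few lines of code. A sketch of the scorer, assuming responses follow the trained format:

```python
def triage_accuracy(predictions, gold_categories):
    """Exact-match accuracy on the triage tag, parsed from the first
    line of each structured response, e.g. "TRIAGE: RED (Immediate)"."""
    def extract(response):
        first_line = response.splitlines()[0]
        return first_line.replace("TRIAGE:", "").strip().split()[0].upper()
    hits = sum(extract(p) == g for p, g in zip(predictions, gold_categories))
    return hits / len(gold_categories)
```

The same parsed tag feeds the secondary metrics: a response whose first line cannot be parsed at all is itself a schema failure worth counting.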
Dataset Expansion via Structured Generation
The 60-example PoC covers core scenarios but underrepresents edge cases: multi-casualty scenarios requiring triage priority decisions between two simultaneous patients, pediatric-specific toxicological exposures, neonatal (under 28 days) edge cases that differ from JumpSTART's 1–8 year scope. Expanding to 300–500 examples — still tractable for manual quality review — would meaningfully improve robustness without changing the training infrastructure.
DPO Alignment Pass
After SFT establishes the base behavior, a Direct Preference Optimization (DPO) pass can tighten behavioral alignment. The preference dataset would consist of response pairs: a correct, well-structured response versus a plausible-but-incorrect one (wrong triage category, missing Do NOT, incorrect age-specific vital range). DPO trains the model to assign higher probability to the preferred response, directly optimizing for the behaviors that SFT only indirectly incentivizes through next-token prediction.
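TRL's DPOTrainer consumes preference data as prompt/chosen/rejected triples. A sketch of one such record for this domain (the clinical content here is illustrative, not expert-reviewed):

```python
import json

# One preference pair in the prompt/chosen/rejected layout used by
# TRL's DPOTrainer. Content is illustrative only.
pair = {
    "prompt": "5-year-old, not walking, RR 36/min, weak radial pulse, alert.",
    "chosen": "TRIAGE: RED (Immediate)\nRationale: RR elevated for age, "
              "compensated shock...\nDo NOT: give food or water.",
    "rejected": "TRIAGE: YELLOW (Delayed)\nRationale: vitals near normal...",
}
line = json.dumps(pair)  # one line of the preference JSONL
assert set(json.loads(line)) == {"prompt", "chosen", "rejected"}
```

The rejected responses are the interesting part to author: each should be the kind of plausible-but-wrong output the SFT model might actually produce.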
Mobile Application Shell
The llama.cpp binary is powerful but CLI-only. Wrapping it in a native mobile application — Android preferred for ruggedized device ecosystem compatibility — adds a structured input interface (checkboxes for observed symptoms, dropdowns for age group), structured output rendering (triage color displayed prominently, actions in numbered list), and session logging for post-incident review. Frameworks like MLC-LLM or llama.cpp's Android Java bindings provide the inference layer; the UI layer is standard Android development.
The Broader Principle
This project is a specific instantiation of a broader architectural principle that I think will define the next wave of AI deployment: not every AI system belongs in the cloud.
The instinct in AI engineering has been to centralize: bigger models, more compute, cloud APIs, managed infrastructure. That instinct is correct for many use cases. But it creates a hidden fragility — a dependency on connectivity and cloud availability that simply does not hold in the environments where AI could do the most good. Disaster response. Remote healthcare. Contested environments. Edge industrial operations.
The combination of efficient small-model architectures (LFM2.5, Llama 3.2, Qwen2.5 at 1–3B), aggressive quantization (GGUF Q4_K_M), fast fine-tuning frameworks (Unsloth), and managed cloud training (HF Jobs) has removed most of the barriers to building these systems. You can train a domain-specialized 1.2B model in 15 minutes on a rented GPU, export it to a 750MB file, and deploy it to a device that works in a field with no connectivity. The full pipeline runs in an afternoon.
The pediatric disaster response model is a PoC. But the architecture it represents — specialized, fine-tuned, quantized, offline-capable — is a pattern that will show up across healthcare, defense, industrial operations, and anywhere the assumption of always-on connectivity fails. Building it matters beyond the specific use case.