# Model Limits & Guarantees
Explicit, honest information about how AI models are used in Rihario. This document explains what guarantees exist, what happens when things fail, and what the models can and cannot do.
## One-Model-Per-Test Rule
Guarantee: Every test run uses exactly ONE reasoning model (GPT-5 Mini) from start to finish.
### What This Means
- No model switching - The same model is used throughout the entire test
- No fallback models - If the model fails, the test fails (with clear error messages)
- Consistent behavior - Same model = same behavior patterns
- Same for all users - Guest and registered tests use the same model
### Why This Matters
- Predictable behavior - You know what to expect
- Consistent costs - No surprise expensive model calls
- Easier debugging - One model to understand, not multiple
- No model-specific quirks - Results aren't affected by switching between models
## Model Architecture

### GPT-5 Mini (Text/Reasoning)
- Used for: All text-based reasoning, planning, and decision-making
- When: Every step of every test
- Retries: Yes, same-model retry (see Retry Policy below)
- Fallback: No - if it fails, the test fails
### GPT-4o (Visual Analysis)
- Used for: Screenshot-based visual issue detection ONLY
- When: Selectively (final steps, layout shifts, errors)
- Retries: No - single attempt only
- Fallback: No - if it fails, visual analysis is skipped
## Retry Policy

### Same-Model Retry Envelope

What it is: A resilience guard for transient API failures (network blips, rate limits).

### How It Works

1. First attempt with GPT-5 Mini
2. If the failure is retryable (429, network error, timeout): wait 200-400ms (randomized backoff), then retry with the SAME model and the SAME prompt
3. Maximum 1 retry (2 total attempts)
4. If the retry fails: the test fails with a clear error message
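The steps above can be sketched as a small wrapper. This is a hypothetical illustration, not Rihario's actual code; `call` and `is_retryable` are assumed callables supplied by the caller:

```python
import random
import time

MAX_RETRIES = 1  # 2 total attempts, per the policy above

def call_with_retry(call, is_retryable, sleep=time.sleep, rng=random.random):
    """Run `call`; on a retryable failure, back off 200-400ms and retry once
    with the same model and the same prompt (the caller closes over both)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return call()
        except Exception as err:
            if attempts > MAX_RETRIES or not is_retryable(err):
                raise RuntimeError(
                    f"GPT-5 Mini call failed after {attempts} attempt(s): {err}"
                ) from err
            sleep(0.2 + 0.2 * rng())  # randomized backoff in [200ms, 400ms]
```

Injecting `sleep` and `rng` keeps the backoff testable and makes the latency bound explicit in one place.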
### What Is Retryable
- 429 (rate limit exceeded) - Too many requests
- Network errors - Connection resets, timeouts, DNS failures
- Timeouts - Request took too long
### What Is NOT Retryable
- 400 (bad request) - Configuration issue, invalid API key or model name
- 401 (unauthorized) - API key is invalid or expired
- Invalid responses - Malformed JSON, unexpected format
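One way to sketch this classification (hypothetical; Rihario's real checks may inspect exception types rather than bare status codes):

```python
RETRYABLE_STATUSES = {429}  # rate limit exceeded; 400/401 are configuration problems

def is_retryable(status_code=None, network_error=False, timed_out=False):
    """Only transient faults qualify for the single same-model retry.
    Invalid responses (malformed JSON) carry no status and are never retried."""
    if network_error or timed_out:
        return True
    return status_code in RETRYABLE_STATUSES
```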
### Guarantee
- Same model used for retry (no model switching)
- Same prompt used (deterministic behavior)
- Maximum 1 retry (bounded added latency of 200-400ms)
## Token Budgets

### Per-Call Limits
Every LLM call has a strict token budget to prevent uncontrolled growth:
| Call Type | Token Budget | Purpose |
|---|---|---|
| Planning | 3000 | Test plan generation |
| Diagnosis | 3000 | UI diagnosis analysis |
| Testability | 2500 | Testability analysis |
| Action Generation | 2000 | Step-by-step actions |
| Cookie Banner | 1500 | Cookie banner detection |
| Error Analysis | 2000 | Error explanation |
| Self-Healing | 2000 | Alternative selector finding |
| Context Synthesis | 2500 | Multi-source context |
| Summary | 2000 | Final result summary |
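These budgets could live in a single lookup table so every call site shares one source of truth (a sketch; the constant and key names are hypothetical, mirroring the table above):

```python
# Hypothetical mirror of the budget table above, in tokens.
TOKEN_BUDGETS = {
    "planning": 3000,
    "diagnosis": 3000,
    "testability": 2500,
    "action_generation": 2000,
    "cookie_banner": 1500,
    "error_analysis": 2000,
    "self_healing": 2000,
    "context_synthesis": 2500,
    "summary": 2000,
}

def budget_for(call_type: str) -> int:
    """Fail fast on unknown call types instead of guessing a limit."""
    if call_type not in TOKEN_BUDGETS:
        raise KeyError(f"no token budget defined for {call_type!r}")
    return TOKEN_BUDGETS[call_type]
```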
### DOM Pruning Rules
To stay within token budgets, DOM snapshots are pruned:
Removed:
- `<script>` tags and their content
- `<style>` tags and their content
- HTML comments
- Excessive whitespace
Kept:
- Interactive elements (buttons, inputs, links)
- Visible content
- Structural information
Truncation:
- Deterministic: truncated from the start, preserving the end
- Tag-boundary aware when possible
- Never random
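A minimal pruning pass along these lines (a regex sketch, not Rihario's implementation; a real HTML parser would be more robust):

```python
import re

def prune_dom(html: str, max_chars: int) -> str:
    """Strip scripts, styles, and comments, collapse whitespace, then
    deterministically truncate from the start, preserving the end."""
    html = re.sub(r"<script\b[^>]*>.*?</script>", "", html, flags=re.S | re.I)
    html = re.sub(r"<style\b[^>]*>.*?</style>", "", html, flags=re.S | re.I)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    html = re.sub(r"\s+", " ", html).strip()
    if len(html) <= max_chars:
        return html
    cut = html[-max_chars:]
    # Tag-boundary aware: if the cut begins mid-tag, drop the partial tag.
    lt, gt = cut.find("<"), cut.find(">")
    if gt != -1 and (lt == -1 or gt < lt):
        cut = cut[gt + 1:]
    return cut
```

Because the same input always yields the same output, pruned prompts stay reproducible across runs.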
### Context Limiting
History:
- Limited to last 5 steps for action generation
- Older steps are discarded
Elements:
- Limited to the 50-60 most relevant elements
- Hidden elements filtered out
- Prioritized by interactivity
DOM Snapshots:
- Pruned before inclusion in prompts
- Maximum length enforced per call type
- Deterministic truncation
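The limits above could be applied in one place before prompt assembly (hypothetical shape; the `visible`/`interactive` flags are assumed element attributes):

```python
MAX_HISTORY_STEPS = 5
MAX_ELEMENTS = 60

def limit_context(steps, elements):
    """Keep only the last 5 steps and at most 60 visible elements,
    interactive elements first, so context size stays bounded."""
    recent = steps[-MAX_HISTORY_STEPS:]          # older steps discarded
    visible = [e for e in elements if e.get("visible", True)]
    visible.sort(key=lambda e: not e.get("interactive", False))  # interactive first
    return recent, visible[:MAX_ELEMENTS]
```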
## What Happens on API Failure

### GPT-5 Mini Failures
Scenario 1: Transient Error (429, network)
- Automatic retry after 200-400ms
- If retry succeeds: Test continues
- If retry fails: Test fails with error message
Scenario 2: Non-Retryable Error (400, 401)
- No retry attempted
- Test fails immediately
- Clear error message logged
### Error Messages

- `GPT-5 Mini API error (400): [details]` - Check API key and model name
- `GPT-5 Mini API authentication failed (401)` - Check OPENAI_API_KEY
- `GPT-5 Mini API rate limit exceeded (429)` - Wait and retry
- `GPT-5 Mini call failed after 2 attempt(s): [details]` - Both attempts failed
### GPT-4o Failures
Behavior:
- No retry attempted
- Visual analysis is skipped
- Test continues without visual issue detection
- Warning logged
Why:
- Visual analysis is optional/selective
- Failures don't block test execution
- Cost optimization (don't retry expensive calls)
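The skip-on-failure behavior amounts to a guarded single attempt (a sketch; `analyze` stands in for whatever wraps the GPT-4o request):

```python
import logging

log = logging.getLogger("rihario")

def maybe_visual_analysis(analyze, screenshot):
    """Single attempt at GPT-4o visual analysis; any failure is logged
    as a warning and the test continues without visual issues."""
    try:
        return analyze(screenshot)
    except Exception as err:  # no retry, no fallback
        log.warning("Visual analysis skipped: %s", err)
        return None
```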
## What the Model CAN Do
- ✅ Generate test actions based on page context
- ✅ Analyze DOM structure and identify interactive elements
- ✅ Detect cookie banners and suggest dismissal strategies
- ✅ Explain test failures in plain English
- ✅ Find alternative selectors when primary fails
- ✅ Synthesize context from multiple sources (DOM, logs, errors)
- ✅ Perform testability analysis
- ✅ Generate structured test plans
## What the Model CANNOT Do
- ❌ Guarantee 100% test success rate
- ❌ Handle all edge cases perfectly
- ❌ Work without valid API key
- ❌ Bypass rate limits
- ❌ Process unlimited context (token budgets enforced)
- ❌ Compare visual baselines (not implemented)
- ❌ Auto-fix code (read-only analysis)
## Token Usage Guarantees
Guaranteed:
- No single call exceeds its token budget
- DOM pruning is deterministic (same input = same output)
- Context is limited to prevent unbounded growth
- Large DOMs don't cause unbounded latency
Not Guaranteed:
- Exact token counts (estimates used)
- Perfect pruning (some edge cases may slip through)
- Zero token waste (conservative estimates used)
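"Estimates used" typically means a character-count heuristic on this order (hypothetical; a real tokenizer such as tiktoken gives exact counts at some CPU cost):

```python
import math

def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """~4 characters per token for English text, rounded up so budget
    checks err on the conservative side."""
    return math.ceil(len(text) / chars_per_token)

def within_budget(text: str, budget: int) -> bool:
    return estimate_tokens(text) <= budget
```

Rounding up is why some tokens are "wasted": the check trips slightly early rather than ever letting a call exceed its budget.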
## Cost Implications

### GPT-5 Mini
- Used for every reasoning step
- Token budgets limit per-call costs
- Retries add minimal cost (same model, same prompt)
- No fallback model costs
### GPT-4o
- Used selectively (not every step)
- Only for visual analysis
- No retries (cost control)
- Failures don't block tests
### Optimization
- DOM pruning reduces input tokens
- History limiting reduces context size
- Selective GPT-4o usage reduces visual analysis costs
## Failure Handling Philosophy

### Fail Fast
- Non-retryable errors fail immediately
- Clear error messages for debugging
- No silent degradation
### Resilient
- Transient errors are retried once
- Same model ensures consistency
- Bounded retry latency (200-400ms)
### Honest
- Errors are logged with full context
- No false success indicators
- User knows exactly what failed and why
## No Fallback Models

### Why
- Consistency: Same model = same behavior
- Predictability: No model-specific quirks
- Cost control: No expensive fallback calls
- Simplicity: Easier to debug and maintain
### What Happens Instead
- Same-model retry for transient failures
- Clear error messages for permanent failures
- User can retry the test manually
## Limitations

### Known Limitations
- Token budgets may truncate very large DOMs
- Retry only handles transient failures
- GPT-4o failures are near-silent (visual analysis is skipped; only a warning is logged)
- No baseline comparison for visual tests
- Model responses are not cached between tests
### Acceptable Trade-offs
- Deterministic truncation (predictable behavior)
- Bounded retries (controlled latency)
- Selective visual analysis (cost optimization)
- Fail-fast on permanent errors (clear feedback)
## Support & Troubleshooting

### Common Issues
1. "GPT-5 Mini API error (400)"
- Check OPENAI_API_KEY is valid
- Verify model name is 'gpt-5-mini'
- Check API key has proper permissions
2. "GPT-5 Mini API authentication failed (401)"
- OPENAI_API_KEY is invalid or expired
- Regenerate API key in OpenAI dashboard
3. "GPT-5 Mini API rate limit exceeded (429)"
- Wait a few minutes
- Check OpenAI usage limits
- Retry the test
4. Test fails immediately after starting
- Check API key is set in .env file
- Verify network connectivity
- Check OpenAI service status
5. Large DOMs causing issues
- DOM pruning should handle this automatically
- Token budgets prevent unbounded growth
- If issues persist, check DOM size in logs
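Several of these checks can be done locally before spending a test run (a sketch; the function name is hypothetical and the `sk-` prefix check is only a heuristic):

```python
import os

def preflight_errors(env=os.environ):
    """Cheap local checks for the most common setup failures before a
    test run (key presence and shape only; a 401 is still possible)."""
    problems = []
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        problems.append("OPENAI_API_KEY is not set (check your .env file)")
    elif not key.startswith("sk-"):
        problems.append("OPENAI_API_KEY does not look like an OpenAI key")
    return problems
```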
## Summary
- One model per test: GPT-5 Mini for all reasoning
- Retry policy: Same-model retry, max 1 retry, 200-400ms backoff
- Token budgets: Strict per-call limits with DOM pruning
- Failure handling: Fail-fast with clear errors, retry transient failures
- No fallbacks: Consistency over complexity
- Honest limitations: Documented and acknowledged
This architecture prioritizes predictability, cost control, and clear failure modes over complex fallback strategies.