α (Alpha): Learning rate (difficulty-adjusted) - how much new info overrides old
γ (Gamma): Discount factor (0.95) - importance of future rewards
ε (Epsilon): Exploration rate (starts at 1.0, decays based on difficulty)
🔍 Exploration vs Exploitation
The agent uses an epsilon-greedy strategy to balance learning and performance:
Exploration (probability ε, decaying 1.0 → 0.01): Take random actions to discover new paths
Exploitation (probability 1 - ε): Use learned Q-values to take the best known action
Epsilon Decay: Gradually shift from exploration to exploitation as agent learns
Watch the Epsilon (ε) metric decrease over time - as it approaches 0.01, the agent transitions from random exploration to exploiting learned strategies!
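As a rough illustration of the epsilon-greedy strategy described above, here is a minimal Python sketch. The action names, the `(state, action)` dictionary keys, and the multiplicative form of the decay are assumptions made for this example; the project's actual code may differ.

```python
import random

# Assumed action names; the demo's real identifiers are not shown in this document.
ACTIONS = ["up", "down", "left", "right"]

def choose_action(q_table, state, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the best known one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore
    # Exploit: unseen (state, action) pairs count as 0.0, like an all-zero Q-table.
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))

def decay_epsilon(epsilon, decay, floor=0.01):
    """Assumed multiplicative decay, clamped so epsilon never drops below the floor."""
    return max(floor, epsilon * decay)
```

With `epsilon = 1.0` every action is random; with `epsilon = 0.0` the function always returns the highest-valued action for the current state.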
🎯 Reward Structure
The agent receives rewards for its actions:
Hit Wall: -1.0 (strong discouragement)
Normal Move: -0.5 base penalty + directional bonus
Move closer to goal: +0.1 (reward shaping guides learning)
Move away from goal: -0.1 (discourages wrong direction)
Goal Reached: +100 (big success!)
Episode Limit: 1000 steps max to prevent infinite loops
The reward shaping (directional bonus) acts like a compass 🧭, guiding the agent toward the goal while still discovering the optimal path through Q-learning!
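The reward values listed above could be combined along these lines. This is only a sketch: the function names, the `(row, col)` position tuples, and the `hit_wall` flag are hypothetical, and the Manhattan distance is used for the directional bonus because the technical notes mention it.

```python
def manhattan(a, b):
    """Manhattan distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def step_reward(pos, new_pos, goal, hit_wall):
    """Reward for one step, using the values from the Reward Structure list."""
    if hit_wall:
        return -1.0   # bumping into a wall
    if new_pos == goal:
        return 100.0  # goal reached
    # A valid grid move always changes the Manhattan distance by exactly 1,
    # so the move is either closer to the goal or farther from it.
    closer = manhattan(new_pos, goal) < manhattan(pos, goal)
    return -0.5 + (0.1 if closer else -0.1)
```

Under these assumptions a step toward the goal costs about -0.4 and a step away costs about -0.6, so shorter paths still accumulate less penalty overall.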
🧠 How Q-Learning Learns
Initialize: Start with empty Q-table (all values = 0)
Episode Loop: Agent spawns at 🟢 green position
Choose Action: ε-greedy (explore random or exploit best Q-value)
Execute: Move agent, observe reward and next state
Update Q-Value: Apply Bellman equation to learn from experience
Repeat: Until reaching 🔴 red goal or step limit (the agent blinks rapidly between cells while training)
Decay ε: Reduce exploration rate based on maze difficulty
Next Episode: Reset agent, repeat with updated Q-values
Over time, Q-values propagate backward from the goal, creating a "gradient" that guides the agent! You can see this gradient visualized when you toggle "Visualize" (brighter = higher Q-value).
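The "Update Q-Value" step is the standard Q-learning rule: Q(s, a) ← Q(s, a) + α · [r + γ · max Q(s', a') − Q(s, a)]. Below is a minimal sketch of that rule with a dictionary Q-table. The function signature, the `done` flag, and the action names are assumptions for illustration, not the project's actual code.

```python
# Assumed action names, matching a four-direction grid maze.
ACTIONS = ["up", "down", "left", "right"]

def q_update(q_table, state, action, reward, next_state, done, alpha, gamma=0.95):
    """Apply one Q-learning update in place and return the new Q-value."""
    old = q_table.get((state, action), 0.0)
    # Once the episode has ended there is no future value to bootstrap from.
    best_next = 0.0 if done else max(
        q_table.get((next_state, a), 0.0) for a in ACTIONS
    )
    target = reward + gamma * best_next
    q_table[(state, action)] = old + alpha * (target - old)
    return q_table[(state, action)]

q = {}
# Stepping into the goal from the cell next to it (alpha = 0.3): about 30.0
q_update(q, (0, 1), "right", 100.0, (0, 2), True, alpha=0.3)
# One cell further back, the discounted goal value flows in: about 8.43
q_update(q, (0, 0), "right", -0.4, (0, 1), False, alpha=0.3)
```

The second call shows the backward propagation in miniature: the cell two steps from the goal gains value only because its neighbor already holds some.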
✨ Smart Features:
Difficulty-Adjusted Learning: Easy mazes learn faster (higher α), hard mazes explore longer (ε decays more slowly, with a decay factor closer to 1)
Reward Shaping: Direction bonus helps agent find goal faster without knowing maze structure
Demo Mode: Click "Demo" to see pure exploitation (ε=0) - shows what agent truly learned without exploration!
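Pure exploitation amounts to following the highest-valued action from each cell. A minimal sketch, assuming `(row, col)` cells, a set of wall cells, a square grid, and that blocked moves leave the agent in place (all hypothetical details not confirmed by this document):

```python
# Assumed mapping from action name to (row, col) offset.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def greedy_path(q_table, start, goal, walls, size, max_steps=1000):
    """Follow the best-known action from each cell (epsilon = 0)."""
    pos, path = start, [start]
    for _ in range(max_steps):
        if pos == goal:
            break
        action = max(MOVES, key=lambda a: q_table.get((pos, a), 0.0))
        dr, dc = MOVES[action]
        nxt = (pos[0] + dr, pos[1] + dc)
        inside = 0 <= nxt[0] < size and 0 <= nxt[1] < size
        if inside and nxt not in walls:
            pos = nxt  # blocked moves leave the agent where it is
        path.append(pos)
    return path
```

With an untrained (all-zero) table every action ties, so the agent keeps repeating the same move until the step limit. That is why the demo is only meaningful after training.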
Q-Learning Backend: Python (PyScript) with Q-table dictionary storage
Reward Shaping: Manhattan distance bonus to guide learning
Difficulty-Adapted Learning:
Easy: α=0.35, decay=0.96 (fast learning)
Medium: α=0.3, decay=0.98 (balanced)
Hard: α=0.25, decay=0.985 (cautious learning)
Insane: α=0.2, decay=0.99 (extensive exploration)
Animation System: Frame-based tweening (60fps) for smooth movement in demo mode
Visualization: Canvas rendering with Q-value heatmap overlay (blue→yellow gradient)
Training Speed: ~4 episodes for 50% success on easy mode!
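To get a feel for the difficulty presets listed above, the sketch below estimates how many episodes it takes ε to fall from 1.0 to the 0.01 floor. It assumes ε is multiplied by the decay factor once per episode, which this document does not state explicitly; the dictionary layout and function name are illustrative.

```python
import math

# α and decay values copied from the preset list; the structure is an assumption.
PRESETS = {
    "easy":   {"alpha": 0.35, "decay": 0.96},
    "medium": {"alpha": 0.30, "decay": 0.98},
    "hard":   {"alpha": 0.25, "decay": 0.985},
    "insane": {"alpha": 0.20, "decay": 0.99},
}

def episodes_to_floor(decay, start=1.0, floor=0.01):
    """Smallest n with start * decay**n <= floor (per-episode multiplicative decay)."""
    return math.ceil(math.log(floor / start) / math.log(decay))

for name, preset in PRESETS.items():
    print(name, episodes_to_floor(preset["decay"]))
```

Under that assumption, Easy reaches the floor after 113 episodes, while Insane keeps exploring for 459.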
Q-Learning Maze Solver
Watch an AI agent learn to solve mazes using Q-Learning, the classic tabular reinforcement learning algorithm. Perfect introduction to RL fundamentals!
🎯 How to Use
Generate Maze: Click "Generate" to create a random maze. Try different difficulties!
Start Training: Click "Start Training" to begin Q-learning. Watch the 🔵 blue agent blink rapidly between cells as it explores!
Visualize Q-Values: Click "Visualize" to toggle the heatmap overlay. Blue cells = low Q-value | Yellow cells = high Q-value (closer to goal)
Monitor Metrics: Watch Success Rate climb and Epsilon decay as the agent learns
Demo Mode: Once trained, click "Demo" to see smooth, confident movement. The agent moves smoothly (tweened) because it's using pure exploitation (ε=0)
Compare Difficulties: Try Easy (10×10) vs Insane (30×30) to see how learning adapts!
💡 Pro Tip: Training = agent blinks frantically between spots (exploring). Demo = agent glides smoothly (exploiting learned knowledge)!
🌟 Why Q-Learning?
Q-Learning is the perfect introduction to reinforcement learning because:
Intuitive: Easy to visualize and understand
Model-Free: No knowledge of maze structure needed
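Off-Policy: Learns the values of the optimal (greedy) policy even while following an exploratory ε-greedy policy
Guaranteed Convergence: In the tabular setting, proven to converge to the optimal Q-values, provided every state-action pair keeps being visited and the learning rate is reduced appropriately over time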
Foundation for Deep RL: Basis for DQN, Double DQN, etc.
⚠️ Q-Table Limitations
Q-tables work great for discrete, small state spaces like mazes. But they don't scale:
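Memory: A 30×30 maze needs only 900 states × 4 actions = 3,600 Q-values, but the table gains a row for every distinct state, so it grows quickly once mazes get larger or the state includes more than the agent's position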
Continuous States: Can't handle pixel inputs or continuous observations
Generalization: Each state learned independently, no transfer
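The memory arithmetic can be checked with a few lines of Python. The "extra binary flags" case is a hypothetical example (say, a maze variant with collectible keys) added only to show how quickly a table can grow; it is not a feature of this demo.

```python
def q_table_entries(num_states, num_actions=4):
    """Number of Q-values in a full table: one per (state, action) pair."""
    return num_states * num_actions

print(q_table_entries(10 * 10))  # Easy 10x10 maze: 400
print(q_table_entries(30 * 30))  # Insane 30x30 maze: 3600

# Hypothetical: if the state also tracked 10 on/off flags, each cell would
# split into 2**10 distinct states.
print(q_table_entries(30 * 30 * 2**10))  # 3686400
```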
That's why complex games like Mario use neural networks (function approximation) instead of Q-tables. Check out the Neuroevolution example to see how!