PyScript Examples

🎨 Visualization Colors

Here's what each color represents in the maze:

🟢 Green walls - Maze structure boundaries
● Green circle - Start position (where agent begins)
● Red circle - Goal/target position (objective)
● Blue circle - Agent position (AI solver)

Q-Value Heatmap (when "Visualize" is toggled):

Low Q-value (blue) → High Q-value (yellow)

Brighter/yellower cells = agent learned these positions are more valuable (closer to goal)
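
A blue-to-yellow gradient like the one described above could be produced with a simple linear blend. This is a minimal sketch, not the page's actual rendering code: the function names `cell_value` and `q_to_rgb`, the choice of the best action's Q-value as the cell value, and the exact RGB endpoints are all assumptions.

```python
def cell_value(q_values):
    """Value shown for a cell: the best Q-value over its actions (assumed)."""
    return max(q_values.values())


def q_to_rgb(value, q_min, q_max):
    """Linearly blend from blue (low Q-value) to yellow (high Q-value)."""
    if q_max <= q_min:
        t = 0.0  # flat table, e.g. before training: everything is blue
    else:
        t = (value - q_min) / (q_max - q_min)
        t = min(1.0, max(0.0, t))  # clamp to [0, 1]
    level = round(255 * t)
    # Blue is (0, 0, 255); yellow is (255, 255, 0).
    return (level, level, 255 - level)


low = q_to_rgb(0.0, 0.0, 10.0)    # pure blue
high = q_to_rgb(10.0, 0.0, 10.0)  # pure yellow
```

Normalizing against the current minimum and maximum keeps the gradient readable even while the Q-values are still small early in training.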

🎓 What is Q-Learning?

Q-Learning is a value-based reinforcement learning algorithm that learns the optimal action to take in each state by maintaining a Q-table:

Q-Table: Maps state-action pairs to expected rewards (Q-values)
Bellman Equation (update rule): Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)], where the max is taken over all actions a' available in the next state s'
α (Alpha): Learning rate (difficulty-adjusted) - how much new info overrides old
γ (Gamma): Discount factor (0.95) - importance of future rewards
ε (Epsilon): Exploration rate (starts at 1.0, decays based on difficulty)
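
The update rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the page's own backend: the names `make_q_table` and `q_update`, the nested-dictionary layout, and the `done` flag for terminal states are hypothetical.

```python
from collections import defaultdict

ACTIONS = ("UP", "DOWN", "LEFT", "RIGHT")
GAMMA = 0.95  # discount factor stated above


def make_q_table():
    """Q-table as a dictionary: state -> {action: Q-value}, all starting at 0."""
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def q_update(q, state, action, reward, next_state, alpha, gamma=GAMMA, done=False):
    """Apply one Q-learning update and return the new Q(state, action)."""
    # Assumption: no future value is bootstrapped from a terminal state.
    best_next = 0.0 if done else max(q[next_state].values())
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


q = make_q_table()
# Reaching the goal from (0, 0) by moving RIGHT, with the +100 goal reward.
new_value = q_update(q, (0, 0), "RIGHT", 100.0, (0, 1), alpha=0.3, done=True)
```

With α = 0.3 and an empty table, the first goal-reaching update moves Q((0, 0), RIGHT) from 0 to 30, which is 30% of the way toward the target of 100.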

🔍 Exploration vs Exploitation

The agent uses an epsilon-greedy strategy to balance learning and performance:

Exploration (probability ε, decaying from 1.0 to 0.01): Take a random action to discover new paths
Exploitation (probability 1 - ε): Use learned Q-values to take the best known action
Epsilon Decay: Gradually shift from exploration to exploitation as the agent learns

Watch the Epsilon (ε) metric decrease over time - as it approaches 0.01, the agent transitions from random exploration to exploiting learned strategies!
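
A minimal sketch of epsilon-greedy selection and decay follows. The function names, the random tie-breaking, and the assumption that ε is multiplied by a decay factor once per episode are illustrative choices, not confirmed details of this demo.

```python
import random

ACTIONS = ("UP", "DOWN", "LEFT", "RIGHT")
EPSILON_MIN = 0.01  # exploration floor mentioned above


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore
    best = max(q_values.values())
    # Break ties randomly so an all-zero table does not always pick "UP".
    return rng.choice([a for a in ACTIONS if q_values[a] == best])  # exploit


def decay_epsilon(epsilon, decay):
    """Multiplicative decay (assumed once per episode), clamped at the floor."""
    return max(EPSILON_MIN, epsilon * decay)


greedy = choose_action({"UP": 0.0, "DOWN": 1.0, "LEFT": 0.0, "RIGHT": 0.0}, epsilon=0.0)
```

With `epsilon=0.0` the random branch is never taken, so the call above always returns the highest-valued action.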

🎯 Reward Structure

The agent receives rewards for its actions:

Hit Wall: -1.0 (strong discouragement)
Normal Move: -0.5 base penalty + directional bonus
- Move closer to goal: +0.1 (reward shaping guides learning)
- Move away from goal: -0.1 (discourages wrong direction)
Goal Reached: +100 (big success!)
Episode Limit: 1000 steps max to prevent infinite loops

The reward shaping (directional bonus) acts like a compass 🧭, guiding the agent toward the goal while still discovering the optimal path through Q-learning!
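
The reward rules listed above can be written as one small function. This is a sketch under assumptions: the names `manhattan` and `compute_reward` are hypothetical, distance is measured as Manhattan distance, and the goal reward is assumed to replace the move penalty rather than add to it.

```python
def manhattan(a, b):
    """Manhattan distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def compute_reward(pos, new_pos, goal, hit_wall):
    """Reward for attempting to move from pos to new_pos."""
    if hit_wall:
        return -1.0   # wall hit: strong discouragement
    if new_pos == goal:
        return 100.0  # goal reached (assumed to replace the move penalty)
    reward = -0.5     # base penalty for every normal move
    before, after = manhattan(pos, goal), manhattan(new_pos, goal)
    if after < before:
        reward += 0.1  # moved closer to the goal
    elif after > before:
        reward -= 0.1  # moved away from the goal
    return reward


closer = compute_reward((0, 0), (0, 1), goal=(3, 3), hit_wall=False)   # about -0.4
farther = compute_reward((1, 1), (0, 1), goal=(3, 3), hit_wall=False)  # about -0.6
```

Because every non-goal move still costs at least 0.4, the shaping bonus nudges the agent toward the goal without ever making wandering profitable.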

🧠 How Q-Learning Learns

Initialize: Start with empty Q-table (all values = 0)
Episode Loop: Agent spawns at 🟢 green position
Choose Action: ε-greedy (explore with a random action or exploit the best Q-value)
Execute: Move agent, observe reward and next state
Update Q-Value: Apply Bellman equation to learn from experience
Repeat: Until reaching the 🔴 red goal or the step limit (during training the agent appears to blink rapidly between cells)
Decay ε: Reduce exploration rate based on maze difficulty
Next Episode: Reset agent, repeat with updated Q-values

Over time, Q-values propagate backward from the goal, creating a "gradient" that guides the agent! You can see this gradient visualized when you toggle "Visualize" (brighter = higher Q-value).
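
The loop described above can be sketched end to end on a toy problem. To stay short, this example assumes an open 4×4 grid whose only "walls" are its outer edges and omits the directional bonus. The Easy-mode values for α and decay are taken from this page, while `step`, `train`, and the grid layout are hypothetical.

```python
import random

ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
SIZE, START, GOAL = 4, (0, 0), (3, 3)
ALPHA, GAMMA, DECAY, EPS_MIN, MAX_STEPS = 0.35, 0.95, 0.96, 0.01, 1000


def step(state, action):
    """Move in the grid; leaving the grid counts as hitting a wall."""
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    if not (0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE):
        return state, -1.0, False
    if nxt == GOAL:
        return nxt, 100.0, True
    return nxt, -0.5, False


def train(episodes=200, seed=0):
    rng = random.Random(seed)
    # Initialize: every Q-value starts at 0.
    q = {(r, c): {a: 0.0 for a in ACTIONS} for r in range(SIZE) for c in range(SIZE)}
    epsilon, successes = 1.0, 0
    for _ in range(episodes):
        state = START                                  # episode loop: reset agent
        for _ in range(MAX_STEPS):
            if rng.random() < epsilon:                 # choose action (ε-greedy)
                action = rng.choice(list(ACTIONS))
            else:
                action = max(q[state], key=q[state].get)
            nxt, reward, done = step(state, action)    # execute
            best_next = 0.0 if done else max(q[nxt].values())
            q[state][action] += ALPHA * (reward + GAMMA * best_next - q[state][action])
            state = nxt
            if done:
                successes += 1
                break
        epsilon = max(EPS_MIN, epsilon * DECAY)        # decay ε after each episode
    return q, successes / episodes


q_table, success_rate = train()
```

After training, the cells next to the goal hold the largest Q-values and the values shrink with distance, which is the backward-propagating gradient the heatmap displays.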

✨ Smart Features:

Difficulty-Adjusted Learning: Easy mazes learn faster (higher α), hard mazes explore longer (decay factor closer to 1, so ε shrinks more slowly)
Reward Shaping: Direction bonus helps agent find goal faster without knowing maze structure
Demo Mode: Click "Demo" to see pure exploitation (ε=0) - shows what agent truly learned without exploration!
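
To see how the decay factor controls the length of the exploration phase, the sketch below estimates how many episodes it takes ε to fall from 1.0 to the 0.01 floor. The α and decay values are the ones listed under Technical Architecture. The assumption that ε is multiplied by the decay factor once per episode, and the names `PRESETS` and `episodes_until_floor`, are mine.

```python
import math

# Presets copied from the figures listed on this page.
PRESETS = {
    "easy":   {"alpha": 0.35, "decay": 0.96},
    "medium": {"alpha": 0.30, "decay": 0.98},
    "hard":   {"alpha": 0.25, "decay": 0.985},
    "insane": {"alpha": 0.20, "decay": 0.99},
}
EPSILON_START, EPSILON_MIN = 1.0, 0.01


def episodes_until_floor(decay, start=EPSILON_START, floor=EPSILON_MIN):
    """Smallest n such that start * decay**n <= floor."""
    return math.ceil(math.log(floor / start) / math.log(decay))


schedule = {name: episodes_until_floor(p["decay"]) for name, p in PRESETS.items()}
```

Under that assumption, Easy reaches the floor after about 113 episodes and Insane after about 459, so the hardest setting explores roughly four times longer.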

⚡ Technical Architecture

Maze Generation: Recursive backtracker algorithm (depth-first search)
State Space: Discrete grid (row, col) positions
Action Space: 4 discrete actions (UP, DOWN, LEFT, RIGHT)
Q-Learning Backend: Python (PyScript) with Q-table dictionary storage
Reward Shaping: Manhattan distance bonus to guide learning
Difficulty-Adapted Learning:
- Easy: α=0.35, decay=0.96 (fast learning)
- Medium: α=0.3, decay=0.98 (balanced)
- Hard: α=0.25, decay=0.985 (cautious learning)
- Insane: α=0.2, decay=0.99 (extensive exploration)
Animation System: Frame-based tweening (60fps) for smooth movement in demo mode
Visualization: Canvas rendering with Q-value heatmap overlay (blue→yellow gradient)
Training Speed: ~4 episodes for 50% success on easy mode!
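
For reference, a recursive backtracker can be sketched as a randomized depth-first search. This version uses an explicit stack instead of recursion so that larger grids do not hit Python's recursion limit. The function name `generate_maze` and the passage-set representation are assumptions, not the demo's actual data structures.

```python
import random


def generate_maze(rows, cols, seed=None):
    """Carve a maze; returns {cell: set of neighbours reachable without a wall}."""
    rng = random.Random(seed)
    passages = {(r, c): set() for r in range(rows) for c in range(cols)}
    stack, visited = [(0, 0)], {(0, 0)}
    while stack:
        r, c = stack[-1]
        unvisited = [
            (r + dr, c + dc)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if (r + dr, c + dc) in passages and (r + dr, c + dc) not in visited
        ]
        if not unvisited:
            stack.pop()  # dead end: backtrack to the previous cell
            continue
        nxt = rng.choice(unvisited)
        passages[(r, c)].add(nxt)  # remove the wall in both directions
        passages[nxt].add((r, c))
        visited.add(nxt)
        stack.append(nxt)
    return passages


maze = generate_maze(10, 10, seed=42)
```

Each cell is carved into exactly once, so the result is a spanning tree of the grid: every cell is reachable and there is exactly one path between any two cells.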

Q-Learning Maze Solver

Watch an AI agent learn to solve mazes using Q-Learning, the classic tabular reinforcement learning algorithm. Perfect introduction to RL fundamentals!

🎯 How to Use

Generate Maze: Click "Generate" to create a random maze. Try different difficulties!
Start Training: Click "Start Training" to begin Q-learning. Watch the 🔵 blue agent blink rapidly between cells as it explores!
Visualize Q-Values: Click "Visualize" to toggle the heatmap overlay. Blue cells = low Q-value | Yellow cells = high Q-value (closer to goal)
Monitor Metrics: Watch Success Rate climb and Epsilon decay as the agent learns
Demo Mode: Once trained, click "Demo" to see smooth, confident movement. The agent moves smoothly (tweened) because it's using pure exploitation (ε=0)
Compare Difficulties: Try Easy (10×10) vs Insane (30×30) to see how learning adapts!

💡 Pro Tip: Training = agent blinks frantically between spots (exploring). Demo = agent glides smoothly (exploiting learned knowledge)!

🌟 Why Q-Learning?

Q-Learning is the perfect introduction to reinforcement learning because:

Intuitive: Easy to visualize and understand
Model-Free: No knowledge of maze structure needed
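Off-Policy: Learns the optimal (greedy) policy while following an exploratory one
Guaranteed Convergence: Tabular Q-learning is proven to converge to the optimal Q-values, provided every state-action pair keeps being visited and the learning rate decays appropriately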
Foundation for Deep RL: Basis for DQN, Double DQN, etc.

⚠️ Q-Table Limitations

Q-tables work great for discrete, small state spaces like mazes. But they don't scale:

Memory: A 30×30 maze has 900 states × 4 actions = 3,600 Q-values, which is manageable, but the table grows multiplicatively with every extra state variable
Continuous States: Can't handle pixel inputs or continuous observations
Generalization: Each state learned independently, no transfer

That's why complex games like Mario use neural networks (function approximation) instead of Q-tables. Check out the Neuroevolution example to see how!
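
The scaling problem can be made concrete with a little arithmetic. The snippet below compares the maze's table size with a deliberately tiny, hypothetical pixel input (an 8×8 black-and-white image); the helper `q_table_entries` is illustrative only.

```python
def q_table_entries(num_states, num_actions=4):
    """Number of Q-values a table needs: one per state-action pair."""
    return num_states * num_actions


maze_entries = q_table_entries(30 * 30)  # the 30×30 maze: 3,600 entries

# Hypothetical pixel input: an 8×8 binary image has 2**64 distinct states.
pixel_states = 2 ** (8 * 8)
pixel_entries = q_table_entries(pixel_states)
```

Even this toy image needs more than 10^19 table entries, far beyond what any dictionary can hold, which is why function approximation takes over at that point.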

View source (MazeRLController.js)
