

🎨 Visualization Colors

Here's what each color represents in the maze:

  • 🟢 Green walls - Maze structure boundaries
  • Green circle - Start position (where agent begins)
  • Red circle - Goal/target position (objective)
  • Blue circle - Agent position (AI solver)

Q-Value Heatmap (when "Visualize" is toggled):

Low Q-value (blue) → High Q-value (yellow)

Brighter/yellower cells = agent learned these positions are more valuable (closer to goal)
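One way to picture the heatmap is as a linear blend between blue and yellow, driven by each cell's best Q-value. The sketch below is an assumption about how such a mapping could work, not the page's actual rendering code; the function name `heatmap_color` and the exact RGB endpoints are hypothetical.

```python
def heatmap_color(value, v_min, v_max):
    """Map a Q-value to an RGB tuple on a blue -> yellow gradient.

    Assumption: pure blue (0, 0, 255) for the lowest value and
    pure yellow (255, 255, 0) for the highest, blended linearly.
    """
    if v_max <= v_min:
        t = 0.0  # degenerate range (e.g. untrained table): everything is "low"
    else:
        t = (value - v_min) / (v_max - v_min)
        t = min(1.0, max(0.0, t))  # clamp values outside the range
    blue, yellow = (0, 0, 255), (255, 255, 0)
    return tuple(round(b + t * (y - b)) for b, y in zip(blue, yellow))


low = heatmap_color(0.0, 0.0, 10.0)
high = heatmap_color(10.0, 0.0, 10.0)
```

Feeding each cell the maximum Q-value over its four actions would reproduce the "brighter = more valuable" reading described above.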

🎓 What is Q-Learning?

Q-Learning is a value-based reinforcement learning algorithm that learns the optimal action to take in each state by maintaining a Q-table:

  • Q-Table: Maps state-action pairs to expected rewards (Q-values)
  • Update Rule (from the Bellman equation): Q(s,a) ← Q(s,a) + α[r + γ·max over a' of Q(s',a') - Q(s,a)]
  • α (Alpha): Learning rate (difficulty-adjusted) - how much new info overrides old
  • γ (Gamma): Discount factor (0.95) - importance of future rewards
  • ε (Epsilon): Exploration rate (starts at 1.0, decays based on difficulty)
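The update rule above fits in a few lines of Python. This is a minimal sketch assuming a dictionary-backed Q-table keyed by (state, action); the function name `q_update`, the default α of 0.3, and the `done` flag are illustrative choices rather than the demo's actual code.

```python
from collections import defaultdict

ACTIONS = ("UP", "DOWN", "LEFT", "RIGHT")


def q_update(q_table, state, action, reward, next_state,
             alpha=0.3, gamma=0.95, done=False):
    """Apply one Q-learning update and return the new Q(s, a)."""
    # A terminal state has no future value (assumption: the goal ends the episode).
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


q = defaultdict(float)  # every unseen (state, action) pair starts at 0
first = q_update(q, (0, 0), "RIGHT", -0.4, (0, 1))
at_goal = q_update(q, (0, 1), "RIGHT", 100.0, (0, 2), done=True)
```

With an all-zero table, the first call moves Q((0,0), RIGHT) 30% of the way toward the reward of -0.4, and the second moves Q((0,1), RIGHT) 30% of the way toward +100.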

🔍 Exploration vs Exploitation

The agent uses an epsilon-greedy strategy to balance learning and performance:

  • Exploration (probability ε, decaying from 1.0 to 0.01): Take a random action to discover new paths
  • Exploitation (probability 1 - ε): Use learned Q-values to take the best known action
  • Epsilon Decay: Gradually shift from exploration to exploitation as agent learns

Watch the Epsilon (ε) metric decrease over time - as it approaches 0.01, the agent transitions from random exploration to exploiting learned strategies!
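Epsilon-greedy selection can be sketched as follows. The helper names `choose_action` and `decay_epsilon` are hypothetical, and the multiplicative per-episode decay with a 0.01 floor is an assumption based on the ranges quoted above.

```python
import random

ACTIONS = ("UP", "DOWN", "LEFT", "RIGHT")


def choose_action(q_values, epsilon, rng=random):
    """Pick an action for one state; q_values maps action -> Q-value."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore: uniform random action
    # exploit: highest Q-value, unseen actions count as 0
    return max(ACTIONS, key=lambda a: q_values.get(a, 0.0))


def decay_epsilon(epsilon, decay=0.98, floor=0.01):
    """Shrink epsilon by a constant factor, never dropping below the floor."""
    return max(floor, epsilon * decay)


greedy = choose_action({"LEFT": 1.0, "UP": -2.0}, epsilon=0.0)
explored = choose_action({}, epsilon=1.0)
```

Setting `epsilon=0.0` always returns the best known action, which is the behavior Demo mode relies on.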

🎯 Reward Structure

The agent receives rewards for its actions:

  • Hit Wall: -1.0 (strong discouragement)
  • Normal Move: -0.5 base penalty + directional bonus
    • Move closer to goal: +0.1 (reward shaping guides learning)
    • Move away from goal: -0.1 (discourages wrong direction)
  • Goal Reached: +100 (big success!)
  • Episode Limit: 1000 steps max to prevent infinite loops

The reward shaping (directional bonus) acts like a compass 🧭, guiding the agent toward the goal while still discovering the optimal path through Q-learning!
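The reward rules listed above could be written roughly like this. It is a sketch under two assumptions that the page does not spell out: distance to the goal is measured as Manhattan distance, and reaching the goal pays a flat +100 with no move penalty. The name `compute_reward` is hypothetical.

```python
def manhattan(a, b):
    """Grid distance between two (row, col) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def compute_reward(pos, new_pos, goal, hit_wall):
    """Reward for attempting to move from pos to new_pos."""
    if hit_wall:
        return -1.0
    if new_pos == goal:
        return 100.0  # assumption: no move penalty on the winning step
    reward = -0.5  # base penalty for a normal move
    before, after = manhattan(pos, goal), manhattan(new_pos, goal)
    if after < before:
        reward += 0.1  # moved closer to the goal
    elif after > before:
        reward -= 0.1  # moved away from the goal
    return reward


closer = compute_reward((0, 0), (0, 1), (3, 3), hit_wall=False)
away = compute_reward((1, 1), (0, 1), (3, 3), hit_wall=False)
```

Because every normal move still costs at least 0.4, the agent is pushed toward short paths even while the compass bonus nudges it in the right direction.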

🧠 How Q-Learning Learns

  1. Initialize: Start with empty Q-table (all values = 0)
  2. Episode Loop: Agent spawns at 🟢 green position
  3. Choose Action: ε-greedy (explore random or exploit best Q-value)
  4. Execute: Move agent, observe reward and next state
  5. Update Q-Value: Apply Bellman equation to learn from experience
  6. Repeat: Until reaching 🔴 red goal or step limit (the agent blinks rapidly between cells while training)
  7. Decay ε: Reduce exploration rate based on maze difficulty
  8. Next Episode: Reset agent, repeat with updated Q-values

Over time, Q-values propagate backward from the goal, creating a "gradient" that guides the agent! You can see this gradient visualized when you toggle "Visualize" (brighter = higher Q-value).
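The eight steps above can be put together in one short training loop. To keep the sketch self-contained it uses a hypothetical 4×4 open grid with only border walls and no directional bonus, rather than a generated maze; `step`, `train`, and `greedy_path` are illustrative names, and the hyperparameters are the Easy values quoted on this page.

```python
import random
from collections import defaultdict

MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


def step(state, action, size, goal):
    """Hypothetical environment: returns (next_state, reward, done)."""
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
        return state, -1.0, False  # bumped into the border wall
    if nxt == goal:
        return nxt, 100.0, True
    return nxt, -0.5, False


def train(size=4, start=(0, 0), goal=(3, 3), episodes=300,
          alpha=0.35, gamma=0.95, decay=0.96, max_steps=1000, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # step 1: all Q-values start at 0
    epsilon = 1.0
    for _ in range(episodes):  # steps 2 and 8: fresh episode from start
        state = start
        for _ in range(max_steps):
            # step 3: epsilon-greedy action choice
            if rng.random() < epsilon:
                action = rng.choice(list(MOVES))
            else:
                action = max(MOVES, key=lambda a: q[(state, a)])
            # step 4: move and observe
            nxt, reward, done = step(state, action, size, goal)
            # step 5: Q-learning update
            best_next = 0.0 if done else max(q[(nxt, a)] for a in MOVES)
            q[(state, action)] += alpha * (
                reward + gamma * best_next - q[(state, action)])
            state = nxt
            if done:  # step 6: stop at the goal (or at max_steps)
                break
        epsilon = max(0.01, epsilon * decay)  # step 7: decay exploration
    return q


def greedy_path(q, size=4, start=(0, 0), goal=(3, 3), max_steps=50):
    """Follow the best known action from start (pure exploitation)."""
    path, state = [start], start
    for _ in range(max_steps):
        if state == goal:
            break
        action = max(MOVES, key=lambda a: q[(state, a)])
        state = step(state, action, size, goal)[0]
        path.append(state)
    return path


q_table = train()
path = greedy_path(q_table)
```

After training, `greedy_path` plays the role of Demo mode: it ignores ε entirely and simply follows the gradient of Q-values toward the goal.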

✨ Smart Features:

  • Difficulty-Adjusted Learning: Easy mazes learn faster (higher α), hard mazes explore longer (slower ε decay, i.e. a decay factor closer to 1)
  • Reward Shaping: Direction bonus helps agent find goal faster without knowing maze structure
  • Demo Mode: Click "Demo" to see pure exploitation (ε=0) - shows what agent truly learned without exploration!
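To get a feel for how much longer the harder settings keep exploring, the sketch below counts how many episodes it takes ε to fall from 1.0 to the 0.01 floor. It assumes ε is multiplied by the decay factor once per episode, which the page implies but does not state; the α and decay values are the ones listed under Technical Architecture, and `episodes_until_floor` is a hypothetical helper.

```python
import math

# Values as listed on this page for each difficulty.
DIFFICULTY = {
    "easy": {"alpha": 0.35, "decay": 0.96},
    "medium": {"alpha": 0.30, "decay": 0.98},
    "hard": {"alpha": 0.25, "decay": 0.985},
    "insane": {"alpha": 0.20, "decay": 0.99},
}


def episodes_until_floor(decay, start=1.0, floor=0.01):
    """Smallest n with start * decay**n <= floor (multiplicative decay)."""
    return math.ceil(math.log(floor / start) / math.log(decay))


schedule = {name: episodes_until_floor(cfg["decay"])
            for name, cfg in DIFFICULTY.items()}
# schedule -> {"easy": 113, "medium": 228, "hard": 305, "insane": 459}
```

Under this assumption, Insane mode keeps a noticeable amount of randomness for roughly four times as many episodes as Easy mode.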

⚡ Technical Architecture

  • Maze Generation: Recursive backtracker algorithm (depth-first search)
  • State Space: Discrete grid (row, col) positions
  • Action Space: 4 discrete actions (UP, DOWN, LEFT, RIGHT)
  • Q-Learning Backend: Python (PyScript) with Q-table dictionary storage
  • Reward Shaping: Manhattan distance bonus to guide learning
  • Difficulty-Adapted Learning:
    • Easy: α=0.35, decay=0.96 (fast learning)
    • Medium: α=0.3, decay=0.98 (balanced)
    • Hard: α=0.25, decay=0.985 (cautious learning)
    • Insane: α=0.2, decay=0.99 (extensive exploration)
  • Animation System: Frame-based tweening (60fps) for smooth movement in demo mode
  • Visualization: Canvas rendering with Q-value heatmap overlay (blue→yellow gradient)
  • Training Speed: ~4 episodes for 50% success on easy mode!
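The recursive backtracker named above is a randomized depth-first search that carves passages between grid cells. Here is a minimal sketch that uses an explicit stack instead of recursion so that large mazes do not hit Python's recursion limit. The data layout (a dict mapping each cell to the set of cells it connects to) and the name `generate_maze` are assumptions for illustration.

```python
import random


def generate_maze(rows, cols, seed=None):
    """Carve a perfect maze; returns {cell: set of connected neighbor cells}."""
    rng = random.Random(seed)
    passages = {(r, c): set() for r in range(rows) for c in range(cols)}
    visited = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        r, c = stack[-1]
        unvisited = [
            (r + dr, c + dc)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            if (r + dr, c + dc) in passages and (r + dr, c + dc) not in visited
        ]
        if not unvisited:
            stack.pop()  # dead end: backtrack
            continue
        nxt = rng.choice(unvisited)
        passages[(r, c)].add(nxt)  # knock down the wall in both directions
        passages[nxt].add((r, c))
        visited.add(nxt)
        stack.append(nxt)
    return passages


maze = generate_maze(10, 10, seed=1)
passage_count = sum(len(links) for links in maze.values()) // 2
```

A maze built this way is a spanning tree of the grid, so a 10×10 maze always has exactly 99 passages and exactly one path between any two cells.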

Q-Learning Maze Solver

Watch an AI agent learn to solve mazes using Q-Learning, the classic tabular reinforcement learning algorithm. Perfect introduction to RL fundamentals!

🎯 How to Use

  1. Generate Maze: Click "Generate" to create a random maze. Try different difficulties!
  2. Start Training: Click "Start Training" to begin Q-learning. Watch the 🔵 blue agent blink rapidly between cells as it explores!
  3. Visualize Q-Values: Click "Visualize" to toggle the heatmap overlay. Blue cells = low Q-value | Yellow cells = high Q-value (closer to goal)
  4. Monitor Metrics: Watch Success Rate climb and Epsilon decay as the agent learns
  5. Demo Mode: Once trained, click "Demo" to see smooth, confident movement. The agent moves smoothly (tweened) because it is using pure exploitation (ε=0)
  6. Compare Difficulties: Try Easy (10×10) vs Insane (30×30) to see how learning adapts!

💡 Pro Tip: Training = agent blinks frantically between spots (exploring). Demo = agent glides smoothly (exploiting learned knowledge)!

🌟 Why Q-Learning?

Q-Learning is the perfect introduction to reinforcement learning because:

  • Intuitive: Easy to visualize and understand
  • Model-Free: No knowledge of maze structure needed
  • Off-Policy: Learns optimal policy while exploring
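  • Convergence Guarantee: In the tabular setting, provably converges to the optimal Q-values, provided every state-action pair keeps being visited and the learning rate decays appropriately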
  • Foundation for Deep RL: Basis for DQN, Double DQN, etc.

⚠️ Q-Table Limitations

Q-tables work great for discrete, small state spaces like mazes. But they don't scale:

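  • Memory: A 30×30 maze needs only 900 states × 4 actions = 3,600 Q-values, but the table needs one row for every distinct state, so it grows quickly as the state gets richer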
  • Continuous States: Can't handle pixel inputs or continuous observations
  • Generalization: Each state learned independently, no transfer

That's why complex games like Mario use neural networks (function approximation) instead of Q-tables. Check out the Neuroevolution example to see how!
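A quick back-of-the-envelope calculation shows where the table stops being practical. The maze sizes below come from this page; the 84×84 black-and-white image is a hypothetical example of a pixel input, chosen only to illustrate the scale.

```python
def q_table_entries(num_states, num_actions=4):
    """Number of Q-values a table needs: one per (state, action) pair."""
    return num_states * num_actions


easy_entries = q_table_entries(10 * 10)    # Easy maze
insane_entries = q_table_entries(30 * 30)  # Insane maze

# Hypothetical pixel input: each of the 84 * 84 pixels is either on or off,
# so the number of distinct states is 2 to the power of the pixel count.
pixel_states = 2 ** (84 * 84)
pixel_entries = q_table_entries(pixel_states)
```

Even the largest maze fits in a few thousand numbers, while the pixel example needs a count of entries that runs to more than two thousand digits, far beyond anything a table could store.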

View source (MazeRLController.js)
