Method Overview
Core Philosophy
GrowingNN is built on plain Stochastic Gradient Descent (SGD) and deliberately avoids the advanced tooling commonly used to train neural networks. This simplicity makes it well suited to educational settings, where the focus is on fundamental machine learning principles.
Model Architecture
Graph-Based Structure
The model is the main structure that stores layers as nodes in a directed acyclic graph (DAG). Key characteristics:
- Layer Identifiers: The system operates on layer identifiers rather than sequential positions
- Independent Layers: Each layer is an independent structure that manages its own:
  - Incoming connections (input layers)
  - Outgoing connections (output layers)
- Default Structure: Starts with the smallest possible setup:
  - One input layer
  - One output layer
  - A single connection between them
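As a concrete illustration, the default structure above can be sketched in a few lines of Python. The class and attribute names here are illustrative stand-ins, not the GrowingNN API:

```python
# Illustrative sketch of the graph-based structure (not the GrowingNN
# API): each layer is an independent node that tracks its own incoming
# and outgoing connections by layer identifier, forming a DAG.

class Layer:
    def __init__(self, layer_id):
        self.layer_id = layer_id
        self.input_layers = []   # identifiers of incoming connections
        self.output_layers = []  # identifiers of outgoing connections

class Model:
    def __init__(self):
        # smallest possible setup: one input, one output, one connection
        self.layers = {"input": Layer("input"), "output": Layer("output")}
        self.connect("input", "output")

    def connect(self, src, dst):
        self.layers[src].output_layers.append(dst)
        self.layers[dst].input_layers.append(src)

m = Model()
```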
Evolution Process
As training progresses through generations:
- Layer Addition: New layers may be added between existing layers
- Layer Removal: Existing layers may be removed (with automatic reconnection)
- Connection Growth: As structure grows, each layer gains more incoming and outgoing connections
- Neuron Modification: Neurons can be added or removed within layers
Training Process
Generation-Based Training
Training is divided into generations, where each generation consists of:
- Training Phase: Multiple epochs of gradient descent
- Structure Modification Phase: Potential architectural changes based on simulation
Propagation Mechanism
During forward propagation:
- Signal Collection: Each layer waits until it receives signals from all input layers
- Signal Averaging: Once all signals are received, they are averaged together
- Processing: The averaged signal is processed through the layer's transformation (weights, activation)
- Propagation: The processed signal is propagated through all outgoing connections
This mechanism allows the network to handle:
- Multiple input paths
- Residual connections
- Complex graph topologies
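The four steps above can be sketched with NumPy; the `relu` activation and the weight shapes are arbitrary choices for illustration, not the library's internals:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(incoming_signals, W, b):
    # 1.-2. collect the signals from all input layers and average them
    x = np.mean(np.stack(incoming_signals), axis=0)
    # 3. process the averaged signal through weights and activation
    return relu(W @ x + b)
    # 4. the caller then propagates the result along outgoing connections

# two input paths feeding one layer, e.g. a residual topology
signals = [np.array([1.0, 3.0]), np.array([3.0, 1.0])]
out = forward(signals, np.eye(2), np.zeros(2))  # averaged input is [2.0, 2.0]
```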
Simulation Orchestrator
Purpose
The simulation orchestrator (implemented as canSimulate()) determines when a simulation should be executed during the learning process. This timing is critical for maintaining balance between:
- Exploration: Trying new structures
- Exploitation: Fully training current structures
The Balance Problem
- Too Frequent Changes: May prevent a particular structure from being fully trained
- Too Infrequent Changes: May lead to:
  - Repeatedly falling into local minima
  - Excessively long training times
  - An inefficient method overall
Progress Check Method
The default approach is progress check:
- After each generation, check if there has been improvement in the model's learning
- If no improvement is detected, run the simulation
- If improvement is detected, continue training without structural changes
This means: If a given structure is capable of learning the dataset, training continues without changes to the structure.
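A minimal sketch of this rule, with illustrative names (the library's `canSimulate()` and its improvement threshold correspond roughly to the function and parameter below):

```python
def can_simulate(loss_history, progress_delta=0.01):
    """Run a structure simulation only when learning has stalled."""
    if len(loss_history) < 2:
        return False                      # not enough data yet
    improvement = loss_history[-2] - loss_history[-1]
    return improvement < progress_delta   # stalled -> simulate

can_simulate([0.9, 0.5])     # clear improvement -> False
can_simulate([0.50, 0.499])  # stalled -> True
```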
Implementation
simulation_scheduler = gnn.SimulationScheduler(
    gnn.SimulationScheduler.PROGRESS_CHECK,  # Progress-based triggering
    simulation_time=300,
    simulation_epochs=20,
    progres_delta=0.01  # Minimum improvement threshold
)
Progressive Learning Rate
Custom Implementation
The algorithm implements a custom variant of the progressive learning rate schedule described by Schaul et al. This schedule plays a crucial role in minimizing the impact of structural changes.
Behavior Pattern
Within a single generation, the learning rate follows this pattern:
- Initial Epoch (after structure change): Learning rate is very close to zero
- Growth Phase: Learning rate grows to a constant maximum value
- Constant Phase: Maintains maximum value
- Decay Phase: Learning rate slowly decreases before the next potential action
Mathematical Description
The learning rate \(\alpha(t)\) within a generation of \(T\) epochs:
- Early epochs: \(\alpha(t) \approx 0\) (minimal learning)
- Middle epochs: \(\alpha(t)\) increases to \(\alpha_{\max}\)
- Late epochs: \(\alpha(t)\) decreases from \(\alpha_{\max}\)
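One way to realize this pattern is a piecewise warm-up/constant/decay schedule. The phase fractions below are arbitrary choices for illustration, not the library's exact formula:

```python
def progressive_lr(t, T, lr_max, warmup_frac=0.2, decay_frac=0.2):
    """Learning rate at epoch t of a generation lasting T epochs."""
    warmup = max(1, int(T * warmup_frac))
    decay_start = T - int(T * decay_frac)
    if t < warmup:                    # near-zero start after a change
        return lr_max * t / warmup
    if t < decay_start:               # constant phase at the maximum
        return lr_max
    # slow decay before the next potential structural action
    return lr_max * (T - t) / (T - decay_start)
```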
Purpose
This pattern minimizes the negative impact of introduced changes on already learned information in the network. By starting with a very low learning rate after structural modifications, the network can:
- Preserve existing knowledge
- Gradually adapt to the new structure
- Avoid catastrophic forgetting
Implementation
lr_scheduler = gnn.LearningRateScheduler(
    gnn.LearningRateScheduler.PROGRESIVE,  # Progressive schedule
    initial_lr=0.005,  # Maximum learning rate
    decay_factor=0.8   # Decay rate
)
Connection Types
Sequential Connections
Definition: Direct connections between layers in sequence.
Action: Add_Seq_Layer adds a layer between two directly connected layers, breaking the direct connection and inserting the new layer.
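On a plain adjacency dict the rewiring looks like this (an illustrative sketch; the real action also initializes the new layer's weights):

```python
def add_seq_layer(graph, src, dst, new_layer):
    """Break the direct src -> dst edge and insert new_layer between them."""
    graph[src].remove(dst)        # break the direct connection
    graph[src].append(new_layer)  # src -> new_layer
    graph[new_layer] = [dst]      # new_layer -> dst

graph = {"input": ["output"], "output": []}
add_seq_layer(graph, "input", "output", "hidden_1")
# graph is now {"input": ["hidden_1"], "output": [], "hidden_1": ["output"]}
```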
Matrix-Extended Residual Connections
Definition: Skip connections that bypass one or more layers, but unlike traditional ResNet, these use direct connections between neurons rather than summation points.
Key Difference from ResNet:
- Traditional ResNet: Residual connections merge at summation points: \(y = f(x) + x\)
- GrowingNN: Matrix-extended residual connections use direct neuron-to-neuron connections
Benefits:
- More granular control over information flow
- Makes neuron removal more effective
- Directly affects all connected layers when neurons are removed
Action: Add_Res_Layer adds a layer with residual connections between non-directly connected layers.
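The difference between the two merge styles can be made concrete with NumPy; the vectors and matrices below are arbitrary stand-ins, not the library's internals:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)    # the skipped input
f_x = rng.standard_normal(4)  # stand-in for the bypassed path f(x)

# Traditional ResNet: the skip path is merged at a summation point.
y_resnet = f_x + x

# Matrix-extended residual: the skip path enters through explicit
# neuron-to-neuron weights, so removing neuron j of x simply deletes
# column j of W_skip, directly affecting every connected layer.
W_skip = rng.standard_normal((4, 4))
y_extended = f_x + W_skip @ x
```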
Layer Removal
Action: Del_Layer removes an existing layer and automatically reconnects:
- Input layers of the removed layer → Output layers of the removed layer
- Maintains graph connectivity
- Prevents cycles in the structural graph
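The reconnection rule can be sketched on the same kind of adjacency dict (illustrative only; the real action also remaps weights and checks for cycles):

```python
def del_layer(graph, layer):
    """Remove `layer`, reconnecting its inputs directly to its outputs."""
    successors = graph.pop(layer)
    for src, targets in graph.items():
        if layer in targets:
            targets.remove(layer)
            # input layers of the removed layer -> its output layers
            for dst in successors:
                if dst not in targets:
                    targets.append(dst)

graph = {"input": ["hidden"], "hidden": ["output"], "output": []}
del_layer(graph, "hidden")
# graph is now {"input": ["output"], "output": []}
```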
Action Types
The algorithm modifies architecture using three main types of actions:
- Sequential Layer Addition: Add layer between directly connected layers
- Residual Layer Addition: Add layer with skip connections
- Layer Removal: Remove existing layer with automatic reconnection
Additionally, the system supports:
4. Neuron Removal: Reduce the neurons in a layer (10%, 50%, 90%)
5. Neuron Addition: Expand the neurons in a layer (150%, 200%)
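Assuming the percentages denote the fraction removed (neuron removal) or the new size relative to the old (neuron addition), which is one reading of the list above rather than the library's documented semantics, the resulting counts are simple arithmetic:

```python
def resize_neurons(n, factor):
    """New neuron count after scaling a layer of n neurons by factor."""
    return max(1, round(n * factor))  # never shrink a layer to zero

n = 100
removed = [resize_neurons(n, 1 - p) for p in (0.10, 0.50, 0.90)]  # keep 90, 50, 10
added = [resize_neurons(n, p) for p in (1.50, 2.00)]              # grow to 150, 200
```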
Algorithm Flow
Initialization:
    SimSet ← Create simulation set
    Model ← Create model with basic structure (input → output)

For each generation:
    1. GradientDescent(Model, Dataset, epochs)
       - Train current structure
       - Use progressive learning rate
    2. If canSimulate():
       - Check if learning has improved
       - If no improvement:
         a. Generate all possible actions
         b. Action ← MCTS.simulation(Model)
         c. Model ← Action.execute(Model)
         d. Adjust learning rate (start near zero)
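The flow above can be condensed into a runnable toy loop; every function and class body here is an illustrative stand-in (the loss values and action effects are faked), not the GrowingNN implementation:

```python
def gradient_descent(model, dataset, epochs):
    # stand-in: pretend training shrinks the loss slightly
    return model["loss"] * 0.99

class GrowAction:
    def execute(self, model):
        model = dict(model, layers=model["layers"] + 1)
        model["loss"] *= 0.5  # pretend the new layer helps
        return model

def can_simulate(history, delta=0.01):
    # progress check: simulate only when improvement stalls
    return len(history) >= 2 and history[-2] - history[-1] < delta

def mcts_best_action(model):
    return GrowAction()  # stand-in for the MCTS search over actions

def train(model, dataset, generations, epochs):
    history = []
    for _ in range(generations):
        model["loss"] = gradient_descent(model, dataset, epochs)  # 1. train
        history.append(model["loss"])
        if can_simulate(history):  # 2. change structure only when stalled
            model = mcts_best_action(model).execute(model)
    return model

model = train({"loss": 1.0, "layers": 2}, dataset=None, generations=5, epochs=10)
```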
Use Cases
1. Training from Scratch
The primary use case: train a network that dynamically evolves its structure during training.
Characteristics:
- Starts from minimal structure
- Grows based on learning needs
- Structure changes only when learning stalls
- Focus on accuracy improvement
2. Optimizing Existing Networks
Secondary use case: take an already-trained network and optimize it for parameter efficiency.
Characteristics:
- Starts from a pre-trained model
- Focus on parameter reduction
- Maintains accuracy while reducing size
- Two-phase approach recommended
Educational Value
The simplicity of the approach makes it valuable for:
- Understanding SGD: Core training mechanism
- Learning Neural Networks: Basic forward/backward propagation
- Architecture Search: How structures can evolve
- Graph-Based Models: Understanding DAG representations
- No Black Boxes: Avoids advanced frameworks, focuses on fundamentals
References
- Stochastic Gradient Descent: Core optimization method
- Schaul et al.: Progressive learning rate inspiration
- Monte Carlo Tree Search: Action selection mechanism
- ResNet: Inspiration for residual connections (extended in this work)