academia.agents module
Module contents
This module contains implementations of reinforcement learning algorithms, including tabular methods and those based on neural networks.
Exported classes:
- DQNAgent
- PPOAgent
- QLAgent
- SarsaAgent
Note
DQNAgent and PPOAgent need to be provided with network architectures when initializing. These network architectures should be subclasses of torch.nn.Module. Example architectures can be found in academia.utils.models.
- class academia.agents.DQNAgent(nn_architecture: Type[Module], n_actions: int, gamma: float = 0.99, epsilon: float = 1.0, epsilon_decay: float = 0.995, min_epsilon: float = 0.01, batch_size: int = 64, random_state: int | None = None, replay_memory_size: int = 100000, lr: float = 0.0005, tau: float = 0.001, update_every: int = 3, device: Literal['cpu', 'cuda'] = 'cpu')
Bases:
EpsilonGreedyAgent
Class representing a Deep Q-Network (DQN) agent for reinforcement learning tasks.
The DQNAgent class implements the Deep Q-Network (DQN) algorithm for reinforcement learning tasks. It uses a neural network to approximate the Q-values of actions in a given environment. The agent learns from experiences stored in a replay memory and performs updates to its Q-values during training episodes. The target network is soft updated to stabilize training.
- Parameters:
nn_architecture – Type of neural network architecture to be used.
n_actions – Number of possible actions in the environment.
gamma – Discount factor for future rewards. Defaults to 0.99.
epsilon – Initial exploration-exploitation trade-off parameter. Defaults to 1.0.
epsilon_decay – Decay factor for epsilon over time. Defaults to 0.995.
min_epsilon – Minimum epsilon value to ensure exploration. Defaults to 0.01.
batch_size – Size of the mini-batch used for training. Defaults to 64.
random_state – Seed for random number generation. Defaults to None.
replay_memory_size – Maximum size of the replay memory. Defaults to 100000.
lr – Learning rate for the optimizer. Defaults to 0.0005.
tau – Interpolation parameter for target network soft updates. Defaults to 0.001.
update_every – Frequency of network updates. Defaults to 3.
device – Device to use for training. Defaults to 'cpu'.
- nn_architecture
Type of neural network architecture to be used.
- Type:
Type[nn.Module]
- epsilon
Exploration-exploitation trade-off parameter.
- Type:
float
- min_epsilon
Minimum value for epsilon during exploration.
- Type:
float
- epsilon_decay
Decay rate for epsilon.
- Type:
float
- n_actions
Number of possible actions in the environment.
- Type:
int
- gamma
Discount factor.
- Type:
float
- memory
Replay memory used to store experiences for training.
- Type:
deque
- batch_size
Size of the mini-batch used for training.
- Type:
int
- network
Neural network used to approximate Q-values.
- Type:
nn.Module
- target_network
Target network used to stabilize training.
- Type:
nn.Module
- optimizer
Optimizer used for training.
- Type:
optim.Optimizer
- experience
Named tuple representing an experience tuple which stores state, action, reward, new_state, and done.
- Type:
namedtuple
- train_step
Counter for the number of training steps performed.
- Type:
int
- replay_memory_size
Maximum size of the replay memory.
- Type:
int
- lr
Learning rate for the optimizer.
- Type:
float
- tau
Interpolation parameter for target network soft updates.
- Type:
float
- update_every
Frequency of network updates.
- Type:
int
- device
Device used for training.
- Type:
Literal['cpu', 'cuda']
Examples
>>> from academia.agents import DQNAgent
>>> from academia.environments import DoorKey
>>> # Import custom neural network architecture
>>> from academia.utils.models import door_key
>>>
>>> # Create an environment:
>>> env = DoorKey(difficulty=0, append_step_count=True)
>>> # Create an instance of the DQNAgent class with
>>> # custom neural network architecture
>>> dqn_agent = DQNAgent(
>>>     nn_architecture=door_key.MLPStepDQN,
>>>     n_actions=DoorKey.N_ACTIONS,
>>>     gamma=0.99,
>>>     epsilon=1.0,
>>>     epsilon_decay=0.99,
>>>     min_epsilon=0.01,
>>>     batch_size=64,
>>> )
>>> # Training loop: Update the agent using experiences
>>> # (state, action, reward, new_state, done)
>>> num_episodes = 100
>>> for episode in range(num_episodes):
>>>     state = env.reset()
>>>     done = False
>>>     while not done:
>>>         action = dqn_agent.get_action(state)
>>>         new_state, reward, terminated = env.step(action)
>>>         if terminated:
>>>             done = True
>>>         dqn_agent.update(state, action, reward, new_state, done)
>>>         state = new_state
>>>
>>> # Save the agent's state dictionary to a file
>>> dqn_agent.save('dqn_agent')
>>>
>>> # Load the agent's state dictionary from a file
>>> dqn_agent = DQNAgent.load('dqn_agent')
Note
Ensure that the custom neural network architecture passed to the constructor inherits from torch.nn.Module and is appropriate for the task. The agent's exploration-exploitation strategy is based on the epsilon-greedy method.
The __soft_update_target() method updates the target network weights from the main network's weights using the rule target_weights = tau * main_weights + (1 - tau) * target_weights, where tau << 1.
It is recommended to adjust hyperparameters such as gamma, epsilon, epsilon_decay, and batch_size based on the specific task and environment.
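For illustration, the soft update described above can be sketched as follows. This is a minimal standalone sketch, not the agent's actual (private) implementation; it assumes both networks share the same architecture:
>>> import torch
>>>
>>> @torch.no_grad()
>>> def soft_update(network: torch.nn.Module, target_network: torch.nn.Module, tau: float) -> None:
>>>     # target_weights = tau * main_weights + (1 - tau) * target_weights, with tau << 1
>>>     for param, target_param in zip(network.parameters(), target_network.parameters()):
>>>         target_param.mul_(1.0 - tau).add_(tau * param)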
- get_action(state: Any, legal_mask: ndarray[Any, dtype[int32]] | None = None, greedy: bool = False) int
Selects an action based on the current state using the epsilon-greedy strategy.
- Parameters:
state – The current state representation used to make the action selection decision.
legal_mask – A binary mask indicating the legality of actions. If provided, restricts the agent’s choices to legal actions.
greedy – A boolean flag indicating whether to force a greedy action selection. If True, the function always chooses the action with the highest Q-value, ignoring exploration.
- Returns:
The index of the selected action.
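To make the strategy concrete, the selection logic can be sketched as below. This is a hypothetical helper; the actual method obtains the Q-values from the agent's network:
>>> import numpy as np
>>>
>>> def epsilon_greedy_action(q_values, epsilon, legal_mask=None, greedy=False, rng=None):
>>>     rng = np.random.default_rng() if rng is None else rng
>>>     if legal_mask is not None:
>>>         # Illegal actions can never be selected.
>>>         q_values = np.where(legal_mask == 1, q_values, -np.inf)
>>>     if greedy or rng.random() > epsilon:
>>>         # Exploit: pick the highest-valued (legal) action.
>>>         return int(np.argmax(q_values))
>>>     # Explore: sample uniformly among the (legal) actions.
>>>     candidates = np.flatnonzero(legal_mask == 1) if legal_mask is not None else np.arange(len(q_values))
>>>     return int(rng.choice(candidates))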
- classmethod load(path: str) DQNAgent
Loads the state dictionary of the neural network model, target network model and agent parameters from the specified file.
- Parameters:
path – Path to a file from which to load the model’s state dictionary.
- Returns:
A loaded instance of DQNAgent.
- save(path: str) str
Saves the state dictionary of the neural network model, target network model and agent parameters to the specified file path.
- Parameters:
path – Path to a file (including filename and extension) where the model’s state dictionary will be saved.
- Returns:
An absolute path to the saved file.
- update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool)
Updates the DQN network weights to better estimate Q-values of every action.
- Parameters:
state – Current state of the environment.
action – Action taken in the current state.
reward – Reward received after taking the action.
new_state – Next state of the environment after taking the action.
is_terminal – A flag indicating whether the new state is a terminal state.
- class academia.agents.PPOAgent(actor_architecture: Type[Module], critic_architecture: Type[Module], n_actions: int, discrete: bool = True, batch_size: int = 64, n_epochs: int = 5, n_steps: int | None = None, n_episodes: int | None = 10, clip: float = 0.2, lr: float = 0.0003, covariance_fill: float = 0.5, entropy_coefficient: float = 0.01, gamma: float = 0.99, random_state: int | None = None, device: Literal['cpu', 'cuda'] = 'cpu')
Bases:
Agent
Class representing a Proximal Policy Optimization (PPO) agent for reinforcement learning tasks. Paper on PPO: https://arxiv.org/pdf/1707.06347.pdf
- Parameters:
actor_architecture – Type of neural network architecture to be used for the actor.
critic_architecture – Type of neural network architecture to be used for the critic.
n_actions – Number of possible actions in the environment.
discrete – Whether the agent's action space is discrete. Defaults to True.
batch_size – The size of the minibatch used during training. Defaults to 64.
n_epochs – Number of epochs per training. Defaults to 5.
n_steps – Minimum number of steps to take between training sessions. Note that if the minimum is reached during an episode, the episode will still finish and the remaining steps will be included in the buffer. If set to None, n_episodes will be used instead. Exactly one of n_steps and n_episodes must be not None. Defaults to None.
n_episodes – Number of episodes to take between training sessions. If set to None, n_steps will be used instead. Exactly one of n_steps and n_episodes must be not None. Defaults to 10.
clip – Clip rate hyperparameter from the PPO algorithm. Defaults to 0.2.
lr – Learning rate used by the (Adam) optimizers. The same value is used for both actor and critic. Defaults to 3e-4.
covariance_fill – Value on the diagonal of the covariance matrix used to randomly sample continuous actions when discrete is False. Defaults to 0.5.
entropy_coefficient – Coefficient used to control the impact of entropy on the loss function. Defaults to 0.01.
gamma – Discount factor for future rewards. Defaults to 0.99.
random_state – Seed for random number generation. Defaults to None.
device – Device used for computation. Possible values are 'cuda' and 'cpu'. Defaults to 'cpu'.
- n_actions
Number of possible actions in the environment.
- Type:
int
- gamma
Discount factor for future rewards.
- Type:
float
- discrete
Whether the agent’s action space is discrete.
- Type:
bool
- clip
Clip rate hyperparameter from the PPO algorithm.
- Type:
float
- lr
Learning rate used by (Adam) optimizers.
- Type:
float
- entropy_coefficient
Coefficient used to control the impact of entropy on the loss function.
- Type:
float
- batch_size
The size of the minibatch used during training.
- Type:
int
- n_epochs
Number of epochs per training.
- Type:
int
- device
Device used for computation.
- Type:
Literal['cpu', 'cuda']
- buffer
Trajectory buffer. This object contains all transitions gathered since the last training session.
- Type:
PPOBuffer
- actor
Actor neural network.
- Type:
nn.Module
- critic
Critic neural network.
- Type:
nn.Module
- actor_architecture
Type of neural network architecture to be used for the actor.
- Type:
Type[nn.Module]
- critic_architecture
Type of neural network architecture to be used for the critic.
- Type:
Type[nn.Module]
Examples
>>> from academia.agents import PPOAgent
>>> from academia.environments import LavaCrossing
>>> from academia.curriculum import LearningTask
>>> from academia.utils.models import lava_crossing
>>>
>>> task = LearningTask(
>>>     LavaCrossing,
>>>     env_args={'difficulty': 0},
>>>     stop_conditions={'max_episodes': 100}
>>> )
>>> agent = PPOAgent(
>>>     actor_architecture=lava_crossing.MLPActor,
>>>     critic_architecture=lava_crossing.MLPCritic,
>>>     n_actions=3
>>> )
>>> task.run(agent)
Note
PPOAgent currently does not support legal masks.
PPOAgent currently does not provide implementations for the update_exploration() or reset_exploration() methods.
- class PPOBuffer(n_steps: int | None = None, n_episodes: int | None = None)
Bases:
object
Class representing the buffer of PPOAgent
- Parameters:
n_steps – Minimum number of steps to take between training sessions. Note that if the minimum is reached during an episode, the episode will still finish and the remaining steps will be included in the buffer. If set to None, n_episodes will be used instead. Exactly one of n_steps and n_episodes must be not None. Defaults to None.
n_episodes – Number of episodes to take between training sessions. If set to None, n_steps will be used instead. Exactly one of n_steps and n_episodes must be not None. Defaults to None.
- n_steps
Minimum number of steps to take between training sessions.
- Type:
int
- n_episodes
Number of episodes to take between training sessions.
- Type:
int
- episode_length_counter
Length of the currently running episode.
- Type:
int
- steps_counter
Number of steps stored inside the buffer.
- Type:
int
- episode_counter
Number of full episodes stored inside the buffer.
- Type:
int
- states
List containing observed states.
- Type:
list
- actions
List containing actions taken.
- Type:
list
- actions_logits
List containing logits of actions taken.
- Type:
list
- rewards
List of obtained rewards.
- Type:
list
- rewards_to_go
List of discounted rewards. Note that it is only calculated right before the training and is cleared afterwards.
- Type:
list
- episode_lengths
List containing the lengths of buffered episodes.
- Type:
list
- calculate_rewards_to_go(gamma: float) None
Calculates the discounted rewards for each buffered episode.
- Parameters:
gamma – Discount factor.
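Conceptually, this is a single backwards pass over each episode's rewards. A minimal sketch of the recurrence (not the buffer's actual code):
>>> def rewards_to_go(rewards, gamma):
>>>     # G_t = r_t + gamma * G_{t+1}, accumulated from the final step backwards
>>>     out = [0.0] * len(rewards)
>>>     running = 0.0
>>>     for t in reversed(range(len(rewards))):
>>>         running = rewards[t] + gamma * running
>>>         out[t] = running
>>>     return out
>>>
>>> rewards_to_go([1.0, 0.0, 2.0], gamma=0.9)  # -> [2.62, 1.8, 2.0] (up to rounding)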
- get_tensors(device: Literal['cpu', 'cuda']) Tuple[FloatTensor, FloatTensor, FloatTensor, FloatTensor]
Converts the buffered states, actions, action logits, and discounted rewards into tensors on the target device.
- Parameters:
device – Target computation device
- Returns:
A 4-element tuple containing states, actions, action logits, and discounted rewards, in that order, converted to tensors.
- reset() None
Clears the buffer and resets it to the initial state.
- update(state: Any, action: Any, action_logit: float, reward: float, is_terminal: bool) bool
Updates the buffer with the provided transition attributes.
- Parameters:
state – Observed state of the environment.
action – Action taken by the agent.
action_logit – Logit of the action taken by the agent.
reward – Reward obtained by the agent.
is_terminal – Whether the resulting new state is terminal.
- Returns:
Whether the buffer is full and the current episode is terminated.
- get_action(state: Any, legal_mask: ndarray[Any, dtype[int32]] | None = None, greedy: bool = False) float | int
Selects an action based on the current state.
- Parameters:
state – The current state representation used to make the action selection decision.
legal_mask – A binary mask indicating the legality of actions. If provided, restricts the agent’s choices to legal actions. Note that currently PPOAgent does not support legal masks.
greedy – A boolean flag indicating whether to force a greedy action selection.
- Returns:
The selected action.
- classmethod load(path: str) PPOAgent
Loads the state of the agent from the specified file path.
- Parameters:
path – Path to a file from which to load the agent state.
- Returns:
A loaded instance of PPOAgent.
- reset_exploration(value)
Resets the exploration parameter to the specified value.
Note
PPOAgent currently does not provide an implementation for this method.
- save(path: str) str
Saves the state of the agent to the specified file.
- Parameters:
path – Path to a file to which the agent state will be saved.
- Returns:
An absolute path to the saved file.
- update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool) None
Updates the PPOAgent by saving the provided transition into its buffer. If the buffer is full it will also perform training on the actor and critic networks and clear the buffer.
- Parameters:
state – Current state of the environment.
action – Action taken in the current state.
reward – Reward received after taking the action.
new_state – Next state of the environment after taking the action. Note that PPOAgent does not actually use this value when updating.
is_terminal – A flag indicating whether the new state is a terminal state.
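For comparison with the LearningTask-driven example above, a manual interaction loop might look like the sketch below. This is hypothetical and assumes the same environment API as in the DQNAgent example:
>>> state = env.reset()
>>> done = False
>>> while not done:
>>>     action = agent.get_action(state)
>>>     new_state, reward, done = env.step(action)
>>>     # The transition is buffered; actor/critic training runs automatically
>>>     # once n_steps/n_episodes worth of data has been collected.
>>>     agent.update(state, action, reward, new_state, done)
>>>     state = new_state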
- update_exploration()
Updates the exploration parameter.
Note
PPOAgent currently does not provide an implementation for this method.
- class academia.agents.QLAgent(n_actions, alpha=0.1, gamma=0.99, epsilon=1, epsilon_decay=0.999, min_epsilon=0.01, random_state: int | None = None)
Bases:
TabularAgent
QLAgent class implements a Q-learning algorithm for tabular environments.
This agent learns to make decisions in an environment with discrete states and actions by maintaining a Q-table, which represents the quality of taking a certain action in a specific state.
- Parameters:
n_actions – Number of possible actions in the environment.
alpha – Learning rate. Defaults to 0.1.
gamma – Discount factor. Defaults to 0.99.
epsilon – Exploration-exploitation trade-off parameter. Defaults to 1.
epsilon_decay – Decay rate for epsilon. Defaults to 0.999.
min_epsilon – Minimum value for epsilon during exploration. Defaults to 0.01.
random_state – Seed for the random number generator. Defaults to
None
.
- Raises:
ValueError – If the given state is not supported.
- epsilon
Exploration-exploitation trade-off parameter.
- Type:
float
- min_epsilon
Minimum value for epsilon during exploration.
- Type:
float
- epsilon_decay
Decay rate for epsilon.
- Type:
float
- n_actions
Number of possible actions in the environment.
- Type:
int
- gamma
Discount factor.
- Type:
float
- alpha
Learning rate.
- Type:
float
- q_table
Q-table for the agent.
- Type:
dict
- update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool)
Updates the Q-value for the given state-action pair based on the observed reward and new state, according to the update rule of the Q-learning algorithm.
- Parameters:
state – Current state in the environment.
action – Action taken in the current state.
reward – Reward received after taking the action.
new_state – New state observed after taking the action.
is_terminal – Whether the new state is a terminal state or not.
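As an illustration, the rule can be sketched as follows, assuming a q_table that maps states to per-action value arrays (a simplification of the agent's internals):
>>> import numpy as np
>>>
>>> def q_learning_update(q_table, state, action, reward, new_state, is_terminal, alpha, gamma, n_actions):
>>>     for s in (state, new_state):
>>>         q_table.setdefault(s, np.zeros(n_actions))
>>>     # Off-policy target: bootstrap from the best action in the new state.
>>>     target = reward if is_terminal else reward + gamma * np.max(q_table[new_state])
>>>     q_table[state][action] += alpha * (target - q_table[state][action])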
- class academia.agents.SarsaAgent(n_actions, alpha=0.1, gamma=0.99, epsilon=1, epsilon_decay=0.999, min_epsilon=0.01, random_state: int | None = None)
Bases:
TabularAgent
SarsaAgent class implements a SARSA (State-Action-Reward-State-Action) learning algorithm for tabular environments.
This agent learns to make decisions in an environment with discrete states and actions by maintaining a Q-table, which represents the quality of taking a certain action in a specific state. SARSA updates its Q-values based on the current action and the action actually taken in the next state.
- Parameters:
n_actions – Number of possible actions in the environment.
alpha – Learning rate. Defaults to 0.1.
gamma – Discount factor. Defaults to 0.99.
epsilon – Exploration-exploitation trade-off parameter. Defaults to 1.
epsilon_decay – Decay rate for epsilon. Defaults to 0.999.
min_epsilon – Minimum value for epsilon during exploration. Defaults to 0.01.
random_state – Seed for the random number generator. Defaults to
None
.
- Raises:
ValueError – If the given state is not supported.
- epsilon
Exploration-exploitation trade-off parameter.
- Type:
float
- min_epsilon
Minimum value for epsilon during exploration.
- Type:
float
- epsilon_decay
Decay rate for epsilon.
- Type:
float
- n_actions
Number of possible actions in the environment.
- Type:
int
- gamma
Discount factor.
- Type:
float
- alpha
Learning rate.
- Type:
float
- q_table
Q-table for the agent.
- Type:
dict
- update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool)
Updates the Q-value for the given state-action pair based on the observed reward, new state, and the action taken in the new state.
- Parameters:
state – Current state in the environment.
action – Action taken in the current state.
reward – Reward received after taking the action.
new_state – New state observed after taking the action.
is_terminal – Whether the new state is a terminal state or not.
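The on-policy SARSA target differs from Q-learning only in bootstrapping from the action actually taken next. A sketch under the same simplifying assumptions as the Q-learning sketch above (next_action is shown as an explicit argument for clarity, whereas the agent derives it from its own policy):
>>> import numpy as np
>>>
>>> def sarsa_update(q_table, state, action, reward, new_state, next_action, is_terminal, alpha, gamma, n_actions):
>>>     for s in (state, new_state):
>>>         q_table.setdefault(s, np.zeros(n_actions))
>>>     # On-policy target: bootstrap from the action actually taken in the new state.
>>>     target = reward if is_terminal else reward + gamma * q_table[new_state][next_action]
>>>     q_table[state][action] += alpha * (target - q_table[state][action])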