academia.agents module

Module contents

This module contains implementations of reinforcement learning algorithms, both tabular and neural-network-based.

Exported classes:

  • DQNAgent

  • PPOAgent

  • QLAgent

  • SarsaAgent

Note

DQNAgent and PPOAgent must be provided with neural network architectures when they are initialized. These architectures should be subclasses of torch.nn.Module. Example architectures can be found in academia.utils.models.
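
For instance, a minimal custom architecture might look like the sketch below. It is illustrative only: the class name CustomMLP, the input size of 10, the hidden width of 64, and the output size of 4 are assumptions that must be adapted to your environment's observation shape and the agent's n_actions. A parameterless constructor is assumed here, since the agents receive the architecture as a class and instantiate it themselves.

>>> from torch import nn
>>> from academia.agents import DQNAgent
>>>
>>> class CustomMLP(nn.Module):
>>>     """Hypothetical architecture: 10 observation features, 4 actions."""
>>>
>>>     def __init__(self):
>>>         super().__init__()
>>>         self.layers = nn.Sequential(
>>>             nn.Linear(10, 64),   # observation size -> hidden layer
>>>             nn.ReLU(),
>>>             nn.Linear(64, 4),    # hidden layer -> one Q-value per action
>>>         )
>>>
>>>     def forward(self, x):
>>>         return self.layers(x)
>>>
>>> agent = DQNAgent(nn_architecture=CustomMLP, n_actions=4)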

class academia.agents.DQNAgent(nn_architecture: Type[Module], n_actions: int, gamma: float = 0.99, epsilon: float = 1.0, epsilon_decay: float = 0.995, min_epsilon: float = 0.01, batch_size: int = 64, random_state: int | None = None, replay_memory_size: int = 100000, lr: float = 0.0005, tau: float = 0.001, update_every: int = 3, device: Literal['cpu', 'cuda'] = 'cpu')

Bases: EpsilonGreedyAgent

Class representing a Deep Q-Network (DQN) agent for reinforcement learning tasks.

The DQNAgent class implements the Deep Q-Network (DQN) algorithm for reinforcement learning tasks. It uses a neural network to approximate the Q-values of actions in a given environment. The agent learns from experiences stored in a replay memory and updates its Q-value estimates during training episodes. A target network is soft-updated to stabilize training.

Parameters:
  • nn_architecture – Type of neural network architecture to be used.

  • n_actions – Number of possible actions in the environment.

  • gamma – Discount factor for future rewards. Defaults to 0.99.

  • epsilon – Initial exploration-exploitation trade-off parameter. Defaults to 1.0.

  • epsilon_decay – Decay factor for epsilon over time. Defaults to 0.995.

  • min_epsilon – Minimum epsilon value to ensure exploration. Defaults to 0.01.

  • batch_size – Size of the mini-batch used for training. Defaults to 64.

  • random_state – Seed for random number generation. Defaults to None.

  • replay_memory_size – Maximum size of the replay memory. Defaults to 100000.

  • lr – Learning rate for the optimizer. Defaults to 0.0005.

  • tau – Interpolation parameter for target network soft updates. Defaults to 0.001.

  • update_every – Frequency of network updates. Defaults to 3.

  • device – Device to use for training. Defaults to 'cpu'.

nn_architecture

Type of neural network architecture to be used.

Type:

Type[nn.Module]

epsilon

Exploration-exploitation trade-off parameter.

Type:

float

min_epsilon

Minimum value for epsilon during exploration.

Type:

float

epsilon_decay

Decay rate for epsilon.

Type:

float

n_actions

Number of possible actions in the environment.

Type:

int

gamma

Discount factor.

Type:

float

memory

Replay memory used to store experiences for training.

Type:

deque

batch_size

Size of the mini-batch used for training.

Type:

int

network

Neural network used to approximate Q-values.

Type:

nn.Module

target_network

Target network used to stabilize training.

Type:

nn.Module

optimizer

Optimizer used for training.

Type:

optim.Optimizer

experience

Named tuple representing a single experience, storing state, action, reward, new_state, and done.

Type:

namedtuple

train_step

Counter for the number of training steps performed.

Type:

int

replay_memory_size

Maximum size of the replay memory.

Type:

int

lr

Learning rate for the optimizer.

Type:

float

tau

Interpolation parameter for target network soft updates.

Type:

float

update_every

Frequency of network updates.

Type:

int

device

Device used for training.

Type:

Literal['cpu', 'cuda']

Examples

>>> from academia.agents import DQNAgent
>>> from academia.environments import DoorKey
>>> # Import custom neural network architecture
>>> from academia.utils.models import door_key
>>>
>>> # Create an environment:
>>> env = DoorKey(difficulty=0, append_step_count=True)
>>> # Create an instance of the DQNAgent class with
>>> # custom neural network architecture
>>> dqn_agent = DQNAgent(
>>>     nn_architecture=door_key.MLPStepDQN,
>>>     n_actions=DoorKey.N_ACTIONS,
>>>     gamma=0.99,
>>>     epsilon=1.0,
>>>     epsilon_decay=0.99,
>>>     min_epsilon=0.01,
>>>     batch_size=64,
>>> )
>>> # Training loop: Update the agent using experiences
>>> # (state, action, reward, new_state, done)
>>> num_episodes = 100
>>> for episode in range(num_episodes):
>>>    state = env.reset()
>>>    done = False
>>>    while not done:
>>>        action = dqn_agent.get_action(state)
>>>        new_state, reward, terminated = env.step(action)
>>>        if terminated:
>>>            done = True
>>>        dqn_agent.update(state, action, reward, new_state, done)
>>>        state = new_state
>>>
>>> # Save the agent's state dictionary to a file
>>> dqn_agent.save('dqn_agent')
>>>
>>> # Load the agent's state dictionary from a file
>>> dqn_agent = DQNAgent.load('dqn_agent')

Note

  • Ensure that the custom neural network architecture passed to the constructor inherits from torch.nn.Module and is appropriate for the task.

  • The agent's exploration-exploitation strategy is based on the epsilon-greedy method.

  • The __soft_update_target() method updates the target network weights from the main network's weights according to the rule target_weights = tau * main_weights + (1 - tau) * target_weights, where tau << 1 (see the sketch after this list).

  • It is recommended to adjust hyperparameters such as gamma, epsilon, epsilon_decay, and batch_size based on the specific task and environment.
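
A minimal sketch of such a soft update in PyTorch is shown below. It is illustrative only: the agent performs this update internally, and the function and parameter names here are hypothetical.

>>> import torch
>>>
>>> def soft_update(network, target_network, tau=0.001):
>>>     # target_weights = tau * main_weights + (1 - tau) * target_weights
>>>     with torch.no_grad():
>>>         for main_p, target_p in zip(network.parameters(),
>>>                                     target_network.parameters()):
>>>             target_p.copy_(tau * main_p + (1.0 - tau) * target_p)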

get_action(state: Any, legal_mask: ndarray[Any, dtype[int32]] | None = None, greedy: bool = False) → int

Selects an action based on the current state using the epsilon-greedy strategy.

Parameters:
  • state – The current state representation used to make the action selection decision.

  • legal_mask – A binary mask indicating the legality of actions. If provided, restricts the agent’s choices to legal actions.

  • greedy – A boolean flag indicating whether to force a greedy action selection. If True, the function always chooses the action with the highest Q-value, ignoring exploration.

Returns:

The index of the selected action.
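
A hedged usage sketch, assuming a dqn_agent and state as in the example above, and a hypothetical four-action environment in which only actions 0 and 2 are currently legal:

>>> import numpy as np
>>>
>>> legal_mask = np.array([1, 0, 1, 0], dtype=np.int32)
>>> action = dqn_agent.get_action(state, legal_mask=legal_mask)
>>> # For evaluation, disable exploration and always pick the highest Q-value:
>>> best_action = dqn_agent.get_action(state, greedy=True)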

classmethod load(path: str) → DQNAgent

Loads the state dictionary of the neural network model, target network model and agent parameters from the specified file.

Parameters:

path – Path to a file from which to load the model’s state dictionary.

Returns:

A loaded instance of DQNAgent.

save(path: str) → str

Saves the state dictionary of the neural network model, target network model and agent parameters to the specified file path.

Parameters:

path – Path to a file (including filename and extension) where the model’s state dictionary will be saved.

Returns:

An absolute path to the saved file.

update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool)

Updates the DQN network weights to better estimate Q-values of every action.

Parameters:
  • state – Current state of the environment.

  • action – Action taken in the current state.

  • reward – Reward received after taking the action.

  • new_state – Next state of the environment after taking the action.

  • is_terminal – A flag indicating whether the new state is a terminal state.
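
Conceptually, the update regresses the network's Q(state, action) towards a temporal-difference target computed with the target network. The sketch below shows this for a single transition; it is a simplified assumption of the mechanism, since the agent itself works on mini-batches sampled from the replay memory, and the helper name td_target is hypothetical.

>>> import torch
>>>
>>> def td_target(reward, new_state, is_terminal, target_network, gamma=0.99):
>>>     # Terminal transitions do not bootstrap; others use the target network
>>>     if is_terminal:
>>>         return torch.tensor(reward)
>>>     with torch.no_grad():
>>>         q_next = target_network(new_state)   # Q-values for every action
>>>     return reward + gamma * q_next.max()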

class academia.agents.PPOAgent(actor_architecture: Type[Module], critic_architecture: Type[Module], n_actions: int, discrete: bool = True, batch_size: int = 64, n_epochs: int = 5, n_steps: int | None = None, n_episodes: int | None = 10, clip: float = 0.2, lr: float = 0.0003, covariance_fill: float = 0.5, entropy_coefficient: float = 0.01, gamma: float = 0.99, random_state: int | None = None, device: Literal['cpu', 'cuda'] = 'cpu')

Bases: Agent

Class representing a Proximal Policy Optimization (PPO) agent for reinforcement learning tasks. See the original PPO paper: https://arxiv.org/pdf/1707.06347.pdf

Parameters:
  • actor_architecture – Type of neural network architecture to be used for the actor.

  • critic_architecture – Type of neural network architecture to be used for the critic.

  • n_actions – Number of possible actions in the environment.

  • discrete – Whether the agent’s action space is discrete. Defaults to True.

  • batch_size – The size of the minibatch used during training. Defaults to 64.

  • n_epochs – Number of epochs per training. Defaults to 5.

  • n_steps – Minimum number of steps to take between training sessions. Note that if this minimum is reached during an episode, the episode will still run to completion and the remaining steps will be included in the buffer. If set to None, n_episodes will be used instead. Exactly one of n_steps and n_episodes must be not None (see the example after this parameter list). Defaults to None.

  • n_episodes – Number of episodes to take between training sessions. If set to None, n_steps will be used instead. Exactly one of n_steps and n_episodes must be not None. Defaults to 10.

  • clip – Clip rate hyperparameter from the PPO algorithm. Defaults to 0.2.

  • lr – Learning rate used by (Adam) optimizers. The same value is used for both actor and critic. Defaults to 3e-4.

  • covariance_fill – Value on the diagonal in the covariance matrix used to randomly sample continuous actions when discrete is False. Defaults to 0.5.

  • entropy_coefficient – Coefficient used to control the impact of entropy on the loss function. Defaults to 0.01.

  • gamma – Discount factor for future rewards. Defaults to 0.99.

  • random_state – Seed for random number generation. Defaults to None.

  • device – Device used for computation. Possible values are 'cuda' and 'cpu'. Defaults to 'cpu'.
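
For example, the training frequency can be configured in one of two mutually exclusive ways, as in the sketch below (assuming the lava_crossing example architectures from academia.utils.models; the value 2048 is arbitrary):

>>> from academia.agents import PPOAgent
>>> from academia.utils.models import lava_crossing
>>>
>>> # Train after every 10 full episodes (the default):
>>> agent = PPOAgent(
>>>     actor_architecture=lava_crossing.MLPActor,
>>>     critic_architecture=lava_crossing.MLPCritic,
>>>     n_actions=3,
>>> )
>>> # Or train once at least 2048 steps have been gathered
>>> # (n_episodes must then be explicitly set to None):
>>> agent = PPOAgent(
>>>     actor_architecture=lava_crossing.MLPActor,
>>>     critic_architecture=lava_crossing.MLPCritic,
>>>     n_actions=3,
>>>     n_steps=2048,
>>>     n_episodes=None,
>>> )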

n_actions

Number of possible actions in the environment.

Type:

int

gamma

Discount factor for future rewards.

Type:

float

discrete

Whether the agent’s action space is discrete.

Type:

bool

clip

Clip rate hyperparameter from the PPO algorithm.

Type:

float

lr

Learning rate used by (Adam) optimizers.

Type:

float

entropy_coefficient

Coefficient used to control the impact of entropy on the loss function.

Type:

float

batch_size

The size of the minibatch used during training.

Type:

int

n_epochs

Number of epochs per training.

Type:

int

device

Device used for computation.

Type:

Literal['cpu', 'cuda']

buffer

Trajectory buffer. This object contains all transitions gathered since the last training session.

Type:

PPOAgent.PPOBuffer

actor

Actor neural network.

Type:

nn.Module

critic

Critic neural network.

Type:

nn.Module

actor_architecture

Type of neural network architecture to be used for the actor.

Type:

Type[nn.Module]

critic_architecture

Type of neural network architecture to be used for the critic.

Type:

Type[nn.Module]

Examples

>>> from academia.agents import PPOAgent
>>> from academia.environments import LavaCrossing
>>> from academia.curriculum import LearningTask
>>> from academia.utils.models import lava_crossing
>>>
>>> task = LearningTask(
>>>     LavaCrossing,
>>>     env_args={'difficulty': 0},
>>>     stop_conditions={'max_episodes': 100}
>>> )
>>> agent = PPOAgent(
>>>     actor_architecture=lava_crossing.MLPActor,
>>>     critic_architecture=lava_crossing.MLPCritic,
>>>     n_actions=3
>>> )
>>> task.run(agent)
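
Alternatively, the agent can be trained with a manual loop analogous to the DQNAgent example above. This is a sketch under the assumption that LavaCrossing can be constructed directly with difficulty=0 and that env.step returns (new_state, reward, terminated), as in that earlier example; the number of episodes is arbitrary.

>>> env = LavaCrossing(difficulty=0)
>>> for episode in range(100):
>>>     state = env.reset()
>>>     done = False
>>>     while not done:
>>>         action = agent.get_action(state)
>>>         new_state, reward, done = env.step(action)
>>>         # PPOAgent buffers the transition and trains once its buffer is full
>>>         agent.update(state, action, reward, new_state, done)
>>>         state = new_state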

class PPOBuffer(n_steps: int | None = None, n_episodes: int | None = None)

Bases: object

Class representing the trajectory buffer of a PPOAgent.

Parameters:
  • n_steps – Minimum number of steps to take between training sessions. Note that if this minimum is reached during an episode, the episode will still run to completion and the remaining steps will be included in the buffer. If set to None, n_episodes will be used instead. Exactly one of n_steps and n_episodes must be not None. Defaults to None.

  • n_episodes – Number of episodes to take between training sessions. If set to None, n_steps will be used instead. Exactly one of n_steps and n_episodes must be not None. Defaults to None.

n_steps

Minimum number of steps to take between training sessions.

Type:

int

n_episodes

Number of episodes to take between training sessions.

Type:

int

episode_length_counter

Length of the currently running episode.

Type:

int

steps_counter

Number of steps stored inside the buffer.

Type:

int

episode_counter

Number of full episodes stored inside the buffer.

Type:

int

states

List containing observed states.

Type:

list

actions

List containing actions taken.

Type:

list

actions_logits

List containing logits of actions taken.

Type:

list

rewards

List of obtained rewards.

Type:

list

rewards_to_go

List of discounted rewards. Note that it is only calculated right before training and is cleared afterwards.

Type:

list

episode_lengths

List containing the lengths of buffered episodes.

Type:

list

calculate_rewards_to_go(gamma: float) → None

Calculates the discounted rewards for each buffered episode.

Parameters:

gamma – Discount factor
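
For reference, the discounted rewards-to-go of a single episode can be computed as in the sketch below (illustrative only; the buffer performs this computation internally for every buffered episode, and the function name is hypothetical):

>>> def rewards_to_go(rewards, gamma):
>>>     # Backward pass: G_t = r_t + gamma * G_{t+1}
>>>     discounted = []
>>>     running = 0.0
>>>     for r in reversed(rewards):
>>>         running = r + gamma * running
>>>         discounted.insert(0, running)
>>>     return discounted
>>>
>>> rewards_to_go([1.0, 0.0, 2.0], gamma=0.99)  # approximately [2.96, 1.98, 2.0]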

get_tensors(device: Literal['cpu', 'cuda']) → Tuple[FloatTensor, FloatTensor, FloatTensor, FloatTensor]

Converts the buffered states, actions, action logits, and discounted rewards to tensors on the target computation device.

Parameters:

device – Target computation device

Returns:

A 4-element tuple containing states, actions, action logits, and discounted rewards (in that order), converted to tensors.

reset() → None

Clears the buffer and resets it to the initial state.

update(state: Any, action: Any, action_logit: float, reward: float, is_terminal: bool) → bool

Updates the buffer with the provided transition attributes.

Parameters:
  • state – Observed state of the environment.

  • action – Action taken by the agent.

  • action_logit – Logit of the action taken by the agent.

  • reward – Reward obtained by the agent.

  • is_terminal – Whether the resulting new state is terminal.

Returns:

Whether the buffer is full and the current episode is terminated.

get_action(state: Any, legal_mask: ndarray[Any, dtype[int32]] | None = None, greedy: bool = False) → float | int

Selects an action based on the current state.

Parameters:
  • state – The current state representation used to make the action selection decision.

  • legal_mask – A binary mask indicating the legality of actions. If provided, restricts the agent’s choices to legal actions. Note that currently PPOAgent does not support legal masks.

  • greedy – A boolean flag indicating whether to force a greedy action selection.

Returns:

The selected action.

classmethod load(path: str) → PPOAgent

Loads the state of the agent from the specified file path.

Parameters:

path – Path to a file from which to load the agent state.

Returns:

A loaded instance of PPOAgent.

reset_exploration(value)

Resets the exploration parameter to the specified value.

Note

PPOAgent currently does not provide an implementation for this method.

save(path: str) → str

Saves the state of the agent to the specified file.

Parameters:

path – Path to a file to which the agent state will be saved.

Returns:

An absolute path to the saved file.

update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool) → None

Updates the PPOAgent by saving the provided transition into its buffer. If the buffer is full, it will also train the actor and critic networks and clear the buffer.

Parameters:
  • state – Current state of the environment.

  • action – Action taken in the current state.

  • reward – Reward received after taking the action.

  • new_state – Next state of the environment after taking the action. Note that PPOAgent does not actually use this value when updating.

  • is_terminal – A flag indicating whether the new state is a terminal state.

update_exploration()

Updates the exploration parameter.

Note

PPOAgent currently does not provide an implementation for this method.

class academia.agents.QLAgent(n_actions, alpha=0.1, gamma=0.99, epsilon=1, epsilon_decay=0.999, min_epsilon=0.01, random_state: int | None = None)

Bases: TabularAgent

QLAgent class implements a Q-learning algorithm for tabular environments.

This agent learns to make decisions in an environment with discrete states and actions by maintaining a Q-table, which represents the quality of taking a certain action in a specific state.

Parameters:
  • n_actions – Number of possible actions in the environment.

  • alpha – Learning rate. Defaults to 0.1.

  • gamma – Discount factor. Defaults to 0.99.

  • epsilon – Exploration-exploitation trade-off parameter. Defaults to 1.

  • epsilon_decay – Decay rate for epsilon. Defaults to 0.999.

  • min_epsilon – Minimum value for epsilon during exploration. Defaults to 0.01.

  • random_state – Seed for the random number generator. Defaults to None.

Raises:

ValueError – If the given state is not supported.

epsilon

Exploration-exploitation trade-off parameter.

Type:

float

min_epsilon

Minimum value for epsilon during exploration.

Type:

float

epsilon_decay

Decay rate for epsilon.

Type:

float

n_actions

Number of possible actions in the environment.

Type:

int

gamma

Discount factor.

Type:

float

alpha

Learning rate.

Type:

float

q_table

Q-table for the agent.

Type:

dict

update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool)

Updates the Q-value for the given state-action pair based on the observed reward and new state, following the Q-learning update rule (sketched below).

Parameters:
  • state – Current state in the environment.

  • action – Action taken in the current state.

  • reward – Reward received after taking the action.

  • new_state – New state observed after taking the action.

  • is_terminal – Whether the new state is a terminal state or not.
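
A sketch of the tabular Q-learning update rule applied by this method, under the assumption that the Q-table maps each state to a per-action list of values; the function name is hypothetical:

>>> def q_learning_update(q_table, state, action, reward, new_state,
>>>                       is_terminal, alpha=0.1, gamma=0.99):
>>>     # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
>>>     best_next = 0.0 if is_terminal else max(q_table[new_state])
>>>     td_target = reward + gamma * best_next
>>>     q_table[state][action] += alpha * (td_target - q_table[state][action])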

class academia.agents.SarsaAgent(n_actions, alpha=0.1, gamma=0.99, epsilon=1, epsilon_decay=0.999, min_epsilon=0.01, random_state: int | None = None)

Bases: TabularAgent

SarsaAgent class implements a SARSA (State-Action-Reward-State-Action) learning algorithm for tabular environments.

This agent learns to make decisions in an environment with discrete states and actions by maintaining a Q-table, which represents the quality of taking a certain action in a specific state. SARSA updates its Q-values based on the current action and the action actually taken in the next state.

Parameters:
  • n_actions – Number of possible actions in the environment.

  • alpha – Learning rate. Defaults to 0.1.

  • gamma – Discount factor. Defaults to 0.99.

  • epsilon – Exploration-exploitation trade-off parameter. Defaults to 1.

  • epsilon_decay – Decay rate for epsilon. Defaults to 0.999.

  • min_epsilon – Minimum value for epsilon during exploration. Defaults to 0.01.

  • random_state – Seed for the random number generator. Defaults to None.

Raises:

ValueError – If the given state is not supported.

epsilon

Exploration-exploitation trade-off parameter.

Type:

float

min_epsilon

Minimum value for epsilon during exploration.

Type:

float

epsilon_decay

Decay rate for epsilon.

Type:

float

n_actions

Number of possible actions in the environment.

Type:

int

gamma

Discount factor.

Type:

float

alpha

Learning rate.

Type:

float

q_table

Q-table for the agent.

Type:

dict

update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool)

Updates the Q-value for the given state-action pair based on the observed reward, the new state, and the action taken in the new state, following the SARSA update rule (sketched below).

Parameters:
  • state – Current state in the environment.

  • action – Action taken in the current state.

  • reward – Reward received after taking the action.

  • new_state – New state observed after taking the action.

  • is_terminal – Whether the new state is a terminal state or not.
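
A sketch of the SARSA update rule applied by this method, under the same Q-table assumption as in the Q-learning sketch above. The argument next_action stands for the action the agent actually selects in new_state (the agent handles this internally; its documented update() does not take it), and the function name is hypothetical:

>>> def sarsa_update(q_table, state, action, reward, new_state, next_action,
>>>                  is_terminal, alpha=0.1, gamma=0.99):
>>>     # Q(s, a) <- Q(s, a) + alpha * (r + gamma * Q(s', a') - Q(s, a))
>>>     q_next = 0.0 if is_terminal else q_table[new_state][next_action]
>>>     td_target = reward + gamma * q_next
>>>     q_table[state][action] += alpha * (td_target - q_table[state][action])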