academia.agents module
Module contents
This module contains implementations of reinforcement learning algorithms, including tabular methods and those based on neural networks.
Exported classes:
- DQNAgent
- PPOAgent
- QLAgent
- SarsaAgent
Note
DQNAgent and PPOAgent need to be provided with network architectures when initializing. These network architectures should be subclasses of torch.nn.Module. Example architectures can be found in academia.utils.models.
- class academia.agents.DQNAgent(nn_architecture: Type[Module], n_actions: int, gamma: float = 0.99, epsilon: float = 1.0, epsilon_decay: float = 0.995, min_epsilon: float = 0.01, batch_size: int = 64, random_state: int | None = None, replay_memory_size: int = 100000, lr: float = 0.0005, tau: float = 0.001, update_every: int = 3, device: Literal['cpu', 'cuda'] = 'cpu')
Bases:
EpsilonGreedyAgent
Class representing a Deep Q-Network (DQN) agent for reinforcement learning tasks.
The DQNAgent class implements the Deep Q-Network (DQN) algorithm for reinforcement learning tasks. It uses a neural network to approximate the Q-values of actions in a given environment. The agent learns from experiences stored in a replay memory and performs updates to its Q-values during training episodes. The target network is soft updated to stabilize training.
- Parameters:
nn_architecture – Type of neural network architecture to be used.
n_actions – Number of possible actions in the environment.
gamma – Discount factor for future rewards. Defaults to 0.99.
epsilon – Initial exploration-exploitation trade-off parameter. Defaults to 1.0.
epsilon_decay – Decay factor for epsilon over time. Defaults to 0.995.
min_epsilon – Minimum epsilon value to ensure exploration. Defaults to 0.01.
batch_size – Size of the mini-batch used for training. Defaults to 64.
random_state – Seed for random number generation. Defaults to None.
replay_memory_size – Maximum size of the replay memory. Defaults to 100000.
lr – Learning rate for the optimizer. Defaults to 0.0005.
tau – Interpolation parameter for target network soft updates. Defaults to 0.001.
update_every – Frequency of network updates. Defaults to 3.
device – Device to use for training. Defaults to 'cpu'.
- nn_architecture
Type of neural network architecture to be used.
- Type:
Type[nn.Module]
- epsilon
Exploration-exploitation trade-off parameter.
- Type:
float
- min_epsilon
Minimum value for epsilon during exploration.
- Type:
float
- epsilon_decay
Decay rate for epsilon.
- Type:
float
- n_actions
Number of possible actions in the environment.
- Type:
int
- gamma
Discount factor.
- Type:
float
- memory
Replay memory used to store experiences for training.
- Type:
deque
- batch_size
Size of the mini-batch used for training.
- Type:
int
- network
Neural network used to approximate Q-values.
- Type:
nn.Module
- target_network
Target network used to stabilize training.
- Type:
nn.Module
- optimizer
Optimizer used for training.
- Type:
optim.Optimizer
- experience
Named tuple representing an experience tuple which stores state, action, reward, new_state, and done.
- Type:
namedtuple
- train_step
Counter for the number of training steps performed.
- Type:
int
- replay_memory_size
Maximum size of the replay memory.
- Type:
int
- lr
Learning rate for the optimizer.
- Type:
float
- tau
Interpolation parameter for target network soft updates.
- Type:
float
- update_every
Frequency of network updates.
- Type:
int
- device
Device used for training.
- Type:
Literal['cpu', 'cuda']
Examples
>>> from academia.agents import DQNAgent
>>> from academia.environments import DoorKey
>>> # Import custom neural network architecture
>>> from academia.utils.models import door_key
>>>
>>> # Create an environment:
>>> env = DoorKey(difficulty=0, append_step_count=True)
>>> # Create an instance of the DQNAgent class with
>>> # custom neural network architecture
>>> dqn_agent = DQNAgent(
>>>     nn_architecture=door_key.MLPStepDQN,
>>>     n_actions=DoorKey.N_ACTIONS,
>>>     gamma=0.99,
>>>     epsilon=1.0,
>>>     epsilon_decay=0.99,
>>>     min_epsilon=0.01,
>>>     batch_size=64,
>>> )
>>> # Training loop: Update the agent using experiences
>>> # (state, action, reward, new_state, done)
>>> num_episodes = 100
>>> for episode in range(num_episodes):
>>>     state = env.reset()
>>>     done = False
>>>     while not done:
>>>         action = dqn_agent.get_action(state)
>>>         new_state, reward, terminated = env.step(action)
>>>         if terminated:
>>>             done = True
>>>         dqn_agent.update(state, action, reward, new_state, done)
>>>         state = new_state
>>>
>>> # Save the agent's state dictionary to a file
>>> dqn_agent.save('dqn_agent')
>>>
>>> # Load the agent's state dictionary from a file
>>> dqn_agent = DQNAgent.load('dqn_agent')
Note
Ensure that the custom neural network architecture passed to the constructor inherits from torch.nn.Module and is appropriate for the task. The agent's exploration-exploitation strategy is based on the epsilon-greedy method.
The __soft_update_target() method updates the target network weights from the main network's weights using the rule target_weights = tau * main_weights + (1 - tau) * target_weights, where tau << 1.
It is recommended to adjust hyperparameters such as gamma, epsilon, epsilon_decay, and batch_size based on the specific task and environment.
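For illustration, the soft update described above can be sketched as follows. This is a minimal standalone sketch, not the agent's actual (private) implementation; it assumes both networks share the same architecture:
>>> import torch
>>>
>>> @torch.no_grad()
>>> def soft_update(network: torch.nn.Module, target_network: torch.nn.Module, tau: float) -> None:
>>>     # target_weights = tau * main_weights + (1 - tau) * target_weights, with tau << 1
>>>     for param, target_param in zip(network.parameters(), target_network.parameters()):
>>>         target_param.mul_(1.0 - tau).add_(tau * param)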
- get_action(state: Any, legal_mask: ndarray[Any, dtype[int32]] | None = None, greedy: bool = False) int
Selects an action based on the current state using the epsilon-greedy strategy.
- Parameters:
state – The current state representation used to make the action selection decision.
legal_mask – A binary mask indicating the legality of actions. If provided, restricts the agent’s choices to legal actions.
greedy – A boolean flag indicating whether to force a greedy action selection. If True, the function always chooses the action with the highest Q-value, ignoring exploration.
- Returns:
The index of the selected action.
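To make the strategy concrete, the selection logic can be sketched as below. This is a hypothetical helper; the actual method obtains the Q-values from the agent's network:
>>> import numpy as np
>>>
>>> def epsilon_greedy_action(q_values, epsilon, legal_mask=None, greedy=False, rng=None):
>>>     rng = np.random.default_rng() if rng is None else rng
>>>     if legal_mask is not None:
>>>         # Illegal actions can never be selected.
>>>         q_values = np.where(legal_mask == 1, q_values, -np.inf)
>>>     if greedy or rng.random() > epsilon:
>>>         # Exploit: pick the highest-valued (legal) action.
>>>         return int(np.argmax(q_values))
>>>     # Explore: sample uniformly among the (legal) actions.
>>>     candidates = np.flatnonzero(legal_mask == 1) if legal_mask is not None else np.arange(len(q_values))
>>>     return int(rng.choice(candidates))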
- classmethod load(path: str) DQNAgent
Loads the state dictionary of the neural network model, target network model and agent parameters from the specified file.
- Parameters:
path – Path to a file from which to load the model’s state dictionary.
- Returns:
A loaded instance of DQNAgent.
- save(path: str) str
Saves the state dictionary of the neural network model, target network model and agent parameters to the specified file path.
- Parameters:
path – Path to a file (including filename and extension) where the model’s state dictionary will be saved.
- Returns:
An absolute path to the saved file.
- update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool)
Updates the DQN network weights to better estimate Q-values of every action.
- Parameters:
state – Current state of the environment.
action – Action taken in the current state.
reward – Reward received after taking the action.
new_state – Next state of the environment after taking the action.
is_terminal – A flag indicating whether the new state is a terminal state.
- class academia.agents.PPOAgent(actor_architecture: Type[Module], critic_architecture: Type[Module], n_actions: int, discrete: bool = True, batch_size: int = 64, n_epochs: int = 5, n_steps: int | None = None, n_episodes: int | None = 10, clip: float = 0.2, lr: float = 0.0003, covariance_fill: float = 0.5, entropy_coefficient: float = 0.01, gamma: float = 0.99, random_state: int | None = None, device: Literal['cpu', 'cuda'] = 'cpu')
Bases:
Agent
Class representing a Proximal Policy Optimization (PPO) agent for reinforcement learning tasks. Paper on PPO: https://arxiv.org/pdf/1707.06347.pdf
- Parameters:
actor_architecture – Type of neural network architecture to be used for the actor.
critic_architecture – Type of neural network architecture to be used for the critic.
n_actions – Number of possible actions in the environment.
discrete – Whether the agent's action space is discrete. Defaults to True.
batch_size – The size of the minibatch used during training. Defaults to 64.
n_epochs – Number of epochs per training. Defaults to 5.
n_steps – Minimum number of steps to take between training sessions. Note that if the minimum is reached during an episode, the episode will still finish and the remaining steps will be included in the buffer. If set to None, n_episodes will be used instead. Exactly one of n_steps and n_episodes must be not None. Defaults to None.
n_episodes – Number of episodes to take between training sessions. If set to None, n_steps will be used instead. Exactly one of n_steps and n_episodes must be not None. Defaults to 10.
clip – Clip rate hyperparameter from the PPO algorithm. Defaults to 0.2.
lr – Learning rate used by the (Adam) optimizers. The same value is used for both actor and critic. Defaults to 3e-4.
covariance_fill – Value on the diagonal of the covariance matrix used to randomly sample continuous actions when discrete is False. Defaults to 0.5.
entropy_coefficient – Coefficient used to control the impact of entropy on the loss function. Defaults to 0.01.
gamma – Discount factor for future rewards. Defaults to 0.99.
random_state – Seed for random number generation. Defaults to None.
device – Device used for computation. Possible values are 'cuda' and 'cpu'. Defaults to 'cpu'.
- n_actions
Number of possible actions in the environment.
- Type:
int
- gamma
Discount factor for future rewards.
- Type:
float
- discrete
Whether the agent’s action space is discrete.
- Type:
bool
- clip
Clip rate hyperparameter from the PPO algorithm.
- Type:
float
- lr
Learning rate used by (Adam) optimizers.
- Type:
float
- entropy_coefficient
Coefficient used to control the impact of entropy on the loss function.
- Type:
float
- batch_size
The size of the minibatch used during training.
- Type:
int
- n_epochs
Number of epochs per training.
- Type:
int
- device
Device used for computation.
- Type:
Literal['cpu', 'cuda']
- buffer
Trajectory buffer. This object contains all transitions gathered since the last training session.
- Type:
PPOBuffer
- actor
Actor neural network.
- Type:
nn.Module
- critic
Critic neural network.
- Type:
nn.Module
- actor_architecture
Type of neural network architecture to be used for the actor.
- Type:
Type[nn.Module]
- critic_architecture
Type of neural network architecture to be used for the critic.
- Type:
Type[nn.Module]
Examples
>>> from academia.agents import PPOAgent
>>> from academia.environments import LavaCrossing
>>> from academia.curriculum import LearningTask
>>> from academia.utils.models import lava_crossing
>>>
>>> task = LearningTask(
>>>     LavaCrossing,
>>>     env_args={'difficulty': 0},
>>>     stop_conditions={'max_episodes': 100}
>>> )
>>> agent = PPOAgent(
>>>     actor_architecture=lava_crossing.MLPActor,
>>>     critic_architecture=lava_crossing.MLPCritic,
>>>     n_actions=3
>>> )
>>> task.run(agent)
Note
PPOAgent currently does not support legal masks.
PPOAgent currently does not provide implementations for the update_exploration() or reset_exploration() methods.
- class PPOBuffer(n_steps: int | None = None, n_episodes: int | None = None)
Bases:
object
Class representing the buffer of PPOAgent
- Parameters:
n_steps – Minimum number of steps to take between training sessions. Note that if the minimum is reached during an episode, the episode will still finish and the remaining steps will be included in the buffer. If set to None, n_episodes will be used instead. Exactly one of n_steps and n_episodes must be not None. Defaults to None.
n_episodes – Number of episodes to take between training sessions. If set to None, n_steps will be used instead. Exactly one of n_steps and n_episodes must be not None. Defaults to None.
- n_steps
Minimum number of steps to take between training sessions.
- Type:
int
- n_episodes
Number of episodes to take between training sessions.
- Type:
int
- episode_length_counter
Length of the currently running episode.
- Type:
int
- steps_counter
Number of steps stored inside the buffer.
- Type:
int
- episode_counter
Number of full episodes stored inside the buffer.
- Type:
int
- states
List containing observed states.
- Type:
list
- actions
List containing actions taken.
- Type:
list
- actions_logits
List containing logits of actions taken.
- Type:
list
- rewards
List of obtained rewards.
- Type:
list
- rewards_to_go
List of discounted rewards. Note that it is only calculated right before the training and is cleared afterwards.
- Type:
list
- episode_lengths
List containing the lengths of buffered episodes.
- Type:
list
- calculate_rewards_to_go(gamma: float) None
Calculates the discounted rewards for each buffered episode.
- Parameters:
gamma – Discount factor.
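Conceptually, this is a single backwards pass over each episode's rewards. A minimal sketch of the recurrence (not the buffer's actual code):
>>> def rewards_to_go(rewards, gamma):
>>>     # G_t = r_t + gamma * G_{t+1}, accumulated from the final step backwards
>>>     out = [0.0] * len(rewards)
>>>     running = 0.0
>>>     for t in reversed(range(len(rewards))):
>>>         running = rewards[t] + gamma * running
>>>         out[t] = running
>>>     return out
>>>
>>> rewards_to_go([1.0, 0.0, 2.0], gamma=0.9)  # -> [2.62, 1.8, 2.0] (up to rounding)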
- get_tensors(device: Literal['cpu', 'cuda']) Tuple[FloatTensor, FloatTensor, FloatTensor, FloatTensor]
Converts the buffered states, actions, action logits, and discounted rewards into tensors on the target device.
- Parameters:
device – Target computation device
- Returns:
A 4-element tuple containing states, actions, action logits, and discounted rewards, in that order, converted to tensors.
- reset() None
Clears the buffer and resets it to the initial state.
- update(state: Any, action: Any, action_logit: float, reward: float, is_terminal: bool) bool
Updates the buffer with the provided transition attributes.
- Parameters:
state – Observed state of the environment.
action – Action taken by the agent.
action_logit – Logit of the action taken by the agent.
reward – Reward obtained by the agent.
is_terminal – Whether the resulting new state is terminal.
- Returns:
Whether the buffer is full and the current episode is terminated.
- get_action(state: Any, legal_mask: ndarray[Any, dtype[int32]] | None = None, greedy: bool = False) float | int
Selects an action based on the current state.
- Parameters:
state – The current state representation used to make the action selection decision.
legal_mask – A binary mask indicating the legality of actions. If provided, restricts the agent’s choices to legal actions. Note that currently PPOAgent does not support legal masks.
greedy – A boolean flag indicating whether to force a greedy action selection.
- Returns:
The selected action.
- classmethod load(path: str) PPOAgent
Loads the state of the agent from the specified file path.
- Parameters:
path – Path to a file from which to load the agent state.
- Returns:
A loaded instance of PPOAgent.
- reset_exploration(value)
Resets the exploration parameter to the specified value.
Note
PPOAgent currently does not provide an implementation for this method.
- save(path: str) str
Saves the state of the agent to the specified file.
- Parameters:
path – Path to a file to which the agent state will be saved.
- Returns:
An absolute path to the saved file.
- update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool) None
Updates the PPOAgent by saving the provided transition into its buffer. If the buffer is full it will also perform training on the actor and critic networks and clear the buffer.
- Parameters:
state – Current state of the environment.
action – Action taken in the current state.
reward – Reward received after taking the action.
new_state – Next state of the environment after taking the action. Note that PPOAgent does not actually use this value when updating.
is_terminal – A flag indicating whether the new state is a terminal state.
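For comparison with the LearningTask-driven example above, a manual interaction loop might look like the sketch below. This is hypothetical and assumes the same environment API as in the DQNAgent example:
>>> state = env.reset()
>>> done = False
>>> while not done:
>>>     action = agent.get_action(state)
>>>     new_state, reward, done = env.step(action)
>>>     # The transition is buffered; actor/critic training runs automatically
>>>     # once n_steps/n_episodes worth of data has been collected.
>>>     agent.update(state, action, reward, new_state, done)
>>>     state = new_state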
- update_exploration()
Updates the exploration parameter.
Note
PPOAgent currently does not provide an implementation for this method.
- class academia.agents.QLAgent(n_actions, alpha=0.1, gamma=0.99, epsilon=1, epsilon_decay=0.999, min_epsilon=0.01, random_state: int | None = None)
Bases:
TabularAgent
QLAgent class implements a Q-learning algorithm for tabular environments.
This agent learns to make decisions in an environment with discrete states and actions by maintaining a Q-table, which represents the quality of taking a certain action in a specific state.
- Parameters:
n_actions – Number of possible actions in the environment.
alpha – Learning rate. Defaults to 0.1.
gamma – Discount factor. Defaults to 0.99.
epsilon – Exploration-exploitation trade-off parameter. Defaults to 1.
epsilon_decay – Decay rate for epsilon. Defaults to 0.999.
min_epsilon – Minimum value for epsilon during exploration. Defaults to 0.01.
random_state – Seed for the random number generator. Defaults to
None
.
- Raises:
ValueError – If the given state is not supported.
- epsilon
Exploration-exploitation trade-off parameter.
- Type:
float
- min_epsilon
Minimum value for epsilon during exploration.
- Type:
float
- epsilon_decay
Decay rate for epsilon.
- Type:
float
- n_actions
Number of possible actions in the environment.
- Type:
int
- gamma
Discount factor.
- Type:
float
- alpha
Learning rate.
- Type:
float
- q_table
Q-table for the agent.
- Type:
dict
- update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool)
Updates the Q-value for the given state-action pair based on the observed reward and new state, according to the update rule of the Q-learning algorithm.
- Parameters:
state – Current state in the environment.
action – Action taken in the current state.
reward – Reward received after taking the action.
new_state – New state observed after taking the action.
is_terminal – Whether the new state is a terminal state or not.
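As an illustration, the rule can be sketched as follows, assuming a q_table that maps states to per-action value arrays (a simplification of the agent's internals):
>>> import numpy as np
>>>
>>> def q_learning_update(q_table, state, action, reward, new_state, is_terminal, alpha, gamma, n_actions):
>>>     for s in (state, new_state):
>>>         q_table.setdefault(s, np.zeros(n_actions))
>>>     # Off-policy target: bootstrap from the best action in the new state.
>>>     target = reward if is_terminal else reward + gamma * np.max(q_table[new_state])
>>>     q_table[state][action] += alpha * (target - q_table[state][action])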
- class academia.agents.SarsaAgent(n_actions, alpha=0.1, gamma=0.99, epsilon=1, epsilon_decay=0.999, min_epsilon=0.01, random_state: int | None = None)
Bases:
TabularAgent
SarsaAgent class implements a SARSA (State-Action-Reward-State-Action) learning algorithm for tabular environments.
This agent learns to make decisions in an environment with discrete states and actions by maintaining a Q-table, which represents the quality of taking a certain action in a specific state. SARSA updates its Q-values based on the current action and the action actually taken in the next state.
- Parameters:
n_actions – Number of possible actions in the environment.
alpha – Learning rate. Defaults to 0.1.
gamma – Discount factor. Defaults to 0.99.
epsilon – Exploration-exploitation trade-off parameter. Defaults to 1.
epsilon_decay – Decay rate for epsilon. Defaults to 0.999.
min_epsilon – Minimum value for epsilon during exploration. Defaults to 0.01.
random_state – Seed for the random number generator. Defaults to
None
.
- Raises:
ValueError – If the given state is not supported.
- epsilon
Exploration-exploitation trade-off parameter.
- Type:
float
- min_epsilon
Minimum value for epsilon during exploration.
- Type:
float
- epsilon_decay
Decay rate for epsilon.
- Type:
float
- n_actions
Number of possible actions in the environment.
- Type:
int
- gamma
Discount factor.
- Type:
float
- alpha
Learning rate.
- Type:
float
- q_table
Q-table for the agent.
- Type:
dict
- update(state: Any, action: int, reward: float, new_state: Any, is_terminal: bool)
Updates the Q-value for the given state-action pair based on the observed reward, new state, and the action taken in the new state.
- Parameters:
state – Current state in the environment.
action – Action taken in the current state.
reward – Reward received after taking the action.
new_state – New state observed after taking the action.
is_terminal – Whether the new state is a terminal state or not.
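The on-policy SARSA target differs from Q-learning only in bootstrapping from the action actually taken next. A sketch under the same simplifying assumptions as the Q-learning sketch above (next_action is shown as an explicit argument for clarity, whereas the agent derives it from its own policy):
>>> import numpy as np
>>>
>>> def sarsa_update(q_table, state, action, reward, new_state, next_action, is_terminal, alpha, gamma, n_actions):
>>>     for s in (state, new_state):
>>>         q_table.setdefault(s, np.zeros(n_actions))
>>>     # On-policy target: bootstrap from the action actually taken in the new state.
>>>     target = reward if is_terminal else reward + gamma * q_table[new_state][next_action]
>>>     q_table[state][action] += alpha * (target - q_table[state][action])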