bnelearn.learner module

Implements multi-agent learning rules.

class bnelearn.learner.AESPGLearner(model: NeuralNetStrategy, environment: Environment, hyperparams: dict, optimizer_type: Type[Optimizer], optimizer_hyperparams: dict, strat_to_player_kwargs: Optional[dict] = None)[source]

Bases: GradientBasedLearner

Implements Deterministic Policy Gradients (Silver et al., 2014, http://proceedings.mlr.press/v32/silver14.pdf) with ES pseudo-gradients of dQ/da

class bnelearn.learner.DDPGLearner[source]

Bases: GradientBasedLearner

Implements Deep Deterministic Policy Gradients (Lillicrap et al., 2016)

http://arxiv.org/abs/1509.02971

class bnelearn.learner.DPGLearner[source]

Bases: GradientBasedLearner

Implements Deterministic Policy Gradients

http://proceedings.mlr.press/v32/silver14.pdf

by directly calculating dQ/da and da/dtheta
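For illustration, a minimal plain-torch sketch of this dQ/da, da/dtheta chain (the linear policy and critic below are toy stand-ins, not part of bnelearn):

    import torch

    policy = torch.nn.Linear(2, 1)                       # a = pi_theta(s)
    critic = torch.nn.Linear(3, 1)                       # Q(s, a)

    states = torch.rand(16, 2)
    actions = policy(states)
    q_values = critic(torch.cat([states, actions], dim=-1))

    # Ascending Q w.r.t. the policy parameters: autograd chains dQ/da with da/dtheta.
    actor_loss = -q_values.mean()
    actor_loss.backward()
    # policy.weight.grad / policy.bias.grad now hold the deterministic policy
    # gradient estimate; the critic's own gradients would be ignored for the
    # actor update.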

class bnelearn.learner.DummyNonLearner(model: Module, environment: Environment, hyperparams: dict, optimizer_type: Type[Optimizer], optimizer_hyperparams: dict, strat_to_player_kwargs: Optional[dict] = None)[source]

Bases: GradientBasedLearner

A learner that does nothing.

class bnelearn.learner.ESPGLearner(hyperparams: dict, **kwargs)[source]

Bases: GradientBasedLearner

Neural Self-Play with Evolutionary Strategy Pseudo-PG as proposed in Bichler et al. (2021).

Uses pseudo-policy gradients calculated as

((rewards - baseline) * epsilons).mean() / sigma**2

over a population of models perturbed by parameter noise epsilons, which yield the perturbed rewards.
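For illustration, a minimal self-contained sketch of this estimator in plain torch (the population size, sigma, and the quadratic toy utility below are made up for the example; this is not the bnelearn implementation):

    import torch

    sigma, population_size = 0.1, 64
    theta = torch.zeros(5)                       # current policy parameters

    def utility(params):                         # toy stand-in for the game utility
        return -(params - 1.0).pow(2).sum()

    baseline = utility(theta)                    # 'current_reward' baseline
    epsilons = sigma * torch.randn(population_size, theta.numel())
    rewards = torch.stack([utility(theta + eps) for eps in epsilons])

    # ES pseudo-gradient: population mean of (reward - baseline) * epsilon / sigma**2
    pseudo_gradient = ((rewards - baseline).unsqueeze(1) * epsilons).mean(dim=0) / sigma**2
    theta = theta + 1e-3 * pseudo_gradient       # plain gradient-ascent step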

Arguments:

model: bnelearn.bidder

environment: bnelearn.Environment

hyperparams: dict (an example dict follows this argument list)

(required:)

population_size: int

sigma: float

scale_sigma_by_model_size: bool

(optional:)

normalize_gradients: bool (default: False)

If True, rewards are scaled to N(0,1) in the weighted-noise update: (F - baseline).mean()/sigma/F.std(), resulting in an (approximately) normalized vector pointing in the same direction as the true gradient (the normalization requires a small enough sigma). If False or not provided, the true gradient is approximated using the current utility as a baseline for variance reduction.

baseline: ‘current_reward’, ‘mean_reward’, or a float

If ‘current_reward’, will use current utility before update as a baseline. If ‘mean_reward’, will use mean of candidate rewards.

For small perturbations, ‘mean_reward’ is cheaper to compute (one fewer game played) and gives slightly lower gradient sample variance, but it yields a biased estimate of the true gradient:

E[ES_grad with mean baseline] = (pop_size - 1) / pop_size * true_grad

If a float is given, that float is used as the baseline. Defaults to ‘current_reward’ if normalize_gradients is False, or to ‘mean_reward’ if normalize_gradients is True.

regularization: dict of

initial_strength: float, initial penalization factor of the bid value

regularize_decay: float, decay rate by which the regularization factor is multiplied each iteration

symmetric_sampling: bool

Whether or not to sample symmetric pairs of perturbed parameters, i.e. p + eps and p - eps.

optimizer_type: Type[torch.optim.Optimizer]

A class implementing torch’s optimizer interface, used for the parameter update step.

strat_to_player_kwargs: dict

Dict of arguments provided to the environment when evaluating the utility of the current and candidate strategies.
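Collected in one place, the hyperparameter keys documented above might be assembled like this (all values are purely illustrative, not recommended defaults):

    esp_hyperparams = {
        # required
        'population_size': 64,
        'sigma': 1.0,
        'scale_sigma_by_model_size': True,
        # optional
        'normalize_gradients': False,
        'baseline': 'current_reward',
        'regularization': {'initial_strength': 0.1, 'regularize_decay': 0.99},
        'symmetric_sampling': False,
    }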

class bnelearn.learner.GradientBasedLearner(model: Module, environment: Environment, optimizer_type: Type[Optimizer], optimizer_hyperparams: dict, scheduler_type: Optional[Type[_LRScheduler]] = None, scheduler_hyperparams: Optional[dict] = None, strat_to_player_kwargs: Optional[dict] = None, smooth_market: bool = False, log_gradient_variance: bool = False)[source]

Bases: Learner

A learning rule that is based on computing some version of a (pseudo-)gradient and then applying an SGD-like update via a torch.optim.Optimizer.

update_strategy(closure: Optional[Callable] = None) Tensor[source]

Performs one model-update to the player’s strategy.

Params:
closure: (optional) Callable that recomputes model loss.

Required by some optimizers such as LBFGS. When given, optimizer.step() (and thus this function) returns the last evaluated loss (usually evaluated BEFORE the model update). For correct usage see: https://pytorch.org/docs/stable/optim.html#optimizer-step-closure

Returns: None or loss evaluated by closure. (See above.)
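For illustration, the closure pattern from the linked torch documentation in self-contained form (plain torch, independent of bnelearn; the linear model and data below are made up for the example):

    import torch

    model = torch.nn.Linear(1, 1)
    optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)
    inputs = torch.rand(32, 1)
    targets = 2.0 * inputs

    def closure():
        # re-evaluates the loss so that LBFGS can call it several times per step
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        return loss

    last_loss = optimizer.step(closure)   # returns the loss evaluated by the closure

update_strategy passes such a closure on to optimizer.step() in the same way, which is why it then returns the loss evaluated by the closure.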

update_strategy_and_evaluate_utility(closure=None)[source]

Updates the model and returns the utility after the update.

class bnelearn.learner.Learner[source]

Bases: ABC

A learning rule used to update a player’s policy in self-play.

abstract update_strategy() None[source]

Updates the player’s strategy.

abstract update_strategy_and_evaluate_utility(closure=None) Tensor[source]

Updates the model and returns the utility after the update.
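A hypothetical minimal subclass, only to illustrate the abstract interface (the no-op update and the constant utility below are placeholders, not a real learning rule):

    import torch
    from bnelearn.learner import Learner

    class NoOpLearner(Learner):
        """Keeps the model unchanged; only demonstrates the required methods."""

        def __init__(self, model: torch.nn.Module):
            self.model = model

        def update_strategy(self) -> None:
            pass  # a real learner would modify self.model's parameters here

        def update_strategy_and_evaluate_utility(self, closure=None) -> torch.Tensor:
            self.update_strategy()
            return torch.zeros(1)  # a real learner would return the achieved utility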

class bnelearn.learner.PGLearner(hyperparams: dict, **kwargs)[source]

Bases: GradientBasedLearner

Neural Self-Play with directly computed Policy Gradients.

class bnelearn.learner.PSOLearner(model: Module, environment: Environment, hyperparams: dict, optimizer_type: Type[Optimizer], optimizer_hyperparams: dict, scheduler_type: Optional[Type[_LRScheduler]] = None, scheduler_hyperparams: Optional[dict] = None, strat_to_player_kwargs: Optional[dict] = None, smooth_market: bool = False, log_gradient_variance: bool = False)[source]

Bases: Learner

Implements the Particle Swarm Optimization algorithm as a Learner. Each particle represents a possible solution, i.e. one set of model parameters. In every update step the particles move one step through the search space to sample a new solution point. They are guided by their own previously best found solution (personal best position) and by the best solution found by the entire swarm (best position). A plain-torch sketch of this update follows the argument list below.

NOTE: dim = number of parameters in the model to be optimized.

Arguments:

model: bnelearn.bidder

environment: bnelearn.Environment

hyperparams: dict

(required:)

swarm_size: int

Number of particles in the swarm

topology: str

Defines the communication network of the swarm. If ‘global’, particles are drawn to the global best position of the swarm.

Neighborhood size = swarm size

If ‘ring’, particles are drawn to the best position in their neighborhood.

Particles form a neighborhood based on their position in the population array. The first and last particles are connected to form a ring structure. Neighborhood size = 3. E.g., neighborhood of particle i: particle i-1, particle i, particle i+1

If ‘von_neumann’, particles are drawn to the best position in their neighborhood.

Particles form a neighborhood based on their position in the population matrix. A particle is connected to its left, right, upper and lower neighbor in the matrix. Neighborhood size = 5

max_velocity: float

Max step size in each direction during one update step. If velocity_clamping == False, it is only used for initialization.

(optional:)

The default values for the inertia weight and the cognition & social ratios are commonly used values that perform well for most problem settings, based on Clerc, M., & Kennedy, J. (2002).

inertia_weight: float, List, Tuple (default: 0.792)

Scales the impact of the old velocity on the new one. If a float, the value is used as a constant. If a List or Tuple of length 2, the first value is taken as w_max and the second as w_min for a linearly decreasing inertia weight. NOTE: the maximum number of iterations is hardcoded to 2000.

cognition_ratio: float (default: 1.49445)

Upper limit for the impact of the personal best solution on the velocity

social_ratio: float (default: 1.49445)

Upper limit for the impact of the swarm’s best solution on the velocity

reeval_frequency: int (default: None)

Number of epochs after which the personal and overall bests are reevaluated to prevent false memory introduced by varying batch data

decrease_fitness: List or Tuple (default: None)

The two evaporation constants are used to reduce the remembered fitness of the bests to prevent false memory introduced by varying batch data. With length == 2, the first value is taken as the evaporation constant for the personal best and the second as the evaporation constant for the global (neighborhood) best. NOTE: use either ‘reeval_frequency’ or ‘decrease_fitness’, not both.

pretrain_deviation: float (default: 0)

If pretrain_deviation > 0, the positions are initialized as model.parameters + N(mean=0.0, std=pretrain_deviation); otherwise the positions are initialized randomly over the whole search space.

bound_handling: bool (default: False)

If True, clamps the particles’ positions in each dim to the interval [-max_position, max_position].

velocity_clamping: bool (default: True)

If True, clamps the particles’ velocities in each dim to the interval [-max_velocity, max_velocity] before adding them to the positions.

optimizer_type: Type[torch.optim.Optimizer]

A class implementing torch’s optimizer interface, used for the parameter update step. PSO does not need a torch optimizer to compute a parameter update step; it is currently only used to provide a consistent interface with the other learners.

optimizer_hyperparams: dict

strat_to_player_kwargs: dict

Dict of arguments provided to the environment when evaluating the utility of the current and candidate strategies.
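As a plain-torch illustration of the update rule described above (global topology, constant inertia weight, velocity clamping; the toy fitness function and all sizes below are made up for the example and this is not the bnelearn implementation):

    import torch

    swarm_size, dim = 32, 10
    inertia_weight, cognition_ratio, social_ratio = 0.792, 1.49445, 1.49445
    max_velocity = 0.5

    def fitness(positions):                      # toy stand-in for the utility to maximize
        return -(positions - 1.0).pow(2).sum(dim=1)

    positions = torch.rand(swarm_size, dim) * 2 - 1
    velocities = (torch.rand(swarm_size, dim) * 2 - 1) * max_velocity
    personal_best = positions.clone()
    personal_best_fitness = fitness(positions)

    for _ in range(100):
        # global topology: every particle is drawn to the swarm-wide best position
        best_position = personal_best[personal_best_fitness.argmax()]

        r1, r2 = torch.rand(swarm_size, dim), torch.rand(swarm_size, dim)
        velocities = (inertia_weight * velocities
                      + cognition_ratio * r1 * (personal_best - positions)
                      + social_ratio * r2 * (best_position - positions))
        velocities = velocities.clamp(-max_velocity, max_velocity)   # velocity_clamping
        positions = positions + velocities

        # update personal bests where the new position improved the fitness
        new_fitness = fitness(positions)
        improved = new_fitness > personal_best_fitness
        personal_best[improved] = positions[improved]
        personal_best_fitness[improved] = new_fitness[improved]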

update_strategy()[source]

Updates the player’s strategy.

update_strategy_and_evaluate_utility()[source]

Updates the model and returns the utility after the update.

class bnelearn.learner.ReinforceLearner(hyperparams: Optional[dict] = None, **kwargs)[source]

Bases: GradientBasedLearner

REINFORCE Learner, also known as the score-function estimator.

See https://link.springer.com/article/10.1007/BF00992696 and https://pytorch.org/docs/stable/distributions.html.
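For illustration, the score-function estimator in its simplest plain-torch form (the one-parameter Gaussian policy and the quadratic toy reward are made up for the example; see the torch.distributions documentation linked above for the general pattern):

    import torch
    from torch.distributions import Normal

    mean = torch.tensor(0.0, requires_grad=True)  # policy parameter
    policy = Normal(mean, 1.0)

    action = policy.sample()                      # sampling blocks the pathwise gradient
    reward = -(action - 3.0).pow(2)               # toy reward, maximal at action == 3

    # REINFORCE surrogate: grad E[reward] is estimated by reward * grad log pi(action)
    loss = -(reward.detach() * policy.log_prob(action))
    loss.backward()                               # mean.grad now holds the estimate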