bnelearn.learner module
Implements multi-agent learning rules.
- class bnelearn.learner.AESPGLearner(model: NeuralNetStrategy, environment: Environment, hyperparams: dict, optimizer_type: Type[Optimizer], optimizer_hyperparams: dict, strat_to_player_kwargs: Optional[dict] = None)
Bases: GradientBasedLearner
Implements Deterministic Policy Gradients (http://proceedings.mlr.press/v32/silver14.pdf) with ES pseudo-gradients of dQ/da.
- class bnelearn.learner.DDPGLearner
Bases: GradientBasedLearner
Implements Deep Deterministic Policy Gradients (Lillicrap et al., 2016).
- class bnelearn.learner.DPGLearner
Bases: GradientBasedLearner
Implements Deterministic Policy Gradients (http://proceedings.mlr.press/v32/silver14.pdf) by directly calculating dQ/da and da/dtheta.
- class bnelearn.learner.DummyNonLearner(model: Module, environment: Environment, hyperparams: dict, optimizer_type: Type[Optimizer], optimizer_hyperparams: dict, strat_to_player_kwargs: Optional[dict] = None)
Bases: GradientBasedLearner
A learner that does nothing.
- class bnelearn.learner.ESPGLearner(hyperparams: dict, **kwargs)
Bases: GradientBasedLearner
Neural Self-Play with Evolutionary Strategy Pseudo-PG as proposed in Bichler et al. (2021).
Uses pseudo-policy gradients calculated as
(rewards - baseline).mean() * epsilons / sigma**2
over a population of models perturbed by parameter noise epsilon, yielding perturbed rewards.
- Arguments:
model: bnelearn.bidder
environment: bnelearn.Environment
hyperparams: dict
- (required:)
population_size: int
sigma: float
scale_sigma_by_model_size: bool
- (optional:)
- normalize_gradients: bool (default: False)
If true, will scale rewards to N(0,1) in the weighted-noise update: (F - baseline).mean()/sigma/F.std(), resulting in an (approximately) normalized vector pointing in the same direction as the true gradient. (Normalization requires a small enough sigma!) If false or not provided, will approximate the true gradient using the current utility as a baseline for variance reduction.
- baseline: (‘current_reward’, ‘mean_reward’ or a float.)
If ‘current_reward’, will use the current utility before the update as the baseline. If ‘mean_reward’, will use the mean of the candidate rewards.
For small perturbations, ‘mean_reward’ is cheaper to compute (one fewer game played) and yields slightly lower gradient sample variance, but it gives a biased estimate of the true gradient:
Expect(ES_grad with mean) = (pop_size - 1) / pop_size * true_grad
If a float is given, will use that float as the baseline. Defaults to ‘current_reward’ if normalize_gradients is False, or to ‘mean_reward’ if normalize_gradients is True.
- regularization: dict of
initial_strength: float, initial penalization factor of bid value
regularize_decay: float, decay rate by which the regularization factor is multiplied each iteration.
- symmetric_sampling: bool
Whether or not to sample symmetric pairs of perturbed parameters, e.g. p + eps and p - eps.
- optimizer_type: Type[torch.optim.Optimizer]
A class implementing torch’s optimizer interface used for parameter update step.
- strat_to_player_kwargs: dict
dict of arguments provided to environment used for evaluating utility of current and candidate strategies.
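The estimator described for ESPGLearner can be illustrated with a minimal, dependency-free sketch. It is not the bnelearn implementation: the function `es_pseudo_gradient`, the toy utility, and the choice to draw epsilon with standard deviation sigma are assumptions made for illustration only. It averages (reward - baseline) * epsilon over the population and scales by 1 / sigma**2, and it supports the three baseline options listed above.

```python
import random


def es_pseudo_gradient(utility, params, population_size, sigma,
                       baseline="current_reward", seed=0):
    """Estimate the gradient of `utility` at `params` from random perturbations."""
    rng = random.Random(seed)
    # One Gaussian noise vector (epsilon, std = sigma) per population member.
    epsilons = [[rng.gauss(0.0, sigma) for _ in params]
                for _ in range(population_size)]
    # Reward of each perturbed parameter vector.
    rewards = [utility([p + e for p, e in zip(params, eps)])
               for eps in epsilons]

    if baseline == "current_reward":
        b = utility(params)                 # costs one extra evaluation
    elif baseline == "mean_reward":
        b = sum(rewards) / population_size  # cheaper, slightly biased
    else:
        b = float(baseline)                 # user-supplied constant

    # Population mean of (reward - baseline) * epsilon, scaled by 1 / sigma**2.
    return [
        sum((r - b) * eps[j] for r, eps in zip(rewards, epsilons))
        / (population_size * sigma ** 2)
        for j in range(len(params))
    ]


def toy_utility(theta):
    # Hypothetical concave utility with its maximum at (1, -2).
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 2.0) ** 2


grad = es_pseudo_gradient(toy_utility, [0.0, 0.0],
                          population_size=10000, sigma=0.1)
print(grad)  # roughly the analytic gradient [2.0, -4.0]
```

With the ‘mean_reward’ baseline the same call skips the extra utility evaluation at the unperturbed parameters, at the cost of the small bias stated above.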
- class bnelearn.learner.GradientBasedLearner(model: Module, environment: Environment, optimizer_type: Type[Optimizer], optimizer_hyperparams: dict, scheduler_type: Optional[Type[_LRScheduler]] = None, scheduler_hyperparams: Optional[dict] = None, strat_to_player_kwargs: Optional[dict] = None, smooth_market: bool = False, log_gradient_variance: bool = False)
Bases: Learner
A learning rule that is based on computing some version of a (pseudo-)gradient, then applying an SGD-like update via a torch.optim.Optimizer.
- update_strategy(closure: Optional[Callable] = None) -> Tensor
Performs one model update to the player’s strategy.
- Params:
- closure: (optional) Callable that recomputes model loss.
Required by some optimizers such as LBFGS. When given, optimizer.step() (and thus this function) returns the last evaluated loss (usually evaluated BEFORE the model update). For correct usage see: https://pytorch.org/docs/stable/optim.html#optimizer-step-closure
Returns: None or the loss evaluated by closure (see above).
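The closure contract mentioned above can be sketched with plain PyTorch. This example does not use bnelearn: the linear `model`, the toy data, and the learning rate are hypothetical stand-ins, and the closure is passed directly to `optimizer.step()` rather than through `update_strategy`.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for a strategy model: a single linear layer.
model = torch.nn.Linear(1, 1)
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)

# Toy regression data for the line y = 2x + 1.
inputs = torch.tensor([[0.0], [1.0], [2.0]])
targets = 2.0 * inputs + 1.0


def closure():
    # Clear old gradients, recompute the loss, backpropagate, return the loss.
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    return loss


initial_loss = closure().item()
returned = optimizer.step(closure)  # LBFGS may call closure several times
final_loss = closure().item()
print(initial_loss, returned.item(), final_loss)
```

Optimizers such as SGD or Adam do not need the closure, in which case `step()` returns None.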
- class bnelearn.learner.Learner
Bases: ABC
A learning rule used to update a player’s policy in self-play.
- class bnelearn.learner.PGLearner(hyperparams: dict, **kwargs)
Bases: GradientBasedLearner
Neural Self-Play with directly computed Policy Gradients.
- class bnelearn.learner.PSOLearner(model: Module, environment: Environment, hyperparams: dict, optimizer_type: Type[Optimizer], optimizer_hyperparams: dict, scheduler_type: Optional[Type[_LRScheduler]] = None, scheduler_hyperparams: Optional[dict] = None, strat_to_player_kwargs: Optional[dict] = None, smooth_market: bool = False, log_gradient_variance: bool = False)
Bases: Learner
Implements the Particle Swarm Optimization algorithm as a Learner. Particles represent possible solutions for the model parameters. In every update step they move one step in the search space to sample a new solution point. They are guided by their own best solution found so far (personal best position) and by the best solution found by the entire swarm (best position).
NOTE: dim = number of parameters in the model to be optimized.
- Arguments:
model: bnelearn.bidder
environment: bnelearn.Environment
hyperparams: dict
- (required:)
- swarm_size: int
Number of particles in the swarm
- topology: str
Defines the communication network of the swarm.
- If ‘global’, particles are drawn to the global best position of the swarm.
Neighborhood size = swarm size
- If ‘ring’, particles are drawn to the best position in their neighborhood.
Particles form a neighborhood based on their position in the population array. The first and last particles are connected to form a ring structure. Neighborhood size = 3. E.g., neighborhood of particle i: particle i-1, particle i, particle i+1
- If ‘von_neumann’, particles are drawn to the best position in their neighborhood.
Particles form a neighborhood based on their position in the population matrix. A particle is connected to its left, right, upper and lower neighbor in the matrix. Neighborhood size = 5
- max_velocity: float
Max step size in each direction during one update step. If velocity_clamping == False, it is only used for initialization.
- (optional:)
The default values for the inertia weight and the cognition and social ratios are commonly used values that perform well for most problem settings. Based on: Clerc, M., & Kennedy, J. (2002).
- inertia_weight: float, List, Tuple (default: 0.792)
Scales the impact of the old velocity on the new one. If a float, the value is used as a constant. If a List or Tuple with length == 2, the first value is taken as w_max and the second as w_min for a linearly decreasing inertia weight. !!! The max number of iterations is hardcoded to 2000 !!!
- cognition_ratio: float (default: 1.49445)
Upper limit for the impact of the personal best solution on the velocity
- social_ratio: float (default: 1.49445)
Upper limit for the impact of the swarm’s best solution on the velocity
- reeval_frequency: int (default: None)
Number of epochs after which the personal and overall bests are reevaluated to prevent false memory introduced by varying batch data
- decrease_fitness: List or Tuple (default None)
The two evaporation constants are used to reduce the remembered fitness of the bests to prevent false memory introduced by varying batch data. A List or Tuple with length == 2: the first value is the evaporation constant for the personal best and the second is the evaporation constant for the global (neighborhood) best. !!! Use either ‘reeval_frequency’ or ‘decrease_fitness’ !!!
- pretrain_deviation: float (default: 0)
If pretrain_deviation > 0, the positions will be initialized as model.parameters + N(mean=0.0, std=pretrain_deviation); otherwise the positions will be initialized randomly over the whole search space.
- bound_handling: bool (default: False)
If true, will clamp the particles’ positions in each dim to the interval [-max_position, max_position].
- velocity_clamping: bool (default: True)
If true, will clamp the particles’ velocities in each dim to the interval [-max_velocity, max_velocity] before adding them to the positions.
- optimizer_type: Type[torch.optim.Optimizer]
A class implementing torch’s optimizer interface used for the parameter update step. PSO does not need a torch optimizer to compute a parameter update step; it is currently only used to keep a consistent interface with other learners.
- optimizer_hyperparams: dict
- strat_to_player_kwargs: dict
Dict of arguments provided to environment used for evaluating utility of current and candidate strategies.
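The velocity and position update behind the hyperparameters above can be sketched in plain Python. This is a generic textbook PSO with the ‘global’ topology and velocity clamping, not the PSOLearner code: the function `pso_minimize`, the `search_bound` parameter, and the toy loss are hypothetical, and the sketch minimizes a loss rather than evaluating strategies in an environment.

```python
import random


def pso_minimize(loss, dim, swarm_size=20, steps=200, max_velocity=0.5,
                 inertia_weight=0.792, cognition_ratio=1.49445,
                 social_ratio=1.49445, search_bound=5.0, seed=0):
    """Minimize `loss` with a global-topology swarm and velocity clamping."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-search_bound, search_bound) for _ in range(dim)]
                 for _ in range(swarm_size)]
    velocities = [[rng.uniform(-max_velocity, max_velocity) for _ in range(dim)]
                  for _ in range(swarm_size)]

    # Personal bests start at the initial positions.
    best_pos = [p[:] for p in positions]
    best_fit = [loss(p) for p in positions]
    g = min(range(swarm_size), key=lambda i: best_fit[i])
    swarm_pos, swarm_fit = best_pos[g][:], best_fit[g]

    for _ in range(steps):
        for i in range(swarm_size):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Old velocity, pull to personal best, pull to swarm best.
                v = (inertia_weight * velocities[i][d]
                     + cognition_ratio * r1 * (best_pos[i][d] - positions[i][d])
                     + social_ratio * r2 * (swarm_pos[d] - positions[i][d]))
                # Velocity clamping to [-max_velocity, max_velocity].
                v = max(-max_velocity, min(max_velocity, v))
                velocities[i][d] = v
                positions[i][d] += v
            fit = loss(positions[i])
            if fit < best_fit[i]:
                best_pos[i], best_fit[i] = positions[i][:], fit
                if fit < swarm_fit:
                    swarm_pos, swarm_fit = positions[i][:], fit
    return swarm_pos, swarm_fit


def sphere(x):
    # Hypothetical toy loss with its minimum of 0 at the origin.
    return sum(xi * xi for xi in x)


best_position, best_loss = pso_minimize(sphere, dim=2)
print(best_position, best_loss)
```

A ‘ring’ or ‘von_neumann’ topology would replace the single swarm best with the best position inside each particle's neighborhood.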
- class bnelearn.learner.ReinforceLearner(hyperparams: Optional[dict] = None, **kwargs)
Bases: GradientBasedLearner
REINFORCE Learner, also known as the score function estimator.
See https://link.springer.com/article/10.1007/BF00992696 and https://pytorch.org/docs/stable/distributions.html.
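The score function estimator can be illustrated with a small standalone sketch. It is not the ReinforceLearner implementation: it assumes a one-dimensional Gaussian policy with a fixed standard deviation, a hypothetical toy reward, and a mean-reward baseline, and it computes the score analytically instead of through torch.distributions.

```python
import random


def reinforce_gradient(reward, mu, sigma, num_samples, seed=0):
    """Score-function estimate of d/d(mu) E[reward(a)] with a ~ N(mu, sigma**2)."""
    rng = random.Random(seed)
    actions = [rng.gauss(mu, sigma) for _ in range(num_samples)]
    rewards = [reward(a) for a in actions]
    # Mean reward as a baseline to reduce variance.
    baseline = sum(rewards) / num_samples
    # Score of a Gaussian: d/d(mu) log N(a; mu, sigma**2) = (a - mu) / sigma**2
    scores = [(a - mu) / sigma ** 2 for a in actions]
    return sum((r - baseline) * s for r, s in zip(rewards, scores)) / num_samples


def toy_reward(action):
    # Hypothetical reward, maximal at action = 3.
    return -(action - 3.0) ** 2


estimate = reinforce_gradient(toy_reward, mu=0.0, sigma=1.0, num_samples=50000)
print(estimate)  # roughly 6.0, the analytic value of -2 * (mu - 3) at mu = 0
```

The estimator only needs sampled rewards and the log-probability gradient of the policy, so the reward itself does not have to be differentiable.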