Implementing Multi-Armed Bandits: A Beginner’s Hands-on Guide
Rajesh
Multi-armed bandits are one of the most approachable ways to understand the exploration-exploitation dilemma, while still being powerful enough for real decision-making systems. This guide walks through implementing multi-armed bandits in Python step by step, with concrete explanations and a balanced look at their advantages and disadvantages, aimed at readers who are new to the field.

What Are Multi-Armed Bandits?
Imagine you are at a casino facing several slot machines, also known as one-armed bandits. Each machine pays out with a probability you do not know. Your task is to figure out which machine yields the highest payout over time. This scenario of slot machines with unknown payout probabilities is the classic setup of the multi-armed bandit (MAB) problem.
Multi-armed bandits model problems in which a limited set of resources must be allocated among competing choices to maximize expected reward. The core difficulty is balancing two competing goals:
- Exploration: trying different actions to gather more information.
- Exploitation: using the information gathered so far to maximize reward.
Real-Life Applications of Multi-Armed Bandits
- Online ad optimization
- Personalized recommendations (e.g., Netflix, Amazon)
- A/B testing
- Clinical trials
- Dynamic pricing strategies
Problem Setup
Below, we use Python to simulate a multi-armed bandit problem. The environment consists of k slot machines (arms), each with an unknown reward distribution.
import numpy as np
import matplotlib.pyplot as plt

class Bandit:
    def __init__(self, k):
        self.k = k
        self.means = np.random.normal(0, 1, k)  # true mean reward of each arm

    def pull(self, action):
        return np.random.normal(self.means[action], 1)  # noisy reward for the chosen arm
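As a quick sanity check, you can create an environment and pull a few arms by hand before running any learning algorithm. This snippet is illustrative only; the variable demo_bandit is not used later in the guide.
demo_bandit = Bandit(5)      # five arms with randomly drawn true means
print(demo_bandit.means)     # hidden in a real problem; shown here only for inspection
print(demo_bandit.pull(0))   # one noisy reward sample from arm 0
print(demo_bandit.pull(0))   # a second pull of the same arm returns a different value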
Epsilon-Greedy Strategy
The epsilon-greedy method is the simplest strategy for balancing exploration and exploitation. With probability epsilon the algorithm explores by choosing a random action; otherwise it exploits by choosing the action with the highest estimated value.
def epsilon_greedy(Q, N, epsilon):
    # N (the action counts) is not used here, but keeping it in the signature
    # makes it easy to swap in count-based strategies such as UCB later
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))  # explore: pick a random arm
    else:
        return np.argmax(Q)               # exploit: pick the arm with the highest estimate
Implementing the Simulation
We now bring everything together to simulate a multi-armed bandit scenario.
k = 10
steps = 1000
epsilon = 0.1

bandit = Bandit(k)
Q = np.zeros(k)   # estimated rewards
N = np.zeros(k)   # action counts
rewards = []

for step in range(steps):
    action = epsilon_greedy(Q, N, epsilon)
    reward = bandit.pull(action)
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]  # incremental mean update
    rewards.append(reward)

plt.plot(np.cumsum(rewards) / (np.arange(steps) + 1))
plt.xlabel('Steps')
plt.ylabel('Average Reward')
plt.title('Epsilon-Greedy Bandit Performance')
plt.show()
Visualizing Performance
The chart shows the running average of the rewards obtained. With a well-tuned epsilon, the curve rises steadily as the agent learns which arm pays best.
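The average-reward curve only shows indirectly how often the agent chooses the truly best arm. If you want to track that directly, a small variation of the loop above works; the names optimal_action and optimal_picks are illustrative additions, not part of the original code, and the snippet reads bandit.means, which is only available because we built the simulator ourselves.
optimal_action = np.argmax(bandit.means)   # the truly best arm (known only in simulation)
optimal_picks = []
Q = np.zeros(k)
N = np.zeros(k)
for step in range(steps):
    action = epsilon_greedy(Q, N, epsilon)
    reward = bandit.pull(action)
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]
    optimal_picks.append(action == optimal_action)

plt.plot(np.cumsum(optimal_picks) / (np.arange(steps) + 1))
plt.xlabel('Steps')
plt.ylabel('Fraction of Optimal Actions')
plt.show()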
Advantages of Multi-Armed Bandits
1. Simple and Intuitive
Multi-armed bandit algorithms are built on simple logic and a clear analogy with slot machines. They remain understandable to readers without a deep background in statistics or machine learning, and strategies such as ε-greedy and UCB (Upper Confidence Bound) can be implemented in only a few lines of code.
2. Efficient
Unlike supervised learning models, MABs do not require extensive training on labeled datasets; they learn from continuously streaming data. Because they balance exploration and exploitation on the fly, without needing to collect a dataset up front, they are highly efficient when time and data are limited.
3. Real-Time Learning
Bandits update their decisions as each new piece of feedback arrives. This adaptivity lets them react immediately in domains where user behavior or system performance changes quickly (web traffic, stock trading, dynamic pricing, and more).
4. Reduced Cost Compared to A/B Testing
In a traditional A/B test, every variant receives equal traffic, which wastes visitors on underperforming versions. Bandits shift traffic toward better-performing arms over time, reducing the cost of experimentation while improving the user experience.
5. Versatile Applications
The framework is applicable across various domains:
- Healthcare: Adaptive clinical trials
- Marketing: Personalized ad or email content selection
- E-commerce: Dynamic product recommendations
- Finance: Portfolio selection strategies
Disadvantages of Multi-Armed Bandits

1. No Context
Basic bandit algorithms such as ε-greedy and UCB cannot use contextual information like user demographics, time of day, or device type, which makes effective personalization difficult. Contextual bandits or full reinforcement learning methods address this, but at the cost of extra complexity (a minimal sketch follows).
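To make the contrast concrete, here is a minimal sketch of what adding a discrete context might look like: one estimate table per context. The context labels are hypothetical, and real contextual bandit algorithms (such as LinUCB) use feature vectors rather than a simple lookup, so treat this only as an illustration of the idea.
contexts = ['mobile', 'desktop']                 # hypothetical discrete contexts
Q_ctx = {c: np.zeros(k) for c in contexts}       # separate value estimates per context
N_ctx = {c: np.zeros(k) for c in contexts}       # separate action counts per context

def contextual_epsilon_greedy(context, epsilon=0.1):
    Q = Q_ctx[context]
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))         # explore within this context
    return np.argmax(Q)                          # exploit this context's estimates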
2. Fixed Horizon Dependency
The effectiveness of a MAB depends heavily on how many decision opportunities are available within the time horizon; with too few rounds, the algorithm may not have time to learn which arm is best.
3. Sensitive to Hyperparameters
The ε-greedy strategy requires tuning its exploration rate, epsilon. A poorly chosen value leads to one of two failure modes:
- Too much exploration, which wastes opportunities on arms already known to be weak,
- Too little exploration, which causes the policy to fixate prematurely on an underperforming arm.
Finding the right balance usually takes both experimentation and knowledge of the specific domain; a simple way to run that experimentation is sketched below.
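One option is to replay the same bandit problem with several epsilon values and compare the learning curves. The run_bandit helper below is an illustrative wrapper around the loop shown earlier, not part of the original code, and in practice you would average each curve over many independent runs before drawing conclusions.
def run_bandit(bandit, epsilon, steps=1000):
    # Re-run the epsilon-greedy loop from earlier and return the running average reward
    Q, N = np.zeros(bandit.k), np.zeros(bandit.k)
    rewards = []
    for _ in range(steps):
        action = epsilon_greedy(Q, N, epsilon)
        reward = bandit.pull(action)
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]
        rewards.append(reward)
    return np.cumsum(rewards) / (np.arange(steps) + 1)

test_bandit = Bandit(10)
for eps in (0.01, 0.1, 0.5):
    plt.plot(run_bandit(test_bandit, eps), label=f'epsilon = {eps}')
plt.xlabel('Steps')
plt.ylabel('Average Reward')
plt.legend()
plt.show()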
4. Limited Exploration in Low Epsilon
When ε is small (or decays over time), the algorithm explores rarely and can lock onto a suboptimal arm while neglecting better ones. In noisy settings, especially when early feedback is misleading, it may never discover the true best action.
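A common mitigation is to start with a high exploration rate and shrink it as the estimates become more reliable. The schedule below is just one possible choice, and its constants (start, min_epsilon, decay) are arbitrary values you would need to tune.
def decayed_epsilon(step, start=1.0, min_epsilon=0.01, decay=0.005):
    # Exponentially decaying exploration rate, floored at min_epsilon
    return max(min_epsilon, start * np.exp(-decay * step))

# Inside the simulation loop, you would then call, for example:
# action = epsilon_greedy(Q, N, decayed_epsilon(step))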
5. Scalability Issues
As the number of arms grows large (for example, many thousands of ads, products, or page layouts), basic bandit algorithms become less efficient and take much longer to converge.
Advanced Strategies (Brief Overview)
- Softmax (Boltzmann): selects actions probabilistically, with probabilities given by a softmax over the estimated values, so higher estimates are chosen more often.
- Upper Confidence Bound (UCB): adds an uncertainty bonus to each estimate and acts optimistically, preferring actions it knows least about (a minimal sketch appears after this list).
- Thompson Sampling: uses Bayesian inference, sampling from a posterior over each arm's reward to drive exploration.
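As a taste of how these strategies look in code, here is a minimal UCB1-style selection rule. The exploration constant c is a tunable assumption, unvisited arms are pulled once before the bonus formula applies, and the sketch is meant as a starting point rather than a tuned implementation.
def ucb_select(Q, N, t, c=2.0):
    # Pull each arm at least once before applying the confidence bonus
    untried = np.where(N == 0)[0]
    if len(untried) > 0:
        return untried[0]
    bonus = c * np.sqrt(np.log(t + 1) / N)      # uncertainty bonus shrinks as counts grow
    return np.argmax(Q + bonus)                 # act optimistically under uncertainty
It can replace epsilon_greedy in the earlier simulation loop by calling ucb_select(Q, N, step).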
Conclusion
This guide showed how to build a simple epsilon-greedy bandit strategy in Python, how such methods balance exploration and exploitation, and where their strengths and weaknesses lie. Multi-armed bandits supply core ideas to more complex reinforcement learning models while remaining highly useful on their own, from ad optimization to experiment control, which makes them an ideal starting point for anyone interested in RL or its applications.
Author
Rajesh Yerremshetty is an IIT Roorkee MBA graduate with 10 years of experience in Data Analytics and AI. He has worked with leading organizations, including CarDekho.com, Vansun Media Tech Pvt. Ltd., and STRIKIN.com, driving innovative solutions and business growth through data-driven insights.