Fair Bayesian Optimization
==========================
Given the increasing importance of machine learning (ML) in our lives,
with a wide use of automated systems in several domains, such as
financial lending, hiring, criminal justice, and college admissions,
there has been a major concern for ML to unintentionally encode societal
biases and result in systematic discrimination. For example, a
classifier that is only tuned to maximize prediction accuracy may
unfairly predict a high credit risk for some subgroups of the population
applying for a loan.
In several real-world domains, accuracy is not the only objective of
interest. The model should simultaneously give guarantees on other
important aspects. Algorithmic fairness tries to find algorithms that
not only keep a high level of accuracy but also enforce a certain level
of fairness and avoid biases.
In this tutorial we are going to use the German Credit Data from the
`UCI Machine Learning Repository `__.
This dataset contains a binary classification task, where the goal is to
predict if a person has "good" or "bad" credit risk.
.. code:: python
# AutoGluon and HPO tools
import autogluon.core as ag
import pandas as pd
import numpy as np
import random
# Plots
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Fixing seed for reproducibility
SEED = 999
random.seed(SEED)
np.random.seed(SEED)
Dataset setup and visualization
-------------------------------
This code will automatically download the data from the UCI repository.
You can download the data manually from here: `UCI german lending
dataset
link. `__
.. code:: python
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data', header=None, sep=' ')
df.columns = ["CheckingAC_Status","MaturityMonths","CreditHistory","Purpose","LoanAmount","SavingsAC",
"Employment","InstalmentPctOfIncome","SexAndStatus","OtherDebts","PresentResidenceYears",
"Property","Age","OtherInstalmentPlans","Housing","NumExistingLoans","Job",
"Dependents","Telephone","ForeignWorker","Class1Good2Bad"]
We are now ready to preprocess the raw features as following:
.. code:: python
df["target"] = df["Class1Good2Bad"].replace([1, 2], [1, 0]).astype("category")
df = df.drop(columns=["Class1Good2Bad"])
df["CheckingAC_Status"] = (
df["CheckingAC_Status"]
.replace(["A11", "A12", "A13", "A14"], ["x < 0 DM", "0 <= x < 200 DM", "x >= 200DM", "no checking account"])
.astype("category")
)
df["CreditHistory"] = (
df["CreditHistory"]
.replace(
["A30", "A31", "A32", "A33", "A34"],
["no credits", "all credits paid", "existing credits paid", "delay", "critical accnt. / other credits"],
)
.astype("category")
)
df["Purpose"] = (
df["Purpose"]
.replace(
["A40", "A41", "A42", "A43", "A44", "A45", "A46", "A47", "A48", "A49", "A410"],
[
"new car",
"used car",
"forniture",
"radio/tv",
"appliances",
"repairs",
"education",
"vacation",
"retraining",
"business",
"others",
],
)
.astype("category")
)
df["SavingsAC"] = (
df["SavingsAC"]
.replace(
["A61", "A62", "A63", "A64", "A65"],
["x < 100 DM", "100 <= x < 500 DM", "500 <= x < 1000 DM", "x >= 1000 DM", "unknown"],
)
.astype("category")
)
df["Employment"] = (
df["Employment"]
.replace(
["A71", "A72", "A73", "A74", "A75"],
["unemployed", "x < 1 year", "1 <= x < 4 years", "4 <= x < 7 years", "x >= 7 years"],
)
.astype("category")
)
df["SexAndStatus"] = (
df["SexAndStatus"]
.replace(
["A91", "A92", "A93", "A94", "A95"],
[
"male divorced/separated",
"female divorced/separated/married",
"male single",
"male married/widowed",
"female single",
],
)
.astype("category")
)
df["OtherDebts"] = (
df["OtherDebts"].replace(["A101", "A102", "A103"], ["none", "co-applicant", "guarantor"]).astype("category")
)
df["Property"] = (
df["Property"]
.replace(
["A121", "A122", "A123", "A124"],
["real estate", "soc. savings / life insurance", "car or other", "unknown"],
)
.astype("category")
)
df["OtherInstalmentPlans"] = (
df["OtherInstalmentPlans"].replace(["A141", "A142", "A143"], ["bank", "stores", "none"]).astype("category")
)
df["Housing"] = df["Housing"].replace(["A151", "A152", "A153"], ["rent", "own", "for free"]).astype("category")
df["Job"] = (
df["Job"]
.replace(
["A171", "A172", "A173", "A174"],
[
"unemployed / unskilled-non-resident",
"unskilled-resident",
"skilled employee / official",
"management / self-employed / highly qualified employee / officer",
],
)
.astype("category")
)
df["Telephone"] = df["Telephone"].replace(["A191", "A192"], ["none", "yes"]).astype("category")
df["ForeignWorker"] = df["ForeignWorker"].replace(["A201", "A202"], ["yes", "no"]).astype("category")
We can plot a few statistics to see if there is a risk of bias, due to
unbalanced data or other factors. Let's begin with checking the data we
loaded.
.. code:: python
df
We can now plot the histograms for:
(1) binary target output for good (1) or bad (0) credit risk;
(2) binary ForeignWorker feature;
(3) binary target output for individuals with ForeignWorker feature
equals to "yes";
(4) binary target output for individuals with ForeignWorker feature
equals to "no".
.. code:: python
fig, (ax1, ax2) = plt.subplots(ncols=2)
fig.tight_layout(pad=5.0)
sns.countplot(df['target'], ax=ax1)
ax1.set_title('Target')
sns.countplot(df['ForeignWorker'], ax=ax2)
ax2.set_title('ForeignWorker')
plt.show()
We can note that our dataset unbalanced, having more than double
positive examples compared to negative ones. The unbalance is ever
higher concerning the two subgroups of foreign against local workers.
.. code:: python
fig, (ax1, ax2) = plt.subplots(ncols=2)
fig.tight_layout(pad=5.0)
sns.countplot(df[df["ForeignWorker"] == 'no']['target'], ax=ax1)
ax1.set_title('Target | Non ForeignWorker')
sns.countplot(df[df["ForeignWorker"] == 'yes']['target'], ax=ax2)
ax2.set_title('Target | ForeignWorker')
plt.show()
Finally, here we have a possible source of bias: the proportion between
positive and negative target label is different between the two
subgroups.
.. code:: python
bad_credit_risk_foreign = 1.0 - np.array(df[df["ForeignWorker"] == 'yes']['target']).mean()
bad_credit_risk_local = 1.0 - np.array(df[df["ForeignWorker"] == 'no']['target']).mean()
print(f"Proportion of bad credit risk for foreign workers: {bad_credit_risk_foreign}")
print(f"Proportion of bad credit risk for non foreign workers: {bad_credit_risk_local}")
Fairness metrics
----------------
We are now interested in measuring the unfairness (bias) of our model
:math:`f` with respect to the sensitive attribute gender (i.e., *is our
model discriminating a specific group of people?*). It is important to
note that there is no consensus on a unique definition of fairness of a
ML model. In fact, some of the most common definitions are even
conflicting.
In this tutorial we decided to select a commonly used measure for bias
called Statistical Parity (SP). SP for a binary classification problem
requires the trained model to have the same probability of predicting a
positive label among the different subgourps :math:`A` and :math:`B`.
Empirically, we are interested in controlling the amount of violation of
this constraint, and we define the Difference in Statistical Parity
(DSP). Ideally, we would like this quantity to be small than a certain
threshold value :math:`\epsilon`:
.. math:: DSP(f) = \Big| \mathbb{P}_{(x,y)} \big[ f(x)>0 \, \big| \, x \text{ in group } A \big] - \mathbb{P}_{(x,y)} \big[ f(x)>0 \, \big| \, x \text{ in group } B \big] \Big| \leq \epsilon.
Another possible definition is called `Equal
Opportunity `__
(EO), defined as the property of having a similar True Positive Rate
among the different groups. Also in this case we can define the
Difference in Equal Opportunity (DEO), and our goal is to keep it
smaller that :math:`\epsilon`:
.. math:: DEO(f) = \Big| \mathbb{P}_{(x,y)} \big[ f(x)>0 \, \big| \, x \text{ in group } A, y = 1 \big] - \mathbb{P}_{(x,y)} \big[ f(x)>0 \, \big| \, x \text{ in group } B, y = 1 \big] \Big| \leq \epsilon
A lower value of these measures means a fairer model :math:`f` with
respect to the sensitive feature.
.. code:: python
from sklearn.metrics import accuracy_score
def DSP(model, X, Y, groups):
# model: the trained model
# X: our data of n examples with d features
# Y: binary labels of our n examples (1 = positive)
# groups: a list of n values binary values defining two different subgroups of the populations
fY = model.predict(X)
sp = [0, 0]
sp[0] = float(len([1 for idx, fy in enumerate(fY) if fy == 1 and groups[idx] == 0])) / len([1 for idx, fy in enumerate(fY) if groups[idx] == 0])
sp[1] = float(len([1 for idx, fy in enumerate(fY) if fy == 1 and groups[idx] == 1])) / len([1 for idx, fy in enumerate(fY) if groups[idx] == 1])
return abs(sp[0] - sp[1])
def DEO(model, X, Y, groups):
# model: the trained model
# X: our data of n examples with d features
# Y: binary labels of our n examples (1 = positive)
# groups: a list of n values binary values defining two different subgroups of the populations
fY = model.predict(X)
eo = [0, 0]
eo[0] = float(len([1 for idx, fy in enumerate(fY) if fy == 1 and groups[idx] == 0 and Y[idx] == 1])) / len([1 for idx, fy in enumerate(fY) if groups[idx] == 0 and Y[idx] == 1])
eo[1] = float(len([1 for idx, fy in enumerate(fY) if fy == 1 and groups[idx] == 1 and Y[idx] == 1])) / len([1 for idx, fy in enumerate(fY) if groups[idx] == 1 and Y[idx] == 1])
return abs(eo[0] - eo[1])
Encoding of the categorical features of our dataset:
.. code:: python
to_hot_encode = ['CheckingAC_Status', 'CreditHistory', 'Purpose', 'SavingsAC', 'Employment', 'SexAndStatus', 'OtherDebts', 'Property', 'OtherInstalmentPlans', 'Housing', 'Job', 'Telephone', 'ForeignWorker']
coded_df = pd.concat((df[to_hot_encode],
pd.get_dummies(df, columns=to_hot_encode, drop_first=True)),
axis=1)
for col in to_hot_encode:
coded_df.pop(col)
coded_df
Standard Bayesian Optimization
------------------------------
We choose a RandomForest classifier as our base ML model for this task,
and we run standard BO to optimize the hyperparameters of it. While we
optimize the accuracy of our model, we keep track of its DSP.
.. code:: python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
Y = coded_df.pop('target').values
X = coded_df.values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
print('X shape:', X.shape)
print('Y shape:', Y.shape)
.. code:: python
def process_training_history(task_dicts, start_timestamp,
runtime_fn=None):
task_dfs = []
for task_id in task_dicts:
task_df = pd.DataFrame(task_dicts[task_id])
error = 1.0 - task_df["objective"]
is_fair = (task_df["constraint_metric"] < 0.0).values
if is_fair:
fair_error = error
else:
fair_error = 1.0 # worst possible value
task_df = task_df.assign(task_id=task_id,
error=error,
fair_error=fair_error,
is_fair=is_fair)
task_dfs.append(task_df)
result = pd.concat(task_dfs, axis="index", ignore_index=True, sort=True)
# calculate incumbent best -- the cumulative minimum of the error.
result = result.assign(best=result["error"].cummin())
result = result.assign(fair_best=result["fair_error"].cummin())
return result
.. code:: python
ACTIVE_METRIC_NAME = 'obj_metric'
CONSTRAINT_METRIC_NAME = 'constr_metric'
REWARD_ATTR_NAME = 'objective'
FAIRNESS_THRESHOLD = 0.01
FAIRNESS_DEFINITION = DSP # You can use any fairness definition, such as DSP or DEO
We tune RF on a 3-dimensional search space:
- min\_samples\_split in [0.01, 0.5] (log scaled)
- tree maximum depth in {1, 2, 3, 4, 5}
- criterion for quality of split in {Gini, Entropy}
.. code:: python
def create_train_fn_constraint(fairness_threshold, fairness_definition):
@ag.args(min_samples_split=ag.space.Real(lower=0.01, upper=1.0, log=True),
max_depth=ag.space.Int(lower=1, upper=50),
criterion=ag.space.Categorical('gini', 'entropy')
)
def run_opaque_box(args, reporter):
opaque_box_eval = opaque_box(args.min_samples_split,
args.max_depth,
args.criterion,
fairness_threshold,
fairness_definition)
reporter(objective=opaque_box_eval[ACTIVE_METRIC_NAME],
constraint_metric=opaque_box_eval[CONSTRAINT_METRIC_NAME])
return run_opaque_box
run_opaque_box = create_train_fn_constraint(fairness_threshold=FAIRNESS_THRESHOLD,
fairness_definition=FAIRNESS_DEFINITION)
# fairness constraint: DSP < 0.01
.. code:: python
def opaque_box(min_samples_split, max_depth, criterion, fairness_threshold, fairness_definition):
classifier = RandomForestClassifier(max_depth=max_depth,
min_samples_split=min_samples_split,
criterion=criterion)
classifier.fit(X_train, Y_train)
DSP_ForeignWorker = fairness_definition(classifier, X_test, Y_test, df["ForeignWorker"].values == "yes")
accuracy = accuracy_score(classifier.predict(X_test), Y_test)
evaluation_dict = {}
evaluation_dict[ACTIVE_METRIC_NAME] = accuracy
evaluation_dict[CONSTRAINT_METRIC_NAME] = DSP_ForeignWorker - fairness_threshold # If DSP < fairness threshold, then fair
return evaluation_dict
.. code:: python
# Create scheduler and searcher:
# First get_config are random, the remaining ones use constrained BO
search_options = {
'random_seed': SEED,
'num_fantasy_samples': 5,
'num_init_random': 1,
'debug_log': True}
myscheduler = ag.scheduler.FIFOScheduler(
run_opaque_box,
searcher='bayesopt',
search_options=search_options,
num_trials=15,
reward_attr=REWARD_ATTR_NAME
)
# Run HPO experiment
myscheduler.run()
myscheduler.join_jobs()
.. code:: python
results_df_standard = process_training_history(myscheduler.training_history.copy(),
start_timestamp=myscheduler._start_time)
results_df_standard.head()
Let's look at the empirical probability that standard BO finds a fair
model (with respect to the constraint of DPS < 0.01), the average DSP
and classification error.
.. code:: python
print('Average DSP (unfairness):', np.mean(results_df_standard['constraint_metric'] + FAIRNESS_THRESHOLD))
print('Average classification error:', np.mean(results_df_standard['best']))
A higher value of DSP can potentially highlight a discriminatory
behavior of our model.
A possible way to be more effective in finding unbiased models is to
search for set of hyperparamters that is able to generate a model that
is both fair and accurate. A solution to find accurate model under
fairness constraints is provided by the Constrained Bayesian
Optimization framework (CBO). This technique allows us to specify any
constraint, such as “DSP < 0.1” and searching for hyperparameters able
to generate accurate models such that the constrained is not violated.
Constrained Bayesian Optimization
---------------------------------
In our example, using German Credit data we can now easily run CBO
trying to find the most accurate model such that DSP<0.1. Also in this
case we selected a Random Forest as our base ML model.
.. code:: python
run_opaque_box = create_train_fn_constraint(fairness_threshold=FAIRNESS_THRESHOLD,
fairness_definition=FAIRNESS_DEFINITION)
# fairness constraint: DSP < 0.01
.. code:: python
# Create scheduler and searcher:
# First get_config are random, the remaining ones use constrained BO
search_options = {
'random_seed': SEED,
'num_fantasy_samples': 5,
'num_init_random': 1,
'debug_log': True}
myscheduler = ag.scheduler.FIFOScheduler(
run_opaque_box,
searcher='constrained_bayesopt',
search_options=search_options,
num_trials=15,
reward_attr=REWARD_ATTR_NAME,
constraint_attr='constraint_metric'
)
# Run HPO experiment
myscheduler.run()
myscheduler.join_jobs()
.. code:: python
results_df = process_training_history(myscheduler.training_history.copy(),
start_timestamp=myscheduler._start_time)
results_df.head()
Let's see the empirical probability that the *constrained* BO procedure
finds a fair model (with respect to the constraint of DPS < 0.01), the
average DSP and classification error.
.. code:: python
print('Average DSP (unfairness):', np.mean(results_df['constraint_metric'] + FAIRNESS_THRESHOLD))
print('Average classification error:', np.mean(results_df['error']))
FairBO is able to better focus on the fair region of the search space.
Standard vs CBO
---------------
The presented method is model-agnostic and is able to handle any
statistical notion of fairness.
.. code:: python
for method_name in ['BO', 'CBO']:
if method_name == 'BO':
df = results_df_standard
else:
df = results_df
unfairness = list(df['constraint_metric'] + FAIRNESS_THRESHOLD)
accuracies = list(1.0 - df['error'])
ub = FAIRNESS_THRESHOLD
scaling = 1
best_acc = 0.0
unf_best = 1.0
for acc, unf in zip(accuracies, unfairness):
if unf <= ub * scaling and acc > best_acc:
best_acc = acc
unf_best = unf
cmap = sns.cubehelix_palette(as_cmap=True)
f, ax = plt.subplots()
points = ax.scatter(x=accuracies, y=unfairness, c=list(range(len(accuracies))), alpha=0.5, cmap=cmap)
plt.scatter(best_acc, unf_best, color="red", marker='*', label='Fair best', s=100, alpha=1)
plt.xlim(0.67, 0.78)
plt.ylim(-0.01, 0.115)
plt.legend(loc='upper right')
plt.axhline(y=ub * scaling, xmin=0, xmax=1, color='black', linestyle='--', alpha=1, linewidth=1)
f.colorbar(points, label='iteration')
plt.xlabel('validation accuracy')
plt.ylabel(f'DSP')
algorithm_name = 'Random Forest'
dataset_name = 'German'
plt.title(f'{method_name} ({algorithm_name} on {dataset_name})')
In the plots above, the horizontal line is the fairness constraint, set
to DSP ≤ 0.01, and darker dots correspond to later BO iterations.
Standard BO can get stuck in high-performing yet unfair regions, failing
to return a well-performing, feasible solution. CBO is able to focus the
exploration over the fair area of the hyperparameter space, and finds a
more accurate fair solution.
The presented method is model-agnostic and is able to handle any
statistical notion of fairness. For instance, you can repeat the
experiments plugging in a constraint on DEO instead of DSP.