3 Do-it-yourself
Most of the methods presented in this book are available in both R and Python and can be used in a uniform way. However, each of these languages also has many other tools for Explanatory Model Analysis.
In this book, we introduce various methods for instance-level and dataset-level exploration and explanation of predictive models. In each chapter, there is a section with code snippets for R and Python that shows how to use a particular method.
3.1 Do-it-yourself with R
In this section, we provide a short description of the steps that are needed to set up the R environment with the required libraries.
3.1.1 What to install?
Obviously, the R software (R Core Team 2018) is needed. It is always a good idea to use the newest version; R version 3.6 or newer is recommended. It can be downloaded from the CRAN website https://cran.r-project.org/.
A good editor makes working with R much easier. There are plenty of choices, but, especially for beginners, consider the RStudio editor, an open-source and enterprise-ready tool for R. It can be downloaded from https://www.rstudio.com/.
Once R and the editor are available, the required packages should be installed.
The most important one is the DALEX package in version 1.0 or newer. It is the entry point to solutions introduced in this book. The package can be installed by executing the following command from the R command line:
install.packages("DALEX")
Installation of DALEX will automatically take care of installing its dependencies (the packages it requires), such as the ggplot2 package for data visualization, or ingredients and iBreakDown with specific methods for model exploration.
3.1.2 How to work with DALEX?
To conduct model exploration with DALEX, a model first has to be created. Then the model has to be prepared for exploration.
There are many packages in R that can be used to construct a model. Some packages are algorithm-specific, like randomForest for random forest classification and regression models (Liaw and Wiener 2002), gbm for generalized boosted regression models (Ridgeway 2017), rms with extensions for generalized linear models (Harrell Jr 2018), and many others. There are also packages that can be used for constructing models with different algorithms; these include the h2o package (LeDell et al. 2019), caret (Kuhn 2008) and its successor parsnip (Kuhn and Vaughan 2019), a very powerful and extensible framework mlr (Bischl et al. 2016), or keras, which is a wrapper to the Python library with the same name (Allaire and Chollet 2019).
While it is great to have such a large choice of tools for constructing models, the disadvantage is that different packages have different interfaces and different arguments. Moreover, model-objects created with different packages may have different internal structures. The main goal of the DALEX package is to create a level of abstraction around a model that makes it easier to explore and explain the model. Figure 3.1 illustrates the contents of the package. In particular, function DALEX::explain is THE function for model wrapping. It has only one required argument, model, which specifies the model-object with the fitted form of the model. However, the function accepts additional arguments that extend its functionalities. They are discussed in Section 4.2.6.
3.1.3 How to work with archivist?
As we will focus on the exploration of predictive models, we prefer not to waste space or time on replicating the code necessary for model development. This is where the archivist package helps.
The archivist package (Biecek and Kosinski 2017) is designed to store, share, and manage R objects. We will use it to easily access R objects for pre-constructed models and pre-calculated explainers. To install the package, the following command should be executed in the R command line:
install.packages("archivist")
Once the package has been installed, function aread() can be used to retrieve R objects from any remote repository. For this book, we use a GitHub repository models hosted at https://github.com/pbiecek/models. For instance, to download a model with the md5 hash ceb40, the following command has to be executed:
archivist::aread("pbiecek/models/ceb40")
Since the md5 hash ceb40 uniquely defines the model, referring to the repository object results in using exactly the same model and the same explanations. Thus, in the subsequent chapters, pre-constructed models will be accessed with archivist hooks. In the following sections, we will also use archivist hooks when referring to datasets.
3.2 Do-it-yourself with Python
In this section, we provide a short description of the steps that are needed to set up the Python environment with the required libraries.
3.2.1 What to install?
The Python interpreter (Rossum and Drake 2009) is needed. It is always a good idea to use the newest version; Python 3.6 is the minimum recommended version. It can be downloaded from the Python website https://python.org/. A popular environment for a simple Python installation and configuration is Anaconda, which can be downloaded from the website https://www.anaconda.com/.
There are many editors available for Python that allow editing the code in a convenient way. In the data science community a very popular solution is Jupyter Notebook. It is a web application that allows creating and sharing documents that contain live code, visualizations, and descriptions. Jupyter Notebook can be installed from the website https://jupyter.org/.
Once Python and the editor are available, the required libraries should be installed. The most important one is the dalex library, currently in version 0.2.0. The library can be installed with pip by executing the following instruction from the command line:
pip install dalex
Installation of dalex will automatically take care of other required libraries.
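After installation, it can be useful to confirm that the interpreter meets the recommended minimum and that the library is visible to it. The following is a minimal sketch that uses only the standard library; the helper names installed_version and python_is_recent_enough are hypothetical, and importlib.metadata is assumed to be available (it requires Python 3.8 or newer).

```python
import sys
from importlib import metadata  # part of the standard library since Python 3.8


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def python_is_recent_enough(minimum=(3, 6)):
    """Compare the running interpreter with the minimum recommended version."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print("Python version OK:", python_is_recent_enough())
    print("dalex version:", installed_version("dalex"))
```

If the second line prints None, the library is not installed in the environment that this interpreter uses, which is a common situation when several Python installations coexist.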
3.2.2 How to work with dalex?
There are many libraries in Python that can be used to construct a predictive model. Among the most popular ones are algorithm-specific libraries like catboost (Dorogush, Ershov, and Gulin 2018), xgboost (Chen and Guestrin 2016), and keras (Gulli and Pal 2017), or libraries with multiple ML algorithms like scikit-learn (Pedregosa et al. 2011).
While it is great to have such a large choice of tools for constructing models, the disadvantage is that different libraries have different interfaces and different arguments. Moreover, model-objects created with different libraries may have different internal structures. The main goal of the dalex library is to create a level of abstraction around a model that makes it easier to explore and explain the model.
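To make the idea of a level of abstraction more concrete, the sketch below wraps two toy models with incompatible interfaces so that both can be called in the same way. It is only an illustration of the general pattern and not the implementation used by dalex; the classes MeanModel, MaxModel, and UniformModel are hypothetical names introduced for this example.

```python
from typing import Any, Callable, List, Sequence

Row = Sequence[float]


class MeanModel:
    """Toy model that exposes a batch `predict` method."""

    def predict(self, rows: Sequence[Row]) -> List[float]:
        return [sum(row) / len(row) for row in rows]


class MaxModel:
    """Toy model that exposes a differently named, single-row method."""

    def score_one(self, row: Row) -> float:
        return max(row)


class UniformModel:
    """Hypothetical wrapper that gives every model the same `predict` call."""

    def __init__(
        self,
        model: Any,
        predict_function: Callable[[Any, Sequence[Row]], List[float]],
        label: str,
    ) -> None:
        self.model = model
        self.predict_function = predict_function
        self.label = label

    def predict(self, rows: Sequence[Row]) -> List[float]:
        # Delegate to the model-specific function supplied at wrapping time.
        return self.predict_function(self.model, rows)


wrapped = [
    UniformModel(MeanModel(), lambda m, rows: m.predict(rows), "mean"),
    UniformModel(MaxModel(), lambda m, rows: [m.score_one(r) for r in rows], "max"),
]

data = [[1.0, 2.0, 3.0], [4.0, 4.0, 10.0]]
for w in wrapped:
    # Downstream code no longer needs to know which interface the model has.
    print(w.label, w.predict(data))
```

Once every model is reachable through the same call, any exploration method can be written once against the wrapper rather than separately for each modelling library.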
The Explainer() constructor is THE method for model wrapping. It has only one required argument, model, which specifies the model-object with the fitted form of the model. However, the constructor also takes additional arguments that extend its functionalities. They are discussed in Section 4.3.6. If these additional arguments are not provided by the user, the dalex library will try to extract them from the model. It is a good idea to specify them directly to avoid surprises.
As soon as the model is wrapped by using the Explainer() constructor, all further functionalities can be performed on the resulting object. They will be presented in subsequent chapters, in the subsections titled "Code snippets for Python".
3.2.3 Code snippets for Python
A detailed description of model exploration will be presented in the next chapters. In general, however, the way of working with the dalex library can be described in the following steps:
- Import the dalex library.
- Create an Explainer object. This serves as a wrapper around the model.
- Calculate predictions for the model.
- Calculate specific explanations.
- Print calculated explanations.
- Plot calculated explanations.
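The steps above can be sketched as follows. This is an illustrative example under several assumptions: the data are synthetic, the model is a small scikit-learn random forest chosen only for convenience, and the calls follow the interface of recent dalex releases (Explainer, predict, predict_parts, result, plot), which may differ in detail between versions. The function name run_dalex_workflow is hypothetical, and numpy, pandas, scikit-learn, and dalex are assumed to be installed.

```python
def run_dalex_workflow():
    """Walk through the listed steps on synthetic data and return the results."""
    # Step 1: import the dalex library (the other imports only build a toy model).
    # Imports are kept inside the function so that defining it does not
    # require the libraries to be installed yet.
    import dalex as dx
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic data and a small model; both are illustrative assumptions.
    rng = np.random.default_rng(0)
    X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    y = 2.0 * X["x1"] - X["x2"] + rng.normal(scale=0.1, size=200)
    model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

    # Step 2: create an Explainer object that wraps the model.
    # Data, target, and label are given explicitly rather than left to be guessed.
    explainer = dx.Explainer(model, X, y, label="toy random forest", verbose=False)

    # Step 3: calculate predictions for the model.
    new_observation = X.iloc[[0]]
    prediction = explainer.predict(new_observation)

    # Step 4: calculate a specific explanation (here, a break-down).
    breakdown = explainer.predict_parts(new_observation, type="break_down")

    # Step 5: print the calculated explanation (stored as a pandas DataFrame).
    print(breakdown.result)

    # Step 6: plot it; show=False returns the figure instead of displaying it.
    figure = breakdown.plot(show=False)
    return prediction, breakdown, figure
```

Calling run_dalex_workflow() in a notebook and replacing show=False with the default displays the plot directly. The same pattern applies to other explanations: only the method called in step 4 changes.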
References
Allaire, JJ, and François Chollet. 2019. keras: R Interface to Keras. https://CRAN.R-project.org/package=keras.
Biecek, Przemyslaw, and Marcin Kosinski. 2017. “archivist: An R Package for Managing, Recording and Restoring Data Analysis Results.” Journal of Statistical Software 82 (11): 1–28. https://doi.org/10.18637/jss.v082.i11.
Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. http://jmlr.org/papers/v17/15-066.html.
Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. KDD ’16. ACM. https://doi.org/10.1145/2939672.2939785.
Dorogush, Anna Veronika, Vasily Ershov, and Andrey Gulin. 2018. “CatBoost: gradient boosting with categorical features support.” CoRR abs/1810.11363. http://arxiv.org/abs/1810.11363.
Gulli, Antonio, and Sujit Pal. 2017. Deep Learning with Keras. Birmingham, UK: Packt Publishing Ltd.
Harrell Jr, Frank E. 2018. rms: Regression Modeling Strategies. https://CRAN.R-project.org/package=rms.
Kuhn, Max. 2008. “Building Predictive Models in R Using the Caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.
Kuhn, Max, and Davis Vaughan. 2019. parsnip: A Common API to Modeling and Analysis Functions. https://CRAN.R-project.org/package=parsnip.
LeDell, Erin, Navdeep Gill, Spencer Aiello, Anqi Fu, Arno Candel, Cliff Click, Tom Kraljevic, et al. 2019. h2o: R Interface for H2O. https://CRAN.R-project.org/package=h2o.
Liaw, Andy, and Matthew Wiener. 2002. “Classification and regression by randomForest.” R News 2 (3): 18–22. http://CRAN.R-project.org/doc/Rnews/.
Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Ridgeway, Greg. 2017. gbm: Generalized Boosted Regression Models. https://CRAN.R-project.org/package=gbm.
Rossum, Guido van, and Fred L. Drake. 2009. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace.