Alluvial plots can display higher-dimensional data on a plane, which makes them well suited to plotting the response of a statistical model to changes in the input data across multiple dimensions. The practical limit is 4 dimensions, whereas conventional partial dependence plots are limited to 2.

Briefly, the 4 variables with the highest feature importance for a given model are selected, and for each of them 5 values spread over the variable's range are chosen. A grid of all possible combinations of these values is then created (5^4 = 625 combinations with the defaults). All non-plotted variables are set to the values found in the first row of the training data set, and model predictions are generated for this artificial data space. The process is repeated for each row of the training data set, and the model response is averaged over all rows at the end. Each combination is plotted as a flow, coloured by the bin corresponding to the average model response for that combination.
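The procedure described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: it uses `lm` on `mtcars`, picks two variables by hand instead of ranking them by importance, and the names `top_vars`, `other_vars`, `grid` and `avg_response` are made up for this example.

```r
# Illustrative sketch only; easyalluvial's internal code may differ.
df <- mtcars
m <- lm(disp ~ ., data = df)

# Assumption: these two variables stand in for the top-importance variables.
top_vars <- c("wt", "hp")
other_vars <- setdiff(names(df), c(top_vars, "disp"))
bins <- 5

# 5 values spread evenly over the range of each selected variable
grid_values <- lapply(df[top_vars], function(x) {
  seq(min(x), max(x), length.out = bins)
})

# Grid of all possible combinations (5^2 = 25 rows here)
grid <- expand.grid(grid_values)

# For every training row: set the non-plotted variables to that row's values
# and predict over the whole grid. Result: one column per training row.
preds <- vapply(seq_len(nrow(df)), function(i) {
  newdata <- grid
  for (v in other_vars) newdata[[v]] <- df[[v]][i]
  predict(m, newdata = newdata)
}, numeric(nrow(grid)))

# Average model response per grid combination, then bin it for colouring
avg_response <- rowMeans(preds)
response_bin <- cut(avg_response, breaks = bins)
```

Each row of `grid` corresponds to one flow in the plot, and `response_bin` is the kind of grouping that would determine its colour.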

get_pdp_predictions(
  df,
  imp,
  m,
  degree = 4,
  bins = 5,
  .f_predict = predict,
  parallel = FALSE
)

Arguments

df

dataframe, training data

imp

dataframe, with no more than two columns: one numeric column containing importance measures and one character or factor column containing the corresponding variable names as found in the training data.

m

model object

degree

integer, number of top important variables to select. For plotting, more than 4 will result in too many flows and the alluvial plot will not be very readable. Default: 4

bins

integer, number of bins for numeric variables. Increasing this number might result in too many flows. Default: 5

.f_predict

corresponding model predict() function. Needs to accept `m` as the first parameter and use the `newdata` parameter. Supply a wrapper for predict functions with x-y syntax. With parallel processing, the predict method of an object class is not always imported correctly into the worker environment. The correct predict method can be passed via this parameter, for example randomForest:::predict.randomForest. Note that many modeling packages do not export their predict method explicitly, so it can only be accessed using :::. Default: predict

parallel

logical, turn on parallel processing. Default: FALSE
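The wrapper requirement for `.f_predict` can be illustrated as follows. This is a hedged sketch: `fit_xy`, the class `toy_xy`, its `newx` argument and `f_predict_xy` are hypothetical names invented for this example. They stand in for any modeling package whose predict method expects a matrix instead of `newdata`.

```r
# Hypothetical toy model with x-y syntax (not from any real package):
# fitted from a matrix `x` and a vector `y`, predicts from a matrix `newx`.
fit_xy <- function(x, y) {
  structure(
    list(coef = qr.solve(cbind(1, x), y), vars = colnames(x)),
    class = "toy_xy"
  )
}

predict.toy_xy <- function(object, newx, ...) {
  drop(cbind(1, newx) %*% object$coef)
}

x <- as.matrix(mtcars[, c("wt", "hp")])
m_xy <- fit_xy(x, mtcars$disp)

# Wrapper with the signature described above: the model comes first and
# the data arrives as a data frame via `newdata`.
f_predict_xy <- function(m, newdata, ...) {
  newx <- as.matrix(newdata[, m$vars, drop = FALSE])
  predict(m, newx = newx)
}

pred_xy <- f_predict_xy(m_xy, newdata = mtcars)
```

A wrapper of this shape would then be supplied as `.f_predict = f_predict_xy`. The conversion from data frame to matrix lives in the wrapper, so the model's own predict method does not need to change.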

Value

vector, predictions

Details

For more on partial dependence plots see [https://christophm.github.io/interpretable-ml-book/pdp.html].

Parallel Processing

We use the `furrr` and `future` packages to parallelize some of the computational steps for calculating the predictions. It is up to the user to register a compatible backend (see future::plan()).

Examples

 df = mtcars2[, ! names(mtcars2) %in% 'ids' ]
 m = randomForest::randomForest( disp ~ ., df)
 imp = m$importance

 pred = get_pdp_predictions(df, imp
                            , m
                            , degree = 3
                            , bins = 5)
#> Getting partial dependence plot preditions. This can take a while. See easyalluvial::get_pdp_predictions() `Details` on how to use multiprocessing

# parallel processing --------------------------
if (FALSE) {
 future::plan("multisession")
 
 # note that we have to pass the predict method via .f_predict otherwise
 # it will not be available in the worker's environment.
 
 pred = get_pdp_predictions(df, imp
                            , m
                            , degree = 3
                            , bins = 5
                            , parallel = TRUE
                            , .f_predict = randomForest:::predict.randomForest)
}