
get predictions compatible with the partial dependence plotting method
Source:R/alluvial_model_response.R
get_pdp_predictions.Rd
Alluvial plots can display higher-dimensional data on a plane, which makes them well suited to plotting the response of a statistical model to changes in the input data across multiple dimensions. The practical limit here is 4 dimensions, while conventional partial dependence plots are limited to 2 dimensions.
Briefly, the 4 variables with the highest feature importance for a given model are selected, and 5 values spread over the variable range are selected for each. Then a grid of all possible combinations is created. All non-plotted variables are set to the values found in the first row of the training data set. Model predictions are then generated from this artificial data space. This process is repeated for each row in the training data set, and the overall model response is averaged at the end. Each of the possible combinations is plotted as a flow, which is coloured by the bin corresponding to the average model response generated by that particular combination.
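The procedure described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `lm()` stands in for the model, the plotted variables `wt` and `hp` are picked by hand instead of by feature importance, the grid values are assumed to be equally spaced over each variable's range, and `cut()` is assumed for binning the averaged response.

```r
# Minimal sketch of the averaging procedure, base R only.
# Assumptions: lm() as the model, hand-picked plotted variables,
# equally spaced grid values, cut() for binning the response.
df <- mtcars
m <- lm(disp ~ ., data = df)

plot_vars <- c("wt", "hp")  # hypothetical "most important" variables
bins <- 5

# `bins` values spread over the range of each plotted variable
grid_values <- lapply(df[plot_vars], function(x) {
  seq(min(x), max(x), length.out = bins)
})

# grid of all possible combinations of those values
grid <- expand.grid(grid_values)

# all remaining predictors are held at the values of one training row
other_vars <- setdiff(names(df), c(plot_vars, "disp"))

# repeat for each training row: one column of grid predictions per row
pred_per_row <- sapply(seq_len(nrow(df)), function(i) {
  newdata <- grid
  newdata[other_vars] <- df[rep(i, nrow(grid)), other_vars]
  predict(m, newdata = newdata)
})

# average the model response over all training rows, then bin it
grid$pred <- rowMeans(pred_per_row)
grid$pred_bin <- cut(grid$pred, breaks = bins)

head(grid)
```

Each row of `grid` corresponds to one flow in the alluvial plot, and `pred_bin` is the value a flow would be coloured by.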
Usage
get_pdp_predictions(
df,
imp,
m,
degree = 4,
bins = 5,
.f_predict = predict,
parallel = FALSE
)
Arguments
- df
dataframe, training data
- imp
dataframe, with not more than two columns: one numeric column containing importance measures and one character or factor column containing the corresponding variable names as found in the training data.
- m
model object
- degree
integer, number of top important variables to select. For plotting, more than 4 will result in too many flows and the alluvial plot will not be very readable. Default: 4
- bins
integer, number of bins for numeric variables. Increasing this number might result in too many flows. Default: 5
- .f_predict
corresponding model predict() function. Needs to accept `m` as the first parameter and use the `newdata` parameter. Supply a wrapper for predict functions with x-y syntax. For parallel processing, the predict method of object classes will not always get imported correctly to the worker environment. The correct predict method can be passed via this parameter, for example randomForest:::predict.randomForest. Note that many modeling packages do not export the predict method explicitly, so it can only be accessed using :::.
- parallel
logical, turn on parallel processing. Default: FALSE
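As an illustration of the wrapper mentioned for `.f_predict`, the sketch below adapts an x-y style predict function to the `m` / `newdata` shape. The names `fit_xy()`, `predict_xy()` and `f_predict_wrapper()` are hypothetical stand-ins written for this example. They do not come from any package, and a real modelling package will need its own conversion step.

```r
# Hypothetical x-y style model: fit_xy() and predict_xy() stand in for a
# modelling package whose predict function expects a numeric matrix `x`
# instead of a `newdata` data frame.
fit_xy <- function(x, y) {
  fit <- stats::lm.fit(cbind(1, x), y)  # least squares with an intercept
  list(coef = fit$coefficients, x_names = colnames(x))
}

predict_xy <- function(object, x) {
  drop(cbind(1, x) %*% object$coef)
}

x <- as.matrix(mtcars[, c("wt", "hp")])
m_xy <- fit_xy(x, mtcars$disp)

# Wrapper: takes the model as first argument and a `newdata` data frame,
# converts the data frame to the matrix the x-y function expects.
f_predict_wrapper <- function(m, newdata) {
  x_new <- as.matrix(newdata[, m$x_names, drop = FALSE])
  predict_xy(m, x_new)
}

preds <- f_predict_wrapper(m_xy, newdata = mtcars)
head(preds)
```

A wrapper of this shape would then be supplied as `.f_predict = f_predict_wrapper`.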
Details
For more on partial dependence plots see [https://christophm.github.io/interpretable-ml-book/pdp.html].
Parallel Processing
We use the `furrr` and `future` packages to parallelize some of the computational steps for calculating the predictions. It is up to the user to register a compatible backend (see `future::plan()`).
Examples
df = mtcars2[, ! names(mtcars2) %in% 'ids' ]
m = randomForest::randomForest( disp ~ ., df)
imp = m$importance
pred = get_pdp_predictions(df, imp
, m
, degree = 3
, bins = 5)
#> Getting partial dependence plot preditions. This can take a while. See easyalluvial::get_pdp_predictions() `Details` on how to use multiprocessing
# parallel processing --------------------------
if (FALSE) { # \dontrun{
future::plan("multisession")
# note that we have to pass the predict method via .f_predict otherwise
# it will not be available in the worker's environment.
pred = get_pdp_predictions(df, imp
, m
, degree = 3
, bins = 5
, parallel = TRUE
, .f_predict = randomForest:::predict.randomForest)
} # }