Reproducing canonical examples from the literature
Source: vignettes/reproductions.Rmd
Purpose
This vignette reproduces, with the package, five canonical examples from the utility-analysis literature. The aim is twofold: to anchor the package in the historical record by showing that the implemented functions return the published quantities, and to provide pedagogical worked examples that connect each function to a concrete substantive question.
Conventions: illustrative versus digit-for-digit
A reproduction can fall into one of three categories. A digit-for-digit reproduction returns the same numerical value (within numerical-integration tolerance) as the published source, given the same inputs. An illustrative reproduction recovers the qualitative pattern and order of magnitude using parameters that are plausible but not identical to the source. A structural reproduction iterates the published model across the parameter space the source explored (typically a published table), so that the entire family of figures is reproduced rather than a single cell. Most published utility analyses do not tabulate every input fully: matrices are summarised, SDy is reported as a single point estimate, and some adjustments are described in prose without numerical detail. A digit-for-digit match for an arbitrary cell is therefore not always achievable from the published material alone. Each section below states explicitly which convention applies.
Schmidt, Hunter, McKenzie, and Muldrow (1979): the Programmer Aptitude Test
Convention: structural reproduction of Schmidt et al.’s (1979) Table 2 across the selection-ratio scenarios they considered. The package returns the values that the published model and inputs imply; the specific dollar figures most often cited from the paper correspond to a particular row of that table (see below).
Schmidt, Hunter, McKenzie, and Muldrow (1979) is the founding empirical demonstration of the modern monetary utility framework. They computed the incremental utility of using the Programmer Aptitude Test (PAT) versus random selection for the U.S. federal government computer-programmer workforce. The published inputs, as used in the code below, were a validity of .76 (corrected for unreliability and range restriction), an SDy of $10,413 in 1979 dollars, 618 programmers selected per year, and a mean tenure of 9.69 years. Schmidt et al. (1979) presented their results across multiple selection-ratio scenarios, since the actual selection ratio depends on the size of the applicant pool and is itself a policy variable. The package reproduces the family of their estimates by iterating the Brogden-Cronbach-Gleser (BCG) model across the same set of selection ratios.
schmidt_pat <- data.frame(
selection_ratio = c(.05, .10, .20, .30, .40, .50, .80)
)
schmidt_pat$one_year <- vapply(
schmidt_pat$selection_ratio,
function(sr) bcg_utility(
validity = .76,
selection_ratio = sr,
sdy = 10413,
n_selected = 618,
tenure = 1,
cost = 0
)$net_utility,
numeric(1)
)
schmidt_pat$multi_year <- vapply(
schmidt_pat$selection_ratio,
function(sr) bcg_utility(
validity = .76,
selection_ratio = sr,
sdy = 10413,
n_selected = 618,
tenure = 9.69,
cost = 0
)$net_utility,
numeric(1)
)
schmidt_pat
#> selection_ratio one_year multi_year
#> 1 0.05 10088270 97755337
#> 2 0.10 8583234 83171533
#> 3 0.20 6846158 66339269
#> 4 0.30 5668291 54925741
#> 5 0.40 4723789 45773513
#> 6 0.50 3902276 37813056
#> 7 0.80 1711539 16584817
The dollar figures most commonly cited from Schmidt et al. (1979) in subsequent reviews and textbooks (approximately $5.6 million one-year utility and $54 million multi-year utility) correspond to the selection_ratio = .30 row of the table, which is closest to a moderate selectivity scenario typical of competitive federal hiring at the time. With the actual operational selection ratio of .80 (the lowest selectivity considered), the figures are smaller (approximately $1.7 million one-year and $16.6 million multi-year) but still in the multi-million range.
The substantive interpretation of any cell of this table requires care. The validity was corrected for unreliability and range restriction, and the comparison baseline was random selection (both of which inflate the estimate relative to the operational benefit of switching from one defensible selection procedure to another; see Sturman, 2000, 2001). The Schmidt et al. (1979) estimates are best read as population-level upper bounds, demonstrating the order of magnitude of selection utility under favourable assumptions, rather than as forecasts of the realised return for a specific organisation.
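The table above can be cross-checked against the closed-form BCG expression using base R only. This is a minimal sketch: `bcg_sketch()` is a hypothetical helper written for this illustration, not a package function, and it assumes top-down selection on a standard-normal predictor with cost entered as a single total.

```r
# Closed-form BCG net utility (hypothetical helper, base R only).
# Assumes a standard-normal predictor and strict top-down selection.
bcg_sketch <- function(validity, selection_ratio, sdy, n_selected,
                       tenure, cost = 0) {
  z_cut  <- qnorm(1 - selection_ratio)        # predictor cutoff score
  mean_z <- dnorm(z_cut) / selection_ratio    # mean z of selected applicants
  n_selected * tenure * validity * sdy * mean_z - cost
}

sr <- c(.05, .10, .20, .30, .40, .50, .80)
one_year   <- bcg_sketch(.76, sr, 10413, 618, tenure = 1)
multi_year <- bcg_sketch(.76, sr, 10413, 618, tenure = 9.69)
round(cbind(selection_ratio = sr, one_year, multi_year))
```

Because utility is linear in tenure, the multi-year column is the one-year column multiplied by 9.69; the selection ratio enters only through the mean standardised score of those selected.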
Murphy (1986): the cost of rejected offers
Convention: illustrative reproduction of the three cases discussed by Murphy (1986).
Murphy (1986), building on Hogarth and Einhorn (1976), demonstrated
that ignoring the possibility of rejected offers systematically inflates
selection utility. He distinguished three cases: random rejection
(acceptance probability constant across the predictor distribution),
correlated rejection (acceptance probability decreasing with predictor
score because high-scoring candidates have more outside options), and
the special case in which the very top candidates almost always reject.
The function offer_rejection_adjustment() implements these
three modes.
# All three cases share the same expected standardised score among offered
# candidates:
z_offered <- selected_mean_z(.20)
# Case 1: uniform random rejection. The expected z among accepted candidates
# equals z_offered; only the realised headcount is scaled by the acceptance rate.
offer_rejection_adjustment(
expected_z_offered = z_offered,
mode = "uniform",
acceptance_rate = .70,
n_offered = 100
)
#> <psu_offer_rejection>
#> expected_z_offered: 1.39981
#> expected_z_accepted: 1.39981
#> acceptance_rate: 0.7
#> effective_validity_loss: 0
#> expected_n_accepted: 70
# Case 2: correlated rejection. Top candidates are more likely to decline,
# captured by a negative quality-acceptance correlation.
offer_rejection_adjustment(
expected_z_offered = z_offered,
mode = "correlated",
acceptance_rate = .70,
rho_quality_acceptance = -0.20,
n_offered = 100
)
#> <psu_offer_rejection>
#> expected_z_offered: 1.39981
#> expected_z_accepted: 1.30047
#> acceptance_rate: 0.7
#> effective_validity_loss: 0.0993407
#> expected_n_accepted: 70
# Case 3: selective rejection. Explicit logit link with a strongly negative
# slope, representing the case Murphy emphasises in which the very top candidates
# almost always decline.
offer_rejection_adjustment(
expected_z_offered = z_offered,
mode = "selective",
acceptance_rate = .70,
logit_intercept = qlogis(.70),
logit_slope = -1.0,
n_offered = 100
)
#> <psu_offer_rejection>
#> expected_z_offered: 1.39981
#> expected_z_accepted: -0.277359
#> acceptance_rate: 0.668971
#> effective_validity_loss: 1.67717
#> expected_n_accepted: 66.8971
The qualitative pattern is the one Murphy (1986) emphasised. Under uniform rejection the realised mean predictor score among hires equals the inverse-Mills mean among the offered group, and only the headcount is scaled by the acceptance rate. Under correlated rejection the realised mean is materially lower. Under selective rejection at the top the realised mean can be lower still, sometimes by a magnitude that erodes a non-trivial fraction of the gross utility. Sturman (2001) used the correlated mode as the default in his comprehensive model, and the package follows this convention as the recommended starting point; the correlated example above uses a quality-acceptance correlation of -.20 and an acceptance rate of .70.
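The uniform and correlated cases can be sketched in base R. The correlated case below assumes a latent normal "willingness to accept" variable that correlates rho with the predictor score, and shifts the offered mean linearly by rho times the inverse-Mills mean of the accepting group. Whether the package uses this expression internally is an assumption; with these inputs it agrees with the printed output above.

```r
# Inputs taken from the examples above
sr       <- .20    # selection ratio among applicants
p_accept <- .70    # acceptance rate
rho      <- -.20   # quality-acceptance correlation

# Mean standardised predictor score among candidates who receive an offer
z_offered <- dnorm(qnorm(1 - sr)) / sr

# Case 1, uniform rejection: acceptance is unrelated to z, so the mean is unchanged
z_accepted_uniform <- z_offered

# Case 2, correlated rejection (assumed latent-normal acceptance model):
# the mean of the latent acceptance variable among accepters is the
# inverse-Mills ratio, and rho carries that shift over to z
lambda_accept         <- dnorm(qnorm(p_accept)) / p_accept
z_accepted_correlated <- z_offered + rho * lambda_accept

c(offered = z_offered, uniform = z_accepted_uniform,
  correlated = z_accepted_correlated)
```

The loss under correlated rejection is `-rho * lambda_accept`, so it grows both with a stronger negative correlation and with a lower acceptance rate.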
Holling (1998): normality, outliers, and SDy
Convention: illustrative reproduction of the diagnostic logic of Holling (1998); the simulated data used here are not those of the original German sales-force study.
Holling (1998) demonstrated empirically that the assumption of normality in the criterion distribution (on which the Brogden-Cronbach-Gleser model rests) is systematically violated in objective performance data. Using a sample of German sales agents whose criterion was direct sales revenue, he tested the normality assumption with the Kolmogorov-Smirnov test and showed that normality was sustainable only after excluding four outliers. The substantive consequence was material: including the outliers raised the SDy estimate and, because utility is proportional to SDy, the estimated utility as well.
The package supports this diagnostic through
utility_regression_diagnostics(). We illustrate with
simulated sales data that exhibit the same right-skewed structure as the
Holling (1998) sample.
set.seed(2024)
# Simulate a moderately skewed criterion: lognormal with a few extreme outliers
n <- 200
y_normal_part <- rlnorm(n, meanlog = 11.0, sdlog = 0.30)
y_outliers_idx <- sample.int(n, 4)
y_normal_part[y_outliers_idx] <- y_normal_part[y_outliers_idx] * 3.5
y <- y_normal_part
x <- .50 * scale(log(y))[, 1] + rnorm(n, 0, sqrt(1 - .25))
sdy_with_outliers <- sd(y)
sdy_without_outliers <- sd(y[-y_outliers_idx])
c(with_outliers = sdy_with_outliers,
without_outliers = sdy_without_outliers,
ratio = sdy_with_outliers / sdy_without_outliers)
#> with_outliers without_outliers ratio
#> 29864.508229 19225.020752 1.553419
The ratio between the two estimates is of the same order as the ratio reported by Holling (1998). The diagnostic function reports the linearity and normality of the predictor-criterion relationship.
utility_regression_diagnostics(x = x, y = y)
#> $n
#> [1] 200
#>
#> $validity
#> [1] 0.4720179
#>
#> $sdy
#> [1] 29864.51
#>
#> $slope
#> [1] 14635.77
#>
#> $intercept
#> [1] 66292.6
#>
#> $mean_residual
#> [1] 1.205223e-12
#>
#> $residual_sd
#> [1] 26328.22
#>
#> $shapiro_y
#>
#> Shapiro-Wilk normality test
#>
#> data: z
#> W = 0.72316, p-value < 2.2e-16
#>
#>
#> $shapiro_residuals
#>
#> Shapiro-Wilk normality test
#>
#> data: z
#> W = 0.78577, p-value = 7.851e-16
#>
#>
#> $model
#>
#> Call:
#> stats::lm(formula = y ~ x)
#>
#> Coefficients:
#> (Intercept) x
#> 66293 14636
The substantive lesson, formalised by Holling (1998), is that a single SDy estimate is not a sufficient summary when the criterion distribution is non-normal. The analyst should report both the SDy estimate including outliers and the estimate with outliers excluded, and the utility calculation should be sensitivity-tested across this range. The lognormal alternative is increasingly common in modern empirical work and is an explicit option in the Bayesian re-formulations of utility analysis.
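A sensitivity test across the two SDy estimates can be sketched as follows. The SDy values and the validity are those printed above; the selection ratio and the number selected are hypothetical values chosen only for illustration, and the validity is held fixed across both scenarios, which is a simplification.

```r
# SDy estimates printed above (simulated data, not Holling's sample)
sdy_range <- c(without_outliers = 19225.02, with_outliers = 29864.51)

validity        <- 0.472   # validity printed by the diagnostic above (rounded)
selection_ratio <- .20     # hypothetical
n_selected      <- 100     # hypothetical

# Mean standardised predictor score of those selected
mean_z <- dnorm(qnorm(1 - selection_ratio)) / selection_ratio

# One-year gross BCG utility under each SDy estimate
gross_utility <- n_selected * validity * sdy_range * mean_z
gross_utility

# Gross utility is linear in SDy, so the SDy ratio carries through unchanged
utility_ratio <- gross_utility[["with_outliers"]] /
  gross_utility[["without_outliers"]]
utility_ratio
```

Reporting both ends of this range makes the dependence of the utility estimate on the outlier decision explicit.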
Sturman (2000, 2001): the comprehensive cascade
Convention: illustrative cascade demonstrating the pattern documented by Sturman (2000, 2001). The published ratio of comprehensive to naive utility uses parameters specific to Sturman’s empirical case; the cascade here uses plausible parameters and reproduces the pattern, not the digit-for-digit ratio.
Sturman’s central pedagogical message is that successive realistic
adjustments shrink the naive Brogden-Cronbach-Gleser estimate by
approximately one order of magnitude. The package supports both a
step-by-step cascade and a single integrated function
(sturman_comprehensive()).
Step-by-step cascade
focal_validity <- .35
baseline_validity <- .20
selection_ratio <- .20
sdy <- 50000
n_year_one <- 100
tenure_years <- 5
fixed_cost <- 75000
Step 1. Naive Brogden-Cronbach-Gleser against random-selection baseline.
naive <- bcg_utility(
validity = focal_validity,
selection_ratio = selection_ratio,
sdy = sdy,
n_selected = n_year_one,
tenure = tenure_years,
cost = fixed_cost,
baseline_validity = 0
)
naive$net_utility
#> [1] 12173334
Step 2. Add the operating baseline correction (Sturman, 2001).
with_baseline <- bcg_utility(
validity = focal_validity,
selection_ratio = selection_ratio,
sdy = sdy,
n_selected = n_year_one,
tenure = tenure_years,
cost = fixed_cost,
baseline_validity = baseline_validity
)
with_baseline$net_utility
#> [1] 5174286
The naive estimate falls by more than one half (to 42.5% of its original value) once the comparator is the operating system rather than random selection, consistent in direction with the average reduction reported by Sturman (2000).
Step 3. Replace the focal validity with a restricted canonical validity.
If the criterion is a composite of, say, task and contextual performance with fixed weights, the predictor side should be reweighted optimally given the fixed criterion side. The resulting restricted canonical validity is generally smaller than the largest single-criterion validity when the predictor battery is poorly aligned with the criterion composite weights. It can be larger when the alignment is favourable; the direction is determined by the alignment, not by mechanism.
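The reweighting described above can be sketched in base R as the multiple correlation between the predictor battery and the fixed-weight criterion composite. The sketch assumes that rows of S12 index predictors and columns index criteria; it is not claimed to be the package's internal implementation, although with the matrices used in this vignette it returns the same validity.

```r
# Correlation matrices used in this vignette
S11 <- matrix(c(1, .30, .30, 1), 2, 2)                    # predictors
S12 <- matrix(c(.30, .10, .15, .25), 2, 2, byrow = TRUE)  # predictors x criteria
S22 <- matrix(c(1, .40, .40, 1), 2, 2)                    # criteria
b   <- c(.7, .3)                                          # fixed criterion weights

# Covariance of each predictor with the weighted criterion composite
cov_xc <- S12 %*% b

# Variance of the criterion composite
var_c <- drop(t(b) %*% S22 %*% b)

# Regression weights that are optimal for this fixed composite
w <- solve(S11, cov_xc)

# Multiple correlation of the optimally weighted battery with the composite
rcv_sketch <- sqrt(drop(t(cov_xc) %*% w) / var_c)
rcv_sketch
```

Only the predictor side is optimised here; the criterion weights stay at the values the analyst fixed, which is what distinguishes this quantity from an unrestricted canonical correlation.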
S11 <- matrix(c(1, .30, .30, 1), 2, 2)
S12 <- matrix(c(.30, .10, .15, .25), 2, 2, byrow = TRUE)
S22 <- matrix(c(1, .40, .40, 1), 2, 2)
b <- c(.7, .3)
rcv <- restricted_canonical_validity(S11, S12, S22, criterion_weights = b)
rcv$validity
#> [1] 0.3068243
with_multidim <- bcg_utility(
validity = rcv$validity,
baseline_validity = baseline_validity,
selection_ratio = selection_ratio,
sdy = sdy,
n_selected = n_year_one,
tenure = tenure_years,
cost = fixed_cost
)
with_multidim$net_utility
#> [1] 3663342
Step 4. Add multi-period employee flows, taxes, and discount.
We model 15% attrition: 15 of the 100 active employees leave each year and are replaced by 15 new hires, so the active headcount stays at 100. The active headcount per period is computed with employee_flow().
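The flow and discounting arithmetic of this step can be sketched in base R. The sketch makes three assumptions that are not confirmed against the package source: the active headcount is the running sum of hires minus losses, benefits are taxed while costs are entered untaxed, and cash flows are discounted from the end of each period. Under these assumptions it agrees, to within rounding, with the net present value of 2154139 that the package reports for this step.

```r
hires  <- c(100, 15, 15, 15, 15)
losses <- c(0, 15, 15, 15, 15)

# Active headcount per period (assumed equivalent of employee_flow())
active_n <- cumsum(hires - losses)

# Validity gain over the operating baseline (RCV value printed in step 3)
validity_gain <- 0.3068243 - 0.20
mean_z        <- dnorm(qnorm(1 - .20)) / .20   # mean z of those selected
sdy           <- 50000
tax_rate      <- .25
discount_rate <- .08
cost_by_period <- c(75000, 5000, 5000, 5000, 5000)

periods <- seq_along(active_n)

# After-tax benefit per period; costs are left untaxed (assumption)
benefit <- active_n * validity_gain * mean_z * sdy * (1 - tax_rate)

# Discount each period's net cash flow back to time zero
npv_sketch <- sum((benefit - cost_by_period) / (1 + discount_rate)^periods)
npv_sketch
```

Applying the tax rate to costs as well, as some formulations of the Boudreau model do, would raise the figure slightly.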
hires <- c(n_year_one, 15, 15, 15, 15)
losses <- c(0, 15, 15, 15, 15)
active_n <- employee_flow(hires, losses)
active_n
#> [1] 100 100 100 100 100
with_flows <- boudreau_utility(
validity = rcv$validity,
baseline_validity = baseline_validity,
selection_ratio = selection_ratio,
sdy = sdy,
n_by_period = active_n,
variable_value = 0,
tax_rate = .25,
discount_rate = .08,
cost_by_period = c(fixed_cost, 5000, 5000, 5000, 5000)
)
with_flows$net_present_value
#> [1] 2154139
Step 5. Add probationary survivor effect (De Corte, 1994).
A formal probation rule that drops year-1 hires whose standardised criterion performance falls below z = -1 produces an additional expected criterion gain in years 2 through 5 via the inverse-Mills survivor mean.
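The inverse-Mills survivor mean can be written directly in base R. The sketch assumes a standard-normal criterion truncated from below at the probation cutoff; that this is what probation_adjustment() computes is an assumption, although it is consistent with the net present values printed in this section.

```r
probation_z <- -1

# Mean of a standard normal truncated from below at the cutoff:
# density at the cutoff divided by the proportion who survive
survivor_gain_sketch <- dnorm(probation_z) / (1 - pnorm(probation_z))
survivor_gain_sketch

# Extra after-tax, discounted benefit in years 2 to 5 for 100 active employees
sdy           <- 50000
tax_rate      <- .25
discount_rate <- .08
later_periods <- 2:5
extra_npv_sketch <- sum(
  survivor_gain_sketch * sdy * 100 * (1 - tax_rate) /
    (1 + discount_rate)^later_periods
)
extra_npv_sketch
```

The gain is applied in full criterion standard-deviation units, which is why this single adjustment is large relative to the validity-based steps that precede it.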
probation_z <- -1
survivor_gain <- probation_adjustment(probation_z)
discount_rate <- .08
periods <- seq_along(active_n)
later_periods <- periods[-1]
benefit_t <- survivor_gain * sdy * active_n[later_periods] * (1 - .25)
discounted <- benefit_t / (1 + discount_rate)^later_periods
extra_npv <- sum(discounted)
with_probation_npv <- with_flows$net_present_value + extra_npv
with_probation_npv
#> [1] 5461665
The cumulative cascade
cascade <- data.frame(
step = c("1. Naive BCG (random baseline)",
"2. + operating baseline",
"3. + multidim. criterion (RCV)",
"4. + flows + tax + discount",
"5. + probation (full comprehensive)"),
net_utility = c(naive$net_utility,
with_baseline$net_utility,
with_multidim$net_utility,
with_flows$net_present_value,
with_probation_npv)
)
cascade$pct_of_naive <- round(100 * cascade$net_utility / naive$net_utility, 1)
cascade
#> step net_utility pct_of_naive
#> 1 1. Naive BCG (random baseline) 12173334 100.0
#> 2 2. + operating baseline 5174286 42.5
#> 3 3. + multidim. criterion (RCV) 3663342 30.1
#> 4 4. + flows + tax + discount 2154139 17.7
#> 5 5. + probation (full comprehensive) 5461665 44.9
The cumulative pattern is the one Sturman (2000, 2001) emphasises: after the baseline, criterion, and economic adjustments (steps 2 to 4) the estimate has fallen to 17.7% of the naive figure. Not every adjustment shrinks the figure, however. The probation adjustment in step 5 raises the expected utility to 44.9% of the naive figure because surviving employees in years 2 through 5 have a higher expected criterion score than the year-1 cohort (De Corte, 1994). The shrinkage claim of Sturman (2000, 2001) refers to the net effect of the operating-baseline correction (which always shrinks), the criterion-composite reweighting (direction depends on alignment), and the multi-period economic discount and attrition (which always shrinks), partially offset by the probation gain.
The published ratio of comprehensive to naive utility in Sturman (2001) uses parameters specific to his empirical case, including SDy values, validity matrices, and tenure assumptions that the published article summarises but does not fully tabulate. The vignette's specific percentage will differ depending on inputs. The qualitative cascade (the comprehensive estimate falling to a fraction of the naive) is robust across plausible parameter choices.
Single integrated call
For routine analysis the package provides
sturman_comprehensive(), which composes all six adjustments
at once and returns both the integrated estimate and the cascade table.
This is the recommended interface for production use; the step-by-step
cascade above is pedagogical.
S11 <- matrix(c(1, .30, .30, 1), 2, 2)
S12 <- matrix(c(.30, .10, .15, .25), 2, 2, byrow = TRUE)
S22 <- matrix(c(1, .40, .40, 1), 2, 2)
s <- sturman_comprehensive(
validity = .35,
baseline_validity = .20,
selection_ratio = .20,
sdy = 50000,
n_year_one = 100,
tenure = 5,
fixed_cost = 75000,
hires_per_period = c(100, 15, 15, 15, 15),
losses_per_period = c(0, 15, 15, 15, 15),
tax_rate = .25,
discount_rate = .08,
predictor_cor = S11,
predictor_criterion_cor = S12,
criterion_cor = S22,
criterion_weights = c(.7, .3),
probation_cutoff_z = -1,
acceptance_rate = 0.70,
quality_acceptance_correlation = -0.20
)
s
#> <psu_sturman: Sturman (2001) comprehensive utility>
#> Comprehensive net utility: 3759820
#> Effective validity: 0.3068 (baseline: 0.2 )
#>
#> Cascade:
#> step net_utility pct_of_naive
#> 1. Naive BCG (random baseline) 12173334 100.00000
#> 2. + operating baseline 5174286 42.50509
#> 3. + multidim. criterion (RCV) 3663342 30.09317
#> 4. + flows + tax + discount 2169473 17.82152
#> 5. + probation 5476999 44.99178
#> 6. + offer rejection 3759816 30.88567
The cascade table is in s$cascade, the effective validity after restricted canonical reweighting in s$effective_validity, and the active headcount per period in s$n_active_by_period.
Ock and Oswald (2018): compensatory versus multiple-hurdle
Convention: illustrative reproduction of the qualitative pattern in Table 1 of Ock and Oswald (2018). Inputs are loosely consistent with Roth, Switzer, Van Iddekinge, and Oh (2011); a digit-for-digit match would require the specific simulation parameters used by Ock and Oswald, which are summarised but not fully tabulated in the published article.
Ock and Oswald (2018) showed that under compensatory top-down selection the expected performance of selected applicants is substantially higher than under conjunctive multiple-hurdle selection at the same overall selection ratio. The gap is largest at low selection ratios and shrinks as the selection ratio rises; they reported representative values at three selection ratios. The pattern can be reproduced with the package using a four-predictor system informed by Roth et al. (2011) meta-analytic correlations. The code below assumes that Rxx (the predictor intercorrelation matrix) and validities (the vector of four predictor validities) have already been defined from those correlations.
Run the comparison at three selection ratios
selection_ratios <- c(.10, .20, .40)
results <- lapply(selection_ratios, function(sr) {
marginal_sr <- (sr)^(1 / 4)
compare_selection_systems(
predictor_cor = Rxx,
validities = validities,
compensatory_weights = validities,
compensatory_selection_ratio = sr,
hurdle_selection_ratios = rep(marginal_sr, 4),
n_sim = 50000,
seed = 42
)
})
names(results) <- paste0("SR=", selection_ratios)
Tabulate
ock_oswald <- data.frame(
SR = selection_ratios,
compensatory_z = vapply(results,
function(o) o$compensatory$expected_criterion_z,
numeric(1)),
hurdle_z = vapply(results,
function(o) o$multiple_hurdle$expected_criterion_z,
numeric(1)),
hurdle_joint_sr = vapply(results,
function(o) o$multiple_hurdle$joint_selection_ratio,
numeric(1))
)
ock_oswald$z_difference <- ock_oswald$compensatory_z - ock_oswald$hurdle_z
ock_oswald
#> SR compensatory_z hurdle_z hurdle_joint_sr z_difference
#> SR=0.1 0.1 1.0502182 0.7506470 0.16350 0.2995711
#> SR=0.2 0.2 0.8376749 0.5991892 0.27456 0.2384857
#> SR=0.4 0.4 0.5779883 0.4165023 0.46350 0.1614860
The pattern matches Ock and Oswald (2018):
- The compensatory column is uniformly higher than the multiple-hurdle column.
- The gap shrinks as the selection ratio rises.
- The realised joint selection ratio for the multiple-hurdle system is somewhat above the target (for example .16 against a target of .10) because the marginal hurdle ratios are set as if the predictors were independent, whereas the predictors are positively intercorrelated.
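The behaviour of the joint selection ratio can be illustrated with a small base-R simulation. The equicorrelated predictor matrix with a correlation of .30 is hypothetical and stands in for the Roth et al. (2011) values, which are not reproduced here; only the direction of the discrepancy is of interest.

```r
set.seed(42)

k     <- 4        # number of hurdles
rho   <- .30      # hypothetical predictor intercorrelation
n_sim <- 50000

# Hypothetical equicorrelated predictor correlation matrix
R <- matrix(rho, k, k)
diag(R) <- 1

# Draw correlated standard-normal predictor scores: rows ~ N(0, R)
X <- matrix(rnorm(n_sim * k), n_sim, k) %*% chol(R)

# Marginal hurdle ratios chosen as if the predictors were independent
target_sr   <- .10
marginal_sr <- target_sr^(1 / k)
cutoff      <- qnorm(1 - marginal_sr)

# Proportion of simulated applicants who clear every hurdle
joint_sr <- mean(rowSums(X > cutoff) == k)
c(target = target_sr, realised = joint_sr)
```

With positively correlated predictors, applicants who pass one hurdle are more likely to pass the others, so the realised joint ratio exceeds the product of the marginal ratios. Matching a target joint ratio exactly would require tightening the marginal cutoffs.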
The Thomas-Owen-Gunst (1977) framework, illustrated digit-for-digit
in the Taylor-Russell vignette, provides the analytic complement to this
simulation: as discussed there, given the matrix R and a
vector of cutoffs, tr_multivariate() returns the joint
selection ratio, the PPV, and the four cells of the classification table
without simulation error. The simulation-based comparison here adds the
continuous-criterion expected gain, which is the metric Ock and Oswald
(2018) emphasise; the analytic Thomas-Owen-Gunst result adds the
dichotomised-criterion classificatory metric.
Adding the cost dimension
Ock and Oswald’s central practical message is that multiple-hurdle can nevertheless be optimal when its operational cost is sufficiently lower than compensatory’s. The staged design with three stages of increasing expense illustrates the trade-off.
n_apps <- 1000
stage_design <- compare_selection_systems_staged(
predictor_cor = Rxx,
validities = validities,
compensatory_weights = validities,
compensatory_selection_ratio = .20,
stage_predictors = list(1, c(2, 3), 4),
stage_selection_ratios = c(.50, .60, .70),
n_sim = 50000,
seed = 42,
n_applicants = n_apps,
compensatory_cost_per_applicant = 800,
hurdle_cost_per_stage = c(100, 400, 600),
sdy = 50000
)
stage_design$net_utility_difference
#> [1] 354335.5
If the difference is positive the compensatory system still wins; if negative the staged design wins on net utility despite producing lower expected per-hire performance. This is precisely the trade-off that Ock and Oswald (2018) formalise.
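The cost side of this trade-off can be sketched with simple arithmetic. The sketch assumes that each stage is paid for only by the applicants who reach it and that applicants pass each stage at exactly the nominal stage selection ratio; whether compare_selection_systems_staged() costs the stages in this way is an assumption.

```r
n_apps     <- 1000
stage_sr   <- c(.50, .60, .70)    # pass rates at stages 1 to 3
stage_cost <- c(100, 400, 600)    # cost per applicant assessed at each stage

# Expected number of applicants reaching each stage
reach <- n_apps * cumprod(c(1, head(stage_sr, -1)))

# Total testing cost of the staged design
hurdle_cost <- sum(reach * stage_cost)

# Compensatory design: every applicant takes the full battery
compensatory_cost <- n_apps * 800

c(hurdle = hurdle_cost, compensatory = compensatory_cost,
  saving = compensatory_cost - hurdle_cost)
```

The staged design wins on net utility only when this saving exceeds the monetary value of the performance it gives up relative to compensatory selection.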
How to proceed in applied work
- State the convention applicable to your reproduction: digit-for-digit when the published inputs are sufficient, illustrative otherwise. Mixing the two without explicit labelling produces unfounded claims of replication.
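- For the Schmidt et al. (1979) PAT calculation, use the published 1979 inputs (validity .76, SDy $10,413, 618 programmers selected per year, mean tenure 9.69 years); at a selection ratio of .30 the package returns the commonly cited figures of approximately $5.6 million one-year and $54 million multi-year utility.
- For Murphy (1986), the correlated-rejection mode (used above with a quality-acceptance correlation of -.20 and an acceptance rate of .70) is the operational default supported by Sturman (2001).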
- For Holling (1998), report both the SDy estimate including outliers and the estimate with outliers excluded; the diagnostic function utility_regression_diagnostics() provides the linearity and normality checks that anchor this decision.
- For the Sturman cascade, use either the step-by-step approach (pedagogically transparent) or sturman_comprehensive() (operationally efficient); the cascade table is the auditable output.
- For Ock and Oswald (2018), report the comparison across at least three selection ratios; a single comparison conceals the rate at which the gap closes as selectivity decreases.
- When the multiple-hurdle case can be modelled analytically, prefer tr_multivariate() over Monte Carlo simulation for the dichotomised-criterion metric; the analytic result has no simulation error and is materially faster.
References
De Corte, W. (1994). Utility analysis for the one-cohort selection-retention decision with a probationary period. Journal of Applied Psychology, 79, 402–411.
Hogarth, R. M., & Einhorn, H. J. (1976). Optimal strategies for personnel selection when candidates can reject job offers. Journal of Business, 49, 479–495.
Holling, H. (1998). Utility analysis of personnel selection: An overview and empirical study based on objective performance measures. Methods of Psychological Research Online, 3(1), 5–24.
Murphy, K. R. (1986). When your top choice turns you down: Effect of rejected offers on the utility of selection tests. Psychological Bulletin, 99, 133–138.
Ock, J., & Oswald, F. L. (2018). The utility of personnel selection decisions: Comparing compensatory and multiple-hurdle selection models. Journal of Personnel Psychology, 17(4), 172–182.
Roth, P. L., Switzer, F. S., Van Iddekinge, C. H., & Oh, I. S. (2011). Toward better meta-analytic matrices: How input values can affect research conclusions in human resource management simulations. Personnel Psychology, 64, 899–935.
Schmidt, F. L., Hunter, J. E., McKenzie, R. C., & Muldrow, T. W. (1979). Impact of valid selection procedures on work-force productivity. Journal of Applied Psychology, 64, 609–626.
Sturman, M. C. (2000). Implications of utility analysis adjustments for estimates of human resource intervention value. Journal of Management, 26, 281–299.
Sturman, M. C. (2001). Utility analysis for multiple selection devices and multiple outcomes. Journal of Human Resource Costing and Accounting, 6(2), 9–28.
Thomas, J. G., Owen, D. B., & Gunst, R. F. (1977). Improving the use of educational tests as selection tools. Journal of Educational Statistics, 2(1), 55–77.