Explanatory Model 1

Cost Drivers Analysis

Multiple Regression Dominance Analysis

The foundational question: what predicts instructional cost per FTE student after controlling for institutional characteristics? This model identifies which factors your institution can influence and which are structural constraints.

The Question It Answers

Is your instructional cost high because of factors you control (faculty mix, program composition) or factors you don't (state cost of living, Carnegie classification)? The regression residual tells you exactly how much your actual spending deviates from what the model predicts for an institution with your profile.

Model Specification

```
Cost_per_FTE = β0
             + β1(FTE Enrollment)           ← scale effect
             + β2(% Graduate Students)      ← program mix
             + β3(% Tenured/Tenure-Track)   ← faculty investment
             + β4(% Full-Time Faculty)      ← faculty structure
             + β5(Research Exp per T/TT)    ← research intensity
             + β6(Carnegie Classification)  ← institutional type
             + β7(Public/Private Control)   ← funding model
             + β8(Urbanization)             ← cost of living proxy
             + ε
```

What You Get

| Deliverable | Description |
|---|---|
| Coefficient table | Which factors significantly predict cost, with direction and magnitude. A one-unit increase in % T/TT faculty predicts a $X increase in cost/FTE. |
| Partial R² decomposition | How much variance each predictor explains, using Shapley values (dominance analysis). Answers: "Faculty mix explains 28% of cost variation; research intensity explains 19%." |
| Residual analysis | Your residual = actual cost minus predicted cost. Positive = spending more than expected. Negative = running lean. This is your efficiency signal. |
| Predicted vs. actual plot | Visual showing where your institution falls relative to the regression line among all peers. |
| Scenario analysis | "If you shifted faculty mix by X%, your predicted cost would change by $Y." Actionable what-if modeling grounded in the regression coefficients. |
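The residual and scenario deliverables reduce to simple arithmetic on the fitted coefficients. A minimal sketch in Python — every coefficient and dollar figure here is a hypothetical placeholder, not a fitted result:

```python
# Illustrative only: coefficients and values are hypothetical, not fitted estimates.
def predicted_cost(beta0, betas, features):
    """Linear prediction: beta0 + sum of coefficient * feature."""
    return beta0 + sum(b * x for b, x in zip(betas, features))

betas = [120.0, 45.0]  # hypothetical $ per point of % T/TT and % graduate
pred = predicted_cost(4000.0, betas, [60.0, 20.0])  # predicted cost/FTE

residual = 12850.0 - pred        # actual minus predicted: positive = above expected
scenario_delta = betas[0] * 5.0  # shifting % T/TT by 5 points changes predicted cost by this
```

The same arithmetic underlies the "what-if" scenarios: a coefficient times a proposed change in the predictor gives the predicted cost movement.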

Assumptions We Test

  • Linearity (partial regression plots)
  • Homoscedasticity (Breusch-Pagan test)
  • Multicollinearity (VIF < 5 for all predictors)
  • Normality of residuals (Q-Q plot, Shapiro-Wilk)
  • Influential observations (Cook's distance — feeds into Outlier Detection model)

Key insight: The original Delaware Cost Study reported only descriptive benchmarks (means, quartiles). Our regression model goes further by answering why costs differ — separating controllable factors from structural ones.

R Implementation

```r
library(car)          # VIF, diagnostic plots
library(relaimpo)     # Dominance analysis (Shapley values)
library(performance)  # Model diagnostics
library(ggplot2)      # Visualization

model <- lm(cost_per_fte ~ enrollment + pct_grad + pct_tt + pct_ft +
              research_per_tt + carnegie + control + urbanization,
            data = ipeds)

# Dominance analysis
calc.relimp(model, type = "lmg", rela = TRUE)
```
Explanatory Model 2

State & Sector Context

Multilevel Modeling (HLM) Random Effects

Institutions exist within states that have fundamentally different funding models, cost of living, and regulatory environments. A standard regression treats every institution as independent — but a public university in Mississippi and one in California operate in very different cost contexts. Multilevel modeling accounts for this nesting structure.

The Question It Answers

How much of your cost is driven by state-level factors (things you can't control) versus institution-level decisions (things you can)? And does the relationship between, say, faculty mix and cost differ depending on what state you're in?

Model Structure

```
Level 1 (Institution i within State j):
  Cost_per_FTE_ij = β0j + β1(% T/TT)_ij + β2(Research Intensity)_ij
                    + β3(Enrollment)_ij + r_ij

Level 2 (State j):
  β0j = γ00 + γ01(State Appropriation per FTE)_j
            + γ02(Median Faculty Salary in State)_j + u0j
```

What You Get

| Deliverable | Description |
|---|---|
| Intraclass Correlation (ICC) | The percentage of cost variation that is between states vs. within states. If ICC = 0.25, then 25% of cost differences are explained by which state you're in. |
| State random effects | A ranked list of all 50 states showing their cost premium or discount after controlling for institutional characteristics. "Being in California adds $1,200 to predicted cost/FTE." |
| Cross-level interactions | Does the effect of faculty mix on cost differ by state funding level? In high-appropriation states, adding T/TT faculty might cost less because state subsidies offset it. |
| Contextual effects | Separating within-state effects from between-state effects of the same predictor. The meaning of "high % T/TT" may differ depending on whether you're comparing within your state or across states. |
| Caterpillar plot | Visual ranking of all state random intercepts with 95% confidence intervals — instantly shows which states are significantly above or below average. |
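The ICC itself has a simple variance-components form: between-group variance over total variance. A naive moment-based sketch in Python (the actual deliverable uses the model-based estimate from `performance::icc`); the grouping structure here is hypothetical:

```python
from statistics import mean, pvariance

def naive_icc(groups):
    """Approximate ICC: variance of group means over total variance.

    groups: dict mapping state -> list of institution costs.
    This is a crude moment estimator, not the REML-based ICC from lmer.
    """
    group_means = [mean(v) for v in groups.values()]
    within = mean(pvariance(v) for v in groups.values())   # avg within-state variance
    between = pvariance(group_means)                        # variance across state means
    return between / (between + within)
```

When all variation is between states the estimate approaches 1; when states are indistinguishable it approaches 0.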

Why this matters: If your cost is high but your state random effect is also high, the "problem" may be geographic, not institutional. This reframes the conversation from "we're inefficient" to "we're operating in an expensive state, and here's how we compare to peers in the same context."

R Implementation

```r
library(lme4)         # Multilevel models
library(lmerTest)     # Significance tests for MLM
library(sjPlot)       # Caterpillar plots, coefficient tables
library(performance)  # ICC, model comparison

model <- lmer(cost_per_fte ~ pct_tt + research_per_tt + enrollment +
                state_approp_per_fte + median_salary_state +
                (1 | state),
              data = ipeds)

icc(model)    # Intraclass correlation
ranef(model)  # State-level random effects
```
Explanatory Model 3

Efficiency Frontier

Data Envelopment Analysis Malmquist Index

Data Envelopment Analysis (DEA) is a non-parametric method that constructs an empirical efficiency frontier from the data. The original Delaware Cost Study cited DEA in multiple conference presentations (AIR 2018, ACE 2019) as the next-generation analytical method — but never implemented it. We build what they envisioned.

The Question It Answers

Given your inputs (expenditures, faculty), are you producing the maximum possible outputs (students served, degrees awarded)? If not, how far are you from the frontier, and which specific efficient institutions should you benchmark against?

How DEA Works

```
Outputs (Students, Degrees)
  ↑
  |           * A (efficient)
  |          /
  |         /
  |        * B (efficient)
  |       /
  |      /      * C (inefficient, score = 0.78)
  |     /
  |    * D (efficient)
  |   /
  |  /
  | /
  |/___________________________→ Inputs (Expenditures, Faculty)
```

The frontier connects A, B, D. C is 22% below the frontier. C's reference peers are B and D.
| Input/Output | Variables |
|---|---|
| Inputs | Instructional expenditures, FTE instructional staff, % T/TT faculty |
| Outputs | FTE students served, total degrees awarded, research expenditures (for R1s) |

DEA Variants We Run

| Variant | Purpose |
|---|---|
| CRS (Constant Returns to Scale) | Overall technical efficiency — are you on the frontier regardless of size? |
| VRS (Variable Returns to Scale) | Scale-adjusted efficiency — are you efficient for your size? Small colleges aren't penalized for not operating at R1 scale. |
| Scale Efficiency | CRS score / VRS score — are you operating at optimal scale? If not, should you grow or shrink? |
| Super-efficiency | Ranks institutions on the frontier against each other. Useful when multiple institutions score 1.0. |
| Malmquist Productivity Index | Tracks efficiency change over time, decomposed into: (a) efficiency change (catching up to the frontier) and (b) technical change (the frontier itself shifting). |
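For intuition, CRS efficiency in the one-input, one-output case reduces to a best-ratio comparison, and scale efficiency is just the CRS/VRS quotient. An illustrative Python sketch — real runs use the multi-input, multi-output linear-programming formulation in the R packages, and all numbers here are hypothetical:

```python
def crs_efficiency(inputs, outputs):
    """Single-input, single-output CRS DEA: score each unit against
    the best observed output-per-input ratio. 1.0 = on the frontier."""
    ratios = [y / x for x, y in zip(inputs, outputs)]
    best = max(ratios)
    return [round(r / best, 4) for r in ratios]

def scale_efficiency(crs_score, vrs_score):
    """CRS / VRS; values below 1.0 indicate the unit is off its optimal scale."""
    return crs_score / vrs_score
```

For example, a unit producing 18 outputs from 20 inputs scores 0.9 against a peer producing 10 from 10.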

What You Get

  • Efficiency score (0 to 1): 1.0 = on the frontier. 0.82 = you could produce the same outputs with 18% fewer inputs (or 18% more outputs with the same inputs).
  • Reference peer set: The specific efficient institutions that form your benchmark. "Your reference peers are Iowa State, Kansas State, and Missouri — here's what they do differently."
  • Slack analysis: Exactly how much of each input to reduce or output to increase to reach the frontier. "Reduce non-personnel expenditures by $420K while maintaining current enrollment."
  • Frontier visualization: 2D projection showing the frontier curve and your institution's position.
  • Malmquist decomposition: "Your efficiency improved 4% from 2020 to 2024, but the frontier shifted outward 6%, so your relative position actually declined."

Why DEA over regression: Regression estimates an average relationship. DEA identifies the best practice frontier. An institution can be average by regression standards but far from the efficiency frontier. DEA also handles multiple inputs and outputs simultaneously without assuming a linear relationship.

R Implementation

```r
library(Benchmarking)  # Core DEA
library(deaR)          # Alternative with Malmquist
library(productivity)  # Malmquist Productivity Index

# Input-oriented VRS model
dea_result <- dea(X = inputs, Y = outputs, RTS = "vrs", ORIENTATION = "in")

# Malmquist index (requires panel data)
malm <- malmquist(X, Y, id, time, orientation = "input")
```
Explanatory Model 4

True Peer Groups

K-Means Clustering Hierarchical Clustering Mahalanobis Distance

Carnegie classification is the default peer grouping in higher education — but it's a blunt instrument. Two "Doctoral: Very High Research" institutions can have wildly different cost structures, faculty compositions, and enrollment profiles. Research from the original Cost Study's own presentations (NEAIR 2021) demonstrated that data-driven peer groups outperform Carnegie-based ones for benchmarking.

The Question It Answers

Who are your real peers based on your actual instructional cost and productivity profile? And how do you perform within that true peer group?

Method

  • Feature selection: Standardize all 8 core metrics (z-scores) so no single metric dominates due to scale differences
  • Optimal k determination: Run silhouette analysis, gap statistic, and elbow method to find the natural number of clusters — not arbitrarily chosen
  • K-means clustering: Partition institutions into k groups that minimize within-cluster variance
  • Hierarchical clustering: Run as validation — do both methods produce similar groupings? (Ward's method with Euclidean distance)
  • Cluster profiling: Characterize each cluster by its metric signature using radar/spider charts
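The partitioning step above can be sketched language-agnostically. A minimal Lloyd's-algorithm k-means in Python (the production analysis uses R's `kmeans` with `nstart = 25`; points and k here are toy values):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means (Lloyd's algorithm) on standardized feature vectors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Move each center to the mean of its assigned points.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(col) / len(cl) for col in zip(*cl))
    return centers, clusters

centers, clusters = kmeans([(0, 0), (0.1, 0), (5, 5), (5.1, 5)], k=2)
```

On these toy points the two tight pairs separate into two clusters of two; on real data the z-scored metric vectors take the place of the tuples.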

What You Get

| Deliverable | Description |
|---|---|
| Cluster assignment | Your institution is placed in a named, characterized peer group (e.g., "Research-Active, Teaching-Heavy" or "Low-Cost, High-Throughput"). |
| Cluster profiles | Radar charts showing the metric signature of each cluster. Instantly see what makes your group distinctive. |
| Carnegie cross-tabulation | How do data-driven groups differ from Carnegie classification? Some Carnegie classes may split into multiple clusters; others may merge. |
| Within-cluster benchmarks | Your rank on each metric within your data-driven peer group — more meaningful than Carnegie-wide comparisons. |
| 10 nearest neighbors | The institutions most statistically similar to yours across all dimensions (Mahalanobis distance). Your true comparators. |

Example finding: "Oklahoma State is classified as 'Doctoral: Very High Research' by Carnegie, but our cluster analysis places it in a group of 47 institutions we call 'Research-Active, Teaching-Heavy' — high research expenditure combined with high student-to-faculty ratios. This is a distinct cost profile from the typical R1 pattern."

R Implementation

```r
library(factoextra)  # Visualization (silhouette, fviz_cluster)
library(cluster)     # PAM, silhouette
library(NbClust)     # 30 indices for optimal k
library(fpc)         # Cluster validation statistics

# Optimal k
nb <- NbClust(scaled_data, method = "kmeans", min.nc = 2, max.nc = 12)

# K-means
km <- kmeans(scaled_data, centers = optimal_k, nstart = 25)

# Nearest neighbors
mahal_dist <- mahalanobis(scaled_data, center = target_inst, cov = cov(scaled_data))
```
Explanatory Model 5

Distribution Analysis

Quantile Regression Conditional Distributions

Standard regression tells you what predicts average cost. But the factors driving cost for the median institution may be entirely different from those driving cost at the 90th percentile. Quantile regression estimates separate models at different points in the cost distribution.

The Question It Answers

Do the same factors matter equally across the cost spectrum? Is research intensity irrelevant for low-cost institutions but a dominant factor for high-cost ones?

Model

```
Q_τ(Cost_per_FTE | X) = β0(τ) + β1(τ)(% T/TT) + β2(τ)(Research) + ...

where τ ∈ {0.10, 0.25, 0.50, 0.75, 0.90}
```

Each τ produces a different set of coefficients.
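Quantile regression replaces squared-error loss with the asymmetric check (pinball) loss; the τ-th sample quantile is exactly the constant that minimizes it. A Python sketch of the loss in the intercept-only case, with hypothetical cost values:

```python
def pinball_loss(tau, y, q):
    """Check loss: underpredictions weighted by tau, overpredictions by (1 - tau)."""
    return sum((tau if yi >= q else tau - 1) * (yi - q) for yi in y)

y = [6800, 8000, 9200, 10500, 12400]  # hypothetical cost/FTE values

# The candidate minimizing the tau = 0.5 loss is the sample median;
# raising tau shifts the minimizer toward the upper tail.
best_median = min(y, key=lambda q: pinball_loss(0.5, y, q))
best_p90 = min(y, key=lambda q: pinball_loss(0.9, y, q))
```

The full method minimizes this same loss over regression coefficients rather than a single constant, which is why each τ yields its own coefficient vector.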

What You Get

| Deliverable | Description |
|---|---|
| Quantile coefficient plots | Line plots showing how each predictor's coefficient changes across τ = 0.10 to 0.90. Where the line is flat, the effect is constant. Where it rises or falls, the effect depends on where you sit in the cost distribution. |
| Conditional distribution | Given your institution's characteristics, the full predicted distribution of cost — not just a point estimate. "Institutions with your profile range from $6,800 (10th percentile) to $12,400 (90th percentile)." |
| Tail analysis | What drives the most expensive institutions? If the 90th percentile coefficient for research intensity is 3x the median coefficient, research buyouts are disproportionately driving costs at the top. |

Real-world value: A provost asking "why is our cost high?" gets a fundamentally different answer from quantile regression than from OLS. If the OLS model says faculty mix explains 28% of cost variation, the quantile model might show that faculty mix explains 40% of variation at the 90th percentile but only 15% at the 25th percentile.

R Implementation

```r
library(quantreg)  # Core quantile regression
library(ggplot2)   # Coefficient plots

# Estimate at 5 quantiles
taus <- c(0.10, 0.25, 0.50, 0.75, 0.90)
qr_model <- rq(cost_per_fte ~ pct_tt + research_per_tt + enrollment + pct_grad,
               tau = taus, data = ipeds)

summary(qr_model)
plot(qr_model)  # Coefficient plots across quantiles
```
Explanatory Model 6

Anomaly Detection

Mahalanobis Distance Cook's Distance DBSCAN

Before acting on any benchmarking result, you need to know whether your institution is an outlier — and if so, why. We apply four complementary methods because each catches different types of anomalies.

Four Methods, One Answer

| Method | What It Catches | How It Works |
|---|---|---|
| Mahalanobis distance | Multivariate outliers | Measures how far an institution is from the center of the data across all metrics simultaneously, accounting for correlations. An institution can look normal on every single metric but be unusual in the combination. |
| Cook's distance | Influential observations | From the regression model — identifies institutions that disproportionately affect the regression results. Removing this institution would change the coefficients significantly. |
| DBSCAN | Density-based outliers | Finds institutions that don't belong to any natural cluster. Unlike Mahalanobis (which assumes a single center), DBSCAN works with arbitrary shapes and identifies "noise points." |
| IQR fencing | Per-metric outliers | Simple, transparent flagging for the scorecard. Any metric beyond 1.5 × IQR from the median gets flagged. Easy for stakeholders to understand. |
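The IQR fence is simple enough to show inline. A Python sketch of the flagging rule as stated above (distance from the median exceeding 1.5 × IQR); the data are illustrative:

```python
from statistics import median, quantiles

def iqr_flags(values, k=1.5):
    """Flag any value whose distance from the median exceeds k * IQR."""
    q1, _, q3 = quantiles(values, n=4)  # first and third quartiles
    iqr = q3 - q1
    m = median(values)
    return [abs(v - m) > k * iqr for v in values]

flags = iqr_flags([10, 11, 12, 13, 100])  # only the extreme value is flagged
```

This per-metric rule is deliberately transparent; the multivariate methods in the table catch what it cannot.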

What You Get

  • Outlier flag column in every report (per-metric and multivariate)
  • Diagnostic narrative for each flagged institution: which metrics are unusual, how unusual, and what characteristics you share with other outliers
  • Population-level report: "These 15 institutions have cost/FTE > 1.5 IQR above their Carnegie median — here's what they have in common (high research intensity, low enrollment, private control)"
  • Decision guidance: "Your outlier status on X metric is driven by Y — this may be intentional (mission-driven) or a signal to investigate"

Why four methods: No single outlier detection method is complete. Mahalanobis misses non-elliptical clusters. Cook's distance applies only within the regression model. DBSCAN requires choosing an epsilon neighborhood radius. IQR fencing is univariate only. Together, they provide a comprehensive picture.

Strategic Model 1

Causal Pathway Modeling

Structural Equation Modeling Measurement Invariance

This is the most sophisticated model in our toolkit — and it's led by Dr. Jam Khojasteh, an Associate Editor of Structural Equation Modeling: A Multidisciplinary Journal, with 40+ publications in SEM and related methods. SEM doesn't just identify what predicts cost — it maps the full web of causal pathways showing how factors relate to each other in producing cost.

The Question It Answers

How do research intensity, faculty investment, and enrollment profile interact to produce instructional cost? What are the direct effects (research → cost) versus the indirect effects (research → faculty mix → workload → cost)? Does the model work the same way for public and private institutions?

Model Structure

```
              Institutional Resources (Latent)
                           |
               +-----------+-----------+
               |           |           |
               v           v           v
           Research     Faculty    Enrollment
          Intensity   Investment    Profile
           (Latent)    (Latent)     (Latent)
               |           |           |
               |        +--+--+        |
               v        v     v        v
         Instructional        Productivity
              Cost  <--------   (Latent)
            (Latent)
```

Latent Variable Indicators

| Latent Variable | Measured IPEDS Indicators |
|---|---|
| Research Intensity | Research exp/T/TT, % T/TT faculty, Carnegie R-classification |
| Faculty Investment | % T/TT, % full-time, average faculty salary |
| Enrollment Profile | FTE total, % graduate, degrees per 100 FTE |
| Instructional Cost | Cost per FTE, personnel %, instruction % of E&G |
| Productivity | Students per faculty, degrees per faculty |

What You Get

| Deliverable | Description |
|---|---|
| Path diagram | Publication-quality diagram with standardized path coefficients on every arrow. The visual tells the full story. |
| Direct effects | Research intensity → cost: β = 0.34. "A one standard deviation increase in research intensity directly increases cost by 0.34 SD." |
| Indirect effects | Research → faculty mix → cost: β = 0.18. "Research intensity also increases cost indirectly by changing faculty composition." |
| Total effects | Direct + indirect = 0.52. "The total impact of research intensity on cost is larger than either path alone." |
| Model fit indices | CFI ≥ 0.95, RMSEA ≤ 0.06, SRMR ≤ 0.08 — publication-standard reporting. |
| Measurement invariance | Does the model work the same way for public vs. private institutions? We test configural, metric, scalar, and strict invariance. Dr. Khojasteh literally wrote the book on this (Khojasteh & Lo, 2015). |
| Latent scores | Each institution gets estimated scores on each latent construct — your "research intensity score," "productivity score," etc. |

Why SEM over regression: Regression treats all predictors as independent causes. SEM models the relationships among predictors. Research intensity doesn't just predict cost — it changes faculty mix, which changes workload, which changes cost. SEM captures this full causal chain. It also handles measurement error through latent variables, producing less biased estimates.

R Implementation

```r
library(lavaan)    # Core SEM engine
library(semPlot)   # Path diagrams
library(semTools)  # Measurement invariance testing

model_spec <- '
  # Measurement model
  research =~ research_per_tt + pct_tt + carnegie_r
  faculty  =~ pct_tt + pct_ft + avg_salary
  enroll   =~ fte_total + pct_grad + deg_per_100
  cost     =~ cost_per_fte + personnel_pct + instr_pct_eg
  product  =~ students_per_fac + degrees_per_fac

  # Structural model
  cost    ~ research + faculty + product
  product ~ enroll + faculty
  faculty ~ research
'

fit <- sem(model_spec, data = ipeds)
summary(fit, standardized = TRUE, fit.measures = TRUE)

# Invariance testing (public vs. private)
measurementInvariance(model_spec, data = ipeds, group = "control")
```
Strategic Model 2

Cost Trajectory Forecasting

Latent Growth Curves Growth Mixture Models

A trend line shows where cost has been. A growth model tells you what trajectory class your institution belongs to and where it's headed. Based on the longitudinal SEM methods published by Dr. Khojasteh (Marcoulides & Khojasteh, 2018; Whittaker & Khojasteh, 2017).

The Question It Answers

Is your cost trajectory rising, stable, or declining? Are there distinct subpopulations of institutions following different paths? What predicts which path you're on? And does your enrollment trajectory co-evolve with your cost trajectory?

Three Progressively Richer Models

| Model | What It Does |
|---|---|
| Latent Growth Curve (LGC) | Estimates the average trajectory across all institutions (intercept = starting point, slope = rate of change) and individual variation around it. "The average institution's cost grew $312/year, but the standard deviation of slopes is $180 — there's huge variation." |
| Growth Mixture Model (GMM) | Identifies distinct subpopulations following different trajectories. "We identified 3 trajectory classes: 'Rising' (38% of institutions, avg +$520/year), 'Stable' (45%, avg +$80/year), and 'Declining' (17%, avg -$200/year). Your institution belongs to the Rising class." |
| Parallel Process Model | Models cost and enrollment trajectories simultaneously. "Institutions whose enrollment declined by > 5% show cost increases of $800/year — the fixed cost structure doesn't shrink with enrollment." |
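The intuition behind trajectory classes can be sketched with a two-stage shortcut: fit each institution's own linear growth, then assign it to the class with the nearest mean slope. This is a toy approximation in Python — a true GMM estimates classes and trajectories jointly — and the class slopes are the illustrative figures quoted above:

```python
def fit_linear_growth(costs):
    """Least-squares intercept and slope for one institution's yearly costs
    (time coded 0, 1, 2, ...)."""
    n = len(costs)
    xbar = (n - 1) / 2
    ybar = sum(costs) / n
    sxy = sum((t - xbar) * (y - ybar) for t, y in enumerate(costs))
    sxx = sum((t - xbar) ** 2 for t in range(n))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Illustrative class mean slopes ($/year) from the GMM example above.
CLASS_SLOPES = {"Rising": 520.0, "Stable": 80.0, "Declining": -200.0}

def nearest_class(slope):
    """Assign a trajectory to the class with the closest mean slope."""
    return min(CLASS_SLOPES, key=lambda c: abs(CLASS_SLOPES[c] - slope))
```

An institution with costs of $10,000, $10,500, and $11,000 over three years has a fitted slope of $500/year and would fall nearest the "Rising" class.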

What You Get

  • Trajectory class assignment: Which group you belong to, with probability of membership (e.g., "87% probability of being in the 'Rising' class")
  • Growth parameters: Your estimated intercept and slope with confidence intervals
  • Predictors of trajectory: What institutional characteristics predict membership in each class (Carnegie, control, enrollment size, faculty mix)
  • 3-year forecast: Projected cost trajectory with confidence intervals, based on your growth model parameters
  • Parallel process results: Whether your enrollment and cost trajectories are coupled or independent

Beyond trend lines: A simple trend line treats every institution's trajectory as the same shape (linear). Growth mixture models discover that institutions follow fundamentally different patterns — and knowing which pattern you're on changes the strategic response entirely.

Strategic Model 3

Program Prioritization Matrix

Composite Decision Matrix Multi-Criteria Analysis

This is the ultimate consulting deliverable — a data-driven recommendation engine for program investment and disinvestment. It combines efficiency scores, enrollment demand, and strategic value into a single decision framework.

The Framework

```
                 High Demand / Growing Market
                              |
            INVEST            |         SUSTAIN
         (grow, fund)         |  (optimize, reduce cost)
                              |
High Efficiency --------------+-------------- Low Efficiency
                              |
           MONITOR            |       RESTRUCTURE
        (niche value)         |  (consolidate, sunset)
                              |
                 Low Demand / Declining Market
```

How Each Axis Is Quantified

| Axis | Components | Data Source |
|---|---|---|
| Cost Efficiency (X) | DEA score + regression residual + cost-per-FTE percentile within CIP | Models 1 & 3 |
| Demand Signal (Y) | 5-year enrollment growth rate + completion rate trend + BLS occupation projections | IPEDS + Bureau of Labor Statistics |
| Strategic Value (Z) | Mission alignment + accreditation requirements + cross-subsidy role + institutional distinctiveness | Client input (scored rubric) |
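Once the axis scores are computed, the quadrant assignment itself is a simple threshold rule. A Python sketch with hypothetical percentile-scaled scores and cutoffs:

```python
def quadrant(efficiency, demand, eff_cut=0.5, dem_cut=0.5):
    """Map percentile-scaled efficiency and demand scores (0 to 1)
    to the prioritization matrix. Cutoffs here are hypothetical."""
    if demand >= dem_cut:
        return "INVEST" if efficiency >= eff_cut else "SUSTAIN"
    return "MONITOR" if efficiency >= eff_cut else "RESTRUCTURE"
```

A high-efficiency, high-demand program lands in INVEST; a low-efficiency, low-demand program in RESTRUCTURE. The strategic value axis then moderates the recommendation rather than moving the point.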

What You Get

  • Interactive quadrant plot: Each CIP-code program plotted with drill-through to supporting evidence
  • Action recommendations: Specific guidance for each quadrant — not just "restructure" but "consider consolidating CIP 42.01 with CIP 42.27; combined program would move from Restructure to Monitor quadrant"
  • Scenario modeling: "If you sunset programs X and Y and redirect resources to Z, your overall institutional efficiency score improves by 0.08"
  • Board-ready visualization: Suitable for presentation to governing boards, with clear legend and interpretive notes
Strategic Model 4

Faculty Workload Simulation

Monte Carlo Simulation What-If Modeling

Faculty compensation is the largest component of instructional cost (typically 80-90%). Every retirement, every new hire, every adjunct-to-lecturer conversion changes your cost structure. This simulation engine lets you model those changes before making them.

The Question It Answers

"What if we replace 5 retiring tenured faculty with 3 lecturers and 4 adjuncts? What happens to our cost per FTE, our student-faculty ratio, our percentile rank among peers, and our trajectory over 5 years?"

Input Parameters

| Parameter | Source |
|---|---|
| Faculty retirements by type and year | Client-provided or actuarial estimate |
| New hires by type (T/TT, lecturer, adjunct) | Client scenario input |
| Salary by faculty type | IPEDS HR data or client-provided |
| Benefit rates by type | Client-provided or national average |
| Teaching capacity by type (SCH/FTE) | Estimated from current SCH/FTE ratios |
| Enrollment projection | Client-provided or growth model forecast |

What You Get

  • What-if calculator: Interactive Power BI parameter page — slide faculty mix percentages, see cost impact in real time
  • 5-year projection: Cost trajectory under different faculty replacement scenarios, with peer percentile tracking
  • Break-even analysis: How many adjuncts replace one T/TT position at equal cost? At equal SCH output? (These are different numbers.)
  • Peer comparison overlay: Where would the simulated faculty mix place you relative to current peers? "Replacing 5 T/TT with lecturers moves you from 38th to 29th percentile on cost/FTE but from 48th to 35th on % T/TT."
  • Monte Carlo uncertainty: Salary and enrollment projections have uncertainty. We run 1,000 simulations with randomized parameters to show the range of possible outcomes, not just a point estimate.
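The Monte Carlo step above can be sketched compactly. A minimal Python illustration of the "replace 5 retiring T/TT with 3 lecturers and 4 adjuncts" scenario; every parameter (salaries, adjunct costs, remaining budget, enrollment, faculty counts) is a hypothetical placeholder, not client or IPEDS data:

```python
import random
from statistics import mean, quantiles

def simulate_cost_per_fte(n_sims=1000, seed=42):
    """Monte Carlo sketch: 95 remaining T/TT lines plus 3 lecturers and
    4 adjuncts, with uncertainty in salaries and enrollment.
    All parameters are hypothetical placeholders."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        tt_salary   = rng.gauss(110_000, 8_000)  # loaded cost per T/TT line
        lect_salary = rng.gauss(65_000, 5_000)   # loaded cost per lecturer
        adj_cost    = rng.gauss(21_000, 3_000)   # per adjunct, per year
        other_cost  = 18_000_000                 # non-faculty instructional budget
        fte         = rng.gauss(9_500, 300)      # enrollment projection
        faculty_cost = 95 * tt_salary + 3 * lect_salary + 4 * adj_cost
        results.append((other_cost + faculty_cost) / fte)
    lo, _, hi = quantiles(results, n=4)          # interquartile range of outcomes
    return mean(results), lo, hi
```

Rather than a single point estimate, each scenario yields a distribution of cost/FTE outcomes, and it is the spread between the quartiles (not just the mean) that should drive the decision.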

The tradeoff made visible: Every faculty composition decision involves a cost-quality tradeoff. This simulation doesn't tell you what to do — it shows you the quantified consequences of each option so you can make an informed decision.