Explanatory Model 1

Cost Drivers Analysis

Multiple Regression Dominance Analysis

The foundational question: what predicts instructional cost per FTE student after controlling for institutional characteristics? This model identifies which factors your institution can influence and which are structural constraints.

The Question It Answers

Is your instructional cost high because of factors you control (faculty mix, program composition) or factors you don't (state cost of living, Carnegie classification)? The regression residual tells you exactly how much your actual spending deviates from what the model predicts for an institution with your profile.

Model Specification

```
Cost_per_FTE = β0
             + β1(FTE Enrollment)           ← scale effect
             + β2(% Graduate Students)      ← program mix
             + β3(% Tenured/Tenure-Track)   ← faculty investment
             + β4(% Full-Time Faculty)      ← faculty structure
             + β5(Research Exp per T/TT)    ← research intensity
             + β6(Carnegie Classification)  ← institutional type
             + β7(Public/Private Control)   ← funding model
             + β8(Urbanization)             ← cost of living proxy
             + ε
```

What You Get

| Deliverable | Description |
|---|---|
| Coefficient table | Which factors significantly predict cost, with direction and magnitude. A one-unit increase in % T/TT faculty predicts a $X increase in cost/FTE. |
| Partial R² decomposition | How much variance each predictor explains, using Shapley values (dominance analysis). Answers: "Faculty mix explains 28% of cost variation; research intensity explains 19%." |
| Residual analysis | Your residual = actual cost minus predicted cost. Positive = spending more than expected. Negative = running lean. This is your efficiency signal. |
| Predicted vs. actual plot | Visual showing where your institution falls relative to the regression line among all peers. |
| Scenario analysis | "If you shifted faculty mix by X%, your predicted cost would change by $Y." Actionable what-if modeling grounded in the regression coefficients. |
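The residual and scenario deliverables reduce to simple arithmetic on the fitted coefficients. A minimal sketch in Python — every coefficient and dollar figure here is a hypothetical placeholder, not a fitted result:

```python
# Illustrative only: coefficients and values are hypothetical, not fitted estimates.
def predicted_cost(beta0, betas, features):
    """Linear prediction: beta0 + sum of coefficient * feature."""
    return beta0 + sum(b * x for b, x in zip(betas, features))

betas = [120.0, 45.0]  # hypothetical $ per point of % T/TT and % graduate
pred = predicted_cost(4000.0, betas, [60.0, 20.0])  # predicted cost/FTE

residual = 12850.0 - pred        # actual minus predicted: positive = above expected
scenario_delta = betas[0] * 5.0  # shifting % T/TT by 5 points changes predicted cost by this
```

The same arithmetic underlies the "what-if" scenarios: a coefficient times a proposed change in the predictor gives the predicted cost movement.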

Assumptions We Test

  • Linearity (partial regression plots)
  • Homoscedasticity (Breusch-Pagan test)
  • Multicollinearity (VIF < 5 for all predictors)
  • Normality of residuals (Q-Q plot, Shapiro-Wilk)
  • Influential observations (Cook's distance — feeds into Outlier Detection model)

Key insight: The original Delaware Cost Study reported only descriptive benchmarks (means, quartiles). Our regression model goes further by answering why costs differ — separating controllable factors from structural ones.

R Implementation

```r
library(car)          # VIF, diagnostic plots
library(relaimpo)     # Dominance analysis (Shapley values)
library(performance)  # Model diagnostics
library(ggplot2)      # Visualization

model <- lm(cost_per_fte ~ enrollment + pct_grad + pct_tt + pct_ft +
              research_per_tt + carnegie + control + urbanization,
            data = ipeds)

# Dominance analysis
calc.relimp(model, type = "lmg", rela = TRUE)
```
Explanatory Model 2

State & Sector Context

Multilevel Modeling (HLM) Random Effects

Institutions exist within states that have fundamentally different funding models, cost of living, and regulatory environments. A standard regression treats every institution as independent — but a public university in Mississippi and one in California operate in very different cost contexts. Multilevel modeling accounts for this nesting structure.

The Question It Answers

How much of your cost is driven by state-level factors (things you can't control) versus institution-level decisions (things you can)? And does the relationship between, say, faculty mix and cost differ depending on what state you're in?

Model Structure

```
Level 1 (Institution i within State j):
  Cost_per_FTE_ij = β0j + β1(% T/TT)_ij + β2(Research Intensity)_ij
                    + β3(Enrollment)_ij + r_ij

Level 2 (State j):
  β0j = γ00 + γ01(State Appropriation per FTE)_j
            + γ02(Median Faculty Salary in State)_j + u0j
```

What You Get

| Deliverable | Description |
|---|---|
| Intraclass Correlation (ICC) | The percentage of cost variation that is between states vs. within states. If ICC = 0.25, then 25% of cost differences are explained by which state you're in. |
| State random effects | A ranked list of all 50 states showing their cost premium or discount after controlling for institutional characteristics. "Being in California adds $1,200 to predicted cost/FTE." |
| Cross-level interactions | Does the effect of faculty mix on cost differ by state funding level? In high-appropriation states, adding T/TT faculty might cost less because state subsidies offset it. |
| Contextual effects | Separating within-state effects from between-state effects of the same predictor. The meaning of "high % T/TT" may differ depending on whether you're comparing within your state or across states. |
| Caterpillar plot | Visual ranking of all state random intercepts with 95% confidence intervals — instantly shows which states are significantly above or below average. |
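The ICC itself has a simple variance-components form: between-group variance over total variance. A naive moment-based sketch in Python (the actual deliverable uses the model-based estimate from `performance::icc`); the grouping structure here is hypothetical:

```python
from statistics import mean, pvariance

def naive_icc(groups):
    """Approximate ICC: variance of group means over total variance.

    groups: dict mapping state -> list of institution costs.
    This is a crude moment estimator, not the REML-based ICC from lmer.
    """
    group_means = [mean(v) for v in groups.values()]
    within = mean(pvariance(v) for v in groups.values())   # avg within-state variance
    between = pvariance(group_means)                        # variance across state means
    return between / (between + within)
```

When all variation is between states the estimate approaches 1; when states are indistinguishable it approaches 0.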

Why this matters: If your cost is high but your state random effect is also high, the "problem" may be geographic, not institutional. This reframes the conversation from "we're inefficient" to "we're operating in an expensive state, and here's how we compare to peers in the same context."

R Implementation

```r
library(lme4)         # Multilevel models
library(lmerTest)     # Significance tests for MLM
library(sjPlot)       # Caterpillar plots, coefficient tables
library(performance)  # ICC, model comparison

model <- lmer(cost_per_fte ~ pct_tt + research_per_tt + enrollment +
                state_approp_per_fte + median_salary_state +
                (1 | state),
              data = ipeds)

icc(model)    # Intraclass correlation
ranef(model)  # State-level random effects
```
Explanatory Model 3

Efficiency Frontier

Data Envelopment Analysis Malmquist Index

Data Envelopment Analysis (DEA) is a non-parametric method that constructs an empirical efficiency frontier from the data. The original Delaware Cost Study cited DEA in multiple conference presentations (AIR 2018, ACE 2019) as the next-generation analytical method — but never implemented it. We build what they envisioned.

The Question It Answers

Given your inputs (expenditures, faculty), are you producing the maximum possible outputs (students served, degrees awarded)? If not, how far are you from the frontier, and which specific efficient institutions should you benchmark against?

How DEA Works

```
Outputs (Students, Degrees)
  ↑
  |           * A (efficient)
  |          /
  |         /
  |        * B (efficient)
  |       /
  |      /      * C (inefficient, score = 0.78)
  |     /
  |    * D (efficient)
  |   /
  |  /
  | /
  |/___________________________→ Inputs (Expenditures, Faculty)
```

The frontier connects A, B, D. C is 22% below the frontier. C's reference peers are B and D.
| Input/Output | Variables |
|---|---|
| Inputs | Instructional expenditures, FTE instructional staff, % T/TT faculty |
| Outputs | FTE students served, total degrees awarded, research expenditures (for R1s) |

DEA Variants We Run

| Variant | Purpose |
|---|---|
| CRS (Constant Returns to Scale) | Overall technical efficiency — are you on the frontier regardless of size? |
| VRS (Variable Returns to Scale) | Scale-adjusted efficiency — are you efficient for your size? Small colleges aren't penalized for not operating at R1 scale. |
| Scale Efficiency | CRS score / VRS score — are you operating at optimal scale? If not, should you grow or shrink? |
| Super-efficiency | Ranks institutions on the frontier against each other. Useful when multiple institutions score 1.0. |
| Malmquist Productivity Index | Tracks efficiency change over time, decomposed into: (a) efficiency change (catching up to the frontier) and (b) technical change (the frontier itself shifting). |
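For intuition, CRS efficiency in the one-input, one-output case reduces to a best-ratio comparison, and scale efficiency is just the CRS/VRS quotient. An illustrative Python sketch — real runs use the multi-input, multi-output linear-programming formulation in the R packages, and all numbers here are hypothetical:

```python
def crs_efficiency(inputs, outputs):
    """Single-input, single-output CRS DEA: score each unit against
    the best observed output-per-input ratio. 1.0 = on the frontier."""
    ratios = [y / x for x, y in zip(inputs, outputs)]
    best = max(ratios)
    return [round(r / best, 4) for r in ratios]

def scale_efficiency(crs_score, vrs_score):
    """CRS / VRS; values below 1.0 indicate the unit is off its optimal scale."""
    return crs_score / vrs_score
```

For example, a unit producing 18 outputs from 20 inputs scores 0.9 against a peer producing 10 from 10.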

What You Get

  • Efficiency score (0 to 1): 1.0 = on the frontier. 0.82 = you could produce the same outputs with 18% fewer inputs (or 18% more outputs with the same inputs).
  • Reference peer set: The specific efficient institutions that form your benchmark. "Your reference peers are Iowa State, Kansas State, and Missouri — here's what they do differently."
  • Slack analysis: Exactly how much of each input to reduce or output to increase to reach the frontier. "Reduce non-personnel expenditures by $420K while maintaining current enrollment."
  • Frontier visualization: 2D projection showing the frontier curve and your institution's position.
  • Malmquist decomposition: "Your efficiency improved 4% from 2020 to 2024, but the frontier shifted outward 6%, so your relative position actually declined."

Why DEA over regression: Regression estimates an average relationship. DEA identifies the best practice frontier. An institution can be average by regression standards but far from the efficiency frontier. DEA also handles multiple inputs and outputs simultaneously without assuming a linear relationship.

R Implementation

```r
library(Benchmarking)  # Core DEA
library(deaR)          # Alternative with Malmquist
library(productivity)  # Malmquist Productivity Index

# Input-oriented VRS model
dea_result <- dea(X = inputs, Y = outputs, RTS = "vrs", ORIENTATION = "in")

# Malmquist index (requires panel data)
malm <- malmquist(X, Y, id, time, orientation = "input")
```
Explanatory Model 4

True Peer Groups

K-Means Clustering Hierarchical Clustering Mahalanobis Distance

Carnegie classification is the default peer grouping in higher education — but it's a blunt instrument. Two "Doctoral: Very High Research" institutions can have wildly different cost structures, faculty compositions, and enrollment profiles. Research from the original Cost Study's own presentations (NEAIR 2021) demonstrated that data-driven peer groups outperform Carnegie-based ones for benchmarking.

The Question It Answers

Who are your real peers based on your actual instructional cost and productivity profile? And how do you perform within that true peer group?

Method

  • Feature selection: Standardize all 8 core metrics (z-scores) so no single metric dominates due to scale differences
  • Optimal k determination: Run silhouette analysis, gap statistic, and elbow method to find the natural number of clusters — not arbitrarily chosen
  • K-means clustering: Partition institutions into k groups that minimize within-cluster variance
  • Hierarchical clustering: Run as validation — do both methods produce similar groupings? (Ward's method with Euclidean distance)
  • Cluster profiling: Characterize each cluster by its metric signature using radar/spider charts
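The partitioning step above can be sketched language-agnostically. A minimal Lloyd's-algorithm k-means in Python (the production analysis uses R's `kmeans` with `nstart = 25`; points and k here are toy values):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means (Lloyd's algorithm) on standardized feature vectors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Move each center to the mean of its assigned points.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(col) / len(cl) for col in zip(*cl))
    return centers, clusters

centers, clusters = kmeans([(0, 0), (0.1, 0), (5, 5), (5.1, 5)], k=2)
```

On these toy points the two tight pairs separate into two clusters of two; on real data the z-scored metric vectors take the place of the tuples.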

What You Get

| Deliverable | Description |
|---|---|
| Cluster assignment | Your institution is placed in a named, characterized peer group (e.g., "Research-Active, Teaching-Heavy" or "Low-Cost, High-Throughput"). |
| Cluster profiles | Radar charts showing the metric signature of each cluster. Instantly see what makes your group distinctive. |
| Carnegie cross-tabulation | How do data-driven groups differ from Carnegie classification? Some Carnegie classes may split into multiple clusters; others may merge. |
| Within-cluster benchmarks | Your rank on each metric within your data-driven peer group — more meaningful than Carnegie-wide comparisons. |
| 10 nearest neighbors | The institutions most statistically similar to yours across all dimensions (Mahalanobis distance). Your true comparators. |

Example finding: "Oklahoma State is classified as 'Doctoral: Very High Research' by Carnegie, but our cluster analysis places it in a group of 47 institutions we call 'Research-Active, Teaching-Heavy' — high research expenditure combined with high student-to-faculty ratios. This is a distinct cost profile from the typical R1 pattern."

R Implementation

```r
library(factoextra)  # Visualization (silhouette, fviz_cluster)
library(cluster)     # PAM, silhouette
library(NbClust)     # 30 indices for optimal k
library(fpc)         # Cluster validation statistics

# Optimal k
nb <- NbClust(scaled_data, method = "kmeans", min.nc = 2, max.nc = 12)

# K-means
km <- kmeans(scaled_data, centers = optimal_k, nstart = 25)

# Nearest neighbors
mahal_dist <- mahalanobis(scaled_data, center = target_inst, cov = cov(scaled_data))
```
Explanatory Model 5

Distribution Analysis

Quantile Regression Conditional Distributions

Standard regression tells you what predicts average cost. But the factors driving cost for the median institution may be entirely different from those driving cost at the 90th percentile. Quantile regression estimates separate models at different points in the cost distribution.

The Question It Answers

Do the same factors matter equally across the cost spectrum? Is research intensity irrelevant for low-cost institutions but a dominant factor for high-cost ones?

Model

```
Q_τ(Cost_per_FTE | X) = β0(τ) + β1(τ)(% T/TT) + β2(τ)(Research) + ...

where τ ∈ {0.10, 0.25, 0.50, 0.75, 0.90}
```

Each τ produces a different set of coefficients.
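Quantile regression replaces squared-error loss with the asymmetric check (pinball) loss; the τ-th sample quantile is exactly the constant that minimizes it. A Python sketch of the loss in the intercept-only case, with hypothetical cost values:

```python
def pinball_loss(tau, y, q):
    """Check loss: underpredictions weighted by tau, overpredictions by (1 - tau)."""
    return sum((tau if yi >= q else tau - 1) * (yi - q) for yi in y)

y = [6800, 8000, 9200, 10500, 12400]  # hypothetical cost/FTE values

# The candidate minimizing the tau = 0.5 loss is the sample median;
# raising tau shifts the minimizer toward the upper tail.
best_median = min(y, key=lambda q: pinball_loss(0.5, y, q))
best_p90 = min(y, key=lambda q: pinball_loss(0.9, y, q))
```

The full method minimizes this same loss over regression coefficients rather than a single constant, which is why each τ yields its own coefficient vector.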

What You Get

| Deliverable | Description |
|---|---|
| Quantile coefficient plots | Line plots showing how each predictor's coefficient changes across τ = 0.10 to 0.90. Where the line is flat, the effect is constant. Where it rises or falls, the effect depends on where you sit in the cost distribution. |
| Conditional distribution | Given your institution's characteristics, the full predicted distribution of cost — not just a point estimate. "Institutions with your profile range from $6,800 (10th percentile) to $12,400 (90th percentile)." |
| Tail analysis | What drives the most expensive institutions? If the 90th percentile coefficient for research intensity is 3x the median coefficient, research buyouts are disproportionately driving costs at the top. |

Real-world value: A provost asking "why is our cost high?" gets a fundamentally different answer from quantile regression than from OLS. If the OLS model says faculty mix explains 28% of cost variation, the quantile model might show that faculty mix explains 40% of variation at the 90th percentile but only 15% at the 25th percentile.

R Implementation

```r
library(quantreg)  # Core quantile regression
library(ggplot2)   # Coefficient plots

# Estimate at 5 quantiles
taus <- c(0.10, 0.25, 0.50, 0.75, 0.90)
qr_model <- rq(cost_per_fte ~ pct_tt + research_per_tt + enrollment + pct_grad,
               tau = taus, data = ipeds)

summary(qr_model)
plot(qr_model)  # Coefficient plots across quantiles
```
Explanatory Model 6

Anomaly Detection

Mahalanobis Distance Cook's Distance DBSCAN

Before acting on any benchmarking result, you need to know whether your institution is an outlier — and if so, why. We apply four complementary methods because each catches different types of anomalies.

Four Methods, One Answer

| Method | What It Catches | How It Works |
|---|---|---|
| Mahalanobis distance | Multivariate outliers | Measures how far an institution is from the center of the data across all metrics simultaneously, accounting for correlations. An institution can look normal on every single metric but be unusual in the combination. |
| Cook's distance | Influential observations | From the regression model — identifies institutions that disproportionately affect the regression results. Removing this institution would change the coefficients significantly. |
| DBSCAN | Density-based outliers | Finds institutions that don't belong to any natural cluster. Unlike Mahalanobis (which assumes a single center), DBSCAN works with arbitrary shapes and identifies "noise points." |
| IQR fencing | Per-metric outliers | Simple, transparent flagging for the scorecard. Any metric beyond 1.5 × IQR from the median gets flagged. Easy for stakeholders to understand. |
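The IQR fence is simple enough to show inline. A Python sketch of the flagging rule as stated above (distance from the median exceeding 1.5 × IQR); the data are illustrative:

```python
from statistics import median, quantiles

def iqr_flags(values, k=1.5):
    """Flag any value whose distance from the median exceeds k * IQR."""
    q1, _, q3 = quantiles(values, n=4)  # first and third quartiles
    iqr = q3 - q1
    m = median(values)
    return [abs(v - m) > k * iqr for v in values]

flags = iqr_flags([10, 11, 12, 13, 100])  # only the extreme value is flagged
```

This per-metric rule is deliberately transparent; the multivariate methods in the table catch what it cannot.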

What You Get

  • Outlier flag column in every report (per-metric and multivariate)
  • Diagnostic narrative for each flagged institution: which metrics are unusual, how unusual, and what characteristics you share with other outliers
  • Population-level report: "These 15 institutions have cost/FTE > 1.5 IQR above their Carnegie median — here's what they have in common (high research intensity, low enrollment, private control)"
  • Decision guidance: "Your outlier status on X metric is driven by Y — this may be intentional (mission-driven) or a signal to investigate"

Why four methods: No single outlier detection method is complete. Mahalanobis misses non-elliptical clusters. Cook's distance applies only within the regression model. DBSCAN requires choosing an epsilon neighborhood radius. IQR fencing is univariate only. Together, they provide a comprehensive picture.

Strategic Model 1

Causal Pathway Modeling

Structural Equation Modeling Measurement Invariance

This is the most sophisticated model in our toolkit — and it's led by Dr. Jam Khojasteh, an Associate Editor of Structural Equation Modeling: A Multidisciplinary Journal, with 40+ publications in SEM and related methods. SEM doesn't just identify what predicts cost — it maps the full web of causal pathways showing how factors relate to each other in producing cost.

The Question It Answers

How do research intensity, faculty investment, and enrollment profile interact to produce instructional cost? What are the direct effects (research → cost) versus the indirect effects (research → faculty mix → workload → cost)? Does the model work the same way for public and private institutions?

Model Structure

```
              Institutional Resources (Latent)
                           |
               +-----------+-----------+
               |           |           |
               v           v           v
           Research     Faculty    Enrollment
          Intensity   Investment    Profile
           (Latent)    (Latent)     (Latent)
               |           |           |
               |        +--+--+        |
               v        v     v        v
         Instructional        Productivity
              Cost  <--------   (Latent)
            (Latent)
```

Latent Variable Indicators

| Latent Variable | Measured IPEDS Indicators |
|---|---|
| Research Intensity | Research exp/T/TT, % T/TT faculty, Carnegie R-classification |
| Faculty Investment | % T/TT, % full-time, average faculty salary |
| Enrollment Profile | FTE total, % graduate, degrees per 100 FTE |
| Instructional Cost | Cost per FTE, personnel %, instruction % of E&G |
| Productivity | Students per faculty, degrees per faculty |

What You Get

| Deliverable | Description |
|---|---|
| Path diagram | Publication-quality diagram with standardized path coefficients on every arrow. The visual tells the full story. |
| Direct effects | Research intensity → cost: β = 0.34. "A one standard deviation increase in research intensity directly increases cost by 0.34 SD." |
| Indirect effects | Research → faculty mix → cost: β = 0.18. "Research intensity also increases cost indirectly by changing faculty composition." |
| Total effects | Direct + indirect = 0.52. "The total impact of research intensity on cost is larger than either path alone." |
| Model fit indices | CFI ≥ 0.95, RMSEA ≤ 0.06, SRMR ≤ 0.08 — publication-standard reporting. |
| Measurement invariance | Does the model work the same way for public vs. private institutions? We test configural, metric, scalar, and strict invariance. Dr. Khojasteh literally wrote the book on this (Khojasteh & Lo, 2015). |
| Latent scores | Each institution gets estimated scores on each latent construct — your "research intensity score," "productivity score," etc. |

Why SEM over regression: Regression treats all predictors as independent causes. SEM models the relationships among predictors. Research intensity doesn't just predict cost — it changes faculty mix, which changes workload, which changes cost. SEM captures this full causal chain. It also handles measurement error through latent variables, producing less biased estimates.

R Implementation

```r
library(lavaan)    # Core SEM engine
library(semPlot)   # Path diagrams
library(semTools)  # Measurement invariance testing

model_spec <- '
  # Measurement model
  research =~ research_per_tt + pct_tt + carnegie_r
  faculty  =~ pct_tt + pct_ft + avg_salary
  enroll   =~ fte_total + pct_grad + deg_per_100
  cost     =~ cost_per_fte + personnel_pct + instr_pct_eg
  product  =~ students_per_fac + degrees_per_fac

  # Structural model
  cost    ~ research + faculty + product
  product ~ enroll + faculty
  faculty ~ research
'

fit <- sem(model_spec, data = ipeds)
summary(fit, standardized = TRUE, fit.measures = TRUE)

# Invariance testing (public vs. private)
measurementInvariance(model_spec, data = ipeds, group = "control")
```
Strategic Model 2

Cost Trajectory Forecasting

Latent Growth Curves Growth Mixture Models

A trend line shows where cost has been. A growth model tells you what trajectory class your institution belongs to and where it's headed. Based on the longitudinal SEM methods published by Dr. Khojasteh (Marcoulides & Khojasteh, 2018; Whittaker & Khojasteh, 2017).

The Question It Answers

Is your cost trajectory rising, stable, or declining? Are there distinct subpopulations of institutions following different paths? What predicts which path you're on? And does your enrollment trajectory co-evolve with your cost trajectory?

Three Progressively Richer Models

| Model | What It Does |
|---|---|
| Latent Growth Curve (LGC) | Estimates the average trajectory across all institutions (intercept = starting point, slope = rate of change) and individual variation around it. "The average institution's cost grew $312/year, but the standard deviation of slopes is $180 — there's huge variation." |
| Growth Mixture Model (GMM) | Identifies distinct subpopulations following different trajectories. "We identified 3 trajectory classes: 'Rising' (38% of institutions, avg +$520/year), 'Stable' (45%, avg +$80/year), and 'Declining' (17%, avg -$200/year). Your institution belongs to the Rising class." |
| Parallel Process Model | Models cost and enrollment trajectories simultaneously. "Institutions whose enrollment declined by > 5% show cost increases of $800/year — the fixed cost structure doesn't shrink with enrollment." |
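The intuition behind trajectory classes can be sketched with a two-stage shortcut: fit each institution's own linear growth, then assign it to the class with the nearest mean slope. This is a toy approximation in Python — a true GMM estimates classes and trajectories jointly — and the class slopes are the illustrative figures quoted above:

```python
def fit_linear_growth(costs):
    """Least-squares intercept and slope for one institution's yearly costs
    (time coded 0, 1, 2, ...)."""
    n = len(costs)
    xbar = (n - 1) / 2
    ybar = sum(costs) / n
    sxy = sum((t - xbar) * (y - ybar) for t, y in enumerate(costs))
    sxx = sum((t - xbar) ** 2 for t in range(n))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Illustrative class mean slopes ($/year) from the GMM example above.
CLASS_SLOPES = {"Rising": 520.0, "Stable": 80.0, "Declining": -200.0}

def nearest_class(slope):
    """Assign a trajectory to the class with the closest mean slope."""
    return min(CLASS_SLOPES, key=lambda c: abs(CLASS_SLOPES[c] - slope))
```

An institution with costs of $10,000, $10,500, and $11,000 over three years has a fitted slope of $500/year and would fall nearest the "Rising" class.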

What You Get

  • Trajectory class assignment: Which group you belong to, with probability of membership (e.g., "87% probability of being in the 'Rising' class")
  • Growth parameters: Your estimated intercept and slope with confidence intervals
  • Predictors of trajectory: What institutional characteristics predict membership in each class (Carnegie, control, enrollment size, faculty mix)
  • 3-year forecast: Projected cost trajectory with confidence intervals, based on your growth model parameters
  • Parallel process results: Whether your enrollment and cost trajectories are coupled or independent

Beyond trend lines: A simple trend line treats every institution's trajectory as the same shape (linear). Growth mixture models discover that institutions follow fundamentally different patterns — and knowing which pattern you're on changes the strategic response entirely.

Strategic Model 3

Program Prioritization Matrix

Composite Decision Matrix Multi-Criteria Analysis

This is the ultimate consulting deliverable — a data-driven recommendation engine for program investment and disinvestment. It combines efficiency scores, enrollment demand, and strategic value into a single decision framework.

The Framework

```
                 High Demand / Growing Market
                              |
            INVEST            |         SUSTAIN
         (grow, fund)         |  (optimize, reduce cost)
                              |
High Efficiency --------------+-------------- Low Efficiency
                              |
           MONITOR            |       RESTRUCTURE
        (niche value)         |  (consolidate, sunset)
                              |
                 Low Demand / Declining Market
```

How Each Axis Is Quantified

| Axis | Components | Data Source |
|---|---|---|
| Cost Efficiency (X) | DEA score + regression residual + cost-per-FTE percentile within CIP | Models 1 & 3 |
| Demand Signal (Y) | 5-year enrollment growth rate + completion rate trend + BLS occupation projections | IPEDS + Bureau of Labor Statistics |
| Strategic Value (Z) | Mission alignment + accreditation requirements + cross-subsidy role + institutional distinctiveness | Client input (scored rubric) |
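Once the axis scores are computed, the quadrant assignment itself is a simple threshold rule. A Python sketch with hypothetical percentile-scaled scores and cutoffs:

```python
def quadrant(efficiency, demand, eff_cut=0.5, dem_cut=0.5):
    """Map percentile-scaled efficiency and demand scores (0 to 1)
    to the prioritization matrix. Cutoffs here are hypothetical."""
    if demand >= dem_cut:
        return "INVEST" if efficiency >= eff_cut else "SUSTAIN"
    return "MONITOR" if efficiency >= eff_cut else "RESTRUCTURE"
```

A high-efficiency, high-demand program lands in INVEST; a low-efficiency, low-demand program in RESTRUCTURE. The strategic value axis then moderates the recommendation rather than moving the point.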

What You Get

  • Interactive quadrant plot: Each CIP-code program plotted with drill-through to supporting evidence
  • Action recommendations: Specific guidance for each quadrant — not just "restructure" but "consider consolidating CIP 42.01 with CIP 42.27; combined program would move from Restructure to Monitor quadrant"
  • Scenario modeling: "If you sunset programs X and Y and redirect resources to Z, your overall institutional efficiency score improves by 0.08"
  • Board-ready visualization: Suitable for presentation to governing boards, with clear legend and interpretive notes
Strategic Model 4

Faculty Workload Simulation

Monte Carlo Simulation What-If Modeling

Faculty compensation is the largest component of instructional cost (typically 80-90%). Every retirement, every new hire, every adjunct-to-lecturer conversion changes your cost structure. This simulation engine lets you model those changes before making them.

The Question It Answers

"What if we replace 5 retiring tenured faculty with 3 lecturers and 4 adjuncts? What happens to our cost per FTE, our student-faculty ratio, our percentile rank among peers, and our trajectory over 5 years?"

Input Parameters

| Parameter | Source |
|---|---|
| Faculty retirements by type and year | Client-provided or actuarial estimate |
| New hires by type (T/TT, lecturer, adjunct) | Client scenario input |
| Salary by faculty type | IPEDS HR data or client-provided |
| Benefit rates by type | Client-provided or national average |
| Teaching capacity by type (SCH/FTE) | Estimated from current SCH/FTE ratios |
| Enrollment projection | Client-provided or growth model forecast |

What You Get

  • What-if calculator: Interactive Power BI parameter page — slide faculty mix percentages, see cost impact in real time
  • 5-year projection: Cost trajectory under different faculty replacement scenarios, with peer percentile tracking
  • Break-even analysis: How many adjuncts replace one T/TT position at equal cost? At equal SCH output? (These are different numbers.)
  • Peer comparison overlay: Where would the simulated faculty mix place you relative to current peers? "Replacing 5 T/TT with lecturers moves you from 38th to 29th percentile on cost/FTE but from 48th to 35th on % T/TT."
  • Monte Carlo uncertainty: Salary and enrollment projections have uncertainty. We run 1,000 simulations with randomized parameters to show the range of possible outcomes, not just a point estimate.
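The Monte Carlo step above can be sketched compactly. A minimal Python illustration of the "replace 5 retiring T/TT with 3 lecturers and 4 adjuncts" scenario; every parameter (salaries, adjunct costs, remaining budget, enrollment, faculty counts) is a hypothetical placeholder, not client or IPEDS data:

```python
import random
from statistics import mean, quantiles

def simulate_cost_per_fte(n_sims=1000, seed=42):
    """Monte Carlo sketch: 95 remaining T/TT lines plus 3 lecturers and
    4 adjuncts, with uncertainty in salaries and enrollment.
    All parameters are hypothetical placeholders."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        tt_salary   = rng.gauss(110_000, 8_000)  # loaded cost per T/TT line
        lect_salary = rng.gauss(65_000, 5_000)   # loaded cost per lecturer
        adj_cost    = rng.gauss(21_000, 3_000)   # per adjunct, per year
        other_cost  = 18_000_000                 # non-faculty instructional budget
        fte         = rng.gauss(9_500, 300)      # enrollment projection
        faculty_cost = 95 * tt_salary + 3 * lect_salary + 4 * adj_cost
        results.append((other_cost + faculty_cost) / fte)
    lo, _, hi = quantiles(results, n=4)          # interquartile range of outcomes
    return mean(results), lo, hi
```

Rather than a single point estimate, each scenario yields a distribution of cost/FTE outcomes, and it is the spread between the quartiles (not just the mean) that should drive the decision.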

The tradeoff made visible: Every faculty composition decision involves a cost-quality tradeoff. This simulation doesn't tell you what to do — it shows you the quantified consequences of each option so you can make an informed decision.