Talent analytics, also called people analytics, is a relatively new research area within analytics and data science. It has the potential to help companies make informed decisions about talent acquisition, promotion, and retention. This work applies data science to predict “shiny star” employees in the U.S. public sector, defined as top-notch performers over the years of a given time span.

We clean a data set made available by the U.S. Office of Personnel Management (OPM) and present two models to predict the likelihood of success for federal agency employees: a stepwise logistic regression (logreg) model and a stochastic gradient boosting machine (gbm) model. The definition of success depends on which of the 7 target variable implementations we developed is used. For both models (logreg and gbm), the common strong predictors of a “shiny star” are the change in Grade and whether the employee is a Supervisor at the end of a given time span. A refined version of these models indicates that a common strong predictor of a “shiny star” is whether the employee holds a bachelor’s degree.

A special challenge arises because the information that attributes a pay raise to a specific employee with certainty (e.g. the PseudoID field) is suppressed in recent submissions (data after Q2 2014). For this period, we assign our own unique PseudoID values to employees based on 2 uniqueness criteria: a stringent criterion (7-9 fields) and a relaxed one (3-4 fields). We find that, when the models are scored against this period’s data set, the relaxed criterion captures more events overall in the top 3 deciles than the stringent criterion does.
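As a rough illustration of how a surrogate identifier might be derived from a combination of fields, the following minimal sketch builds a composite key per record and keeps it only when that key is unique within a snapshot. The function name `assign_surrogate_ids` and the field names (`agency`, `occupation`, `education`) are hypothetical and are not taken from the OPM data or from this work; the actual fields and matching procedure used in the study may differ.

```python
from collections import Counter


def assign_surrogate_ids(records, key_fields):
    """Return one surrogate ID per record, or None when the key is ambiguous.

    records    -- list of dicts, one per employee record in a single snapshot
    key_fields -- names of the fields that make up the uniqueness criterion
                  (hypothetical names; a stringent criterion uses more fields,
                  a relaxed criterion uses fewer)
    """
    # Build the composite key for every record.
    keys = [tuple(str(rec[field]) for field in key_fields) for rec in records]
    # Count how many records share each composite key in this snapshot.
    counts = Counter(keys)
    # A key identifies an employee only if exactly one record carries it.
    return ["|".join(key) if counts[key] == 1 else None for key in keys]


# Toy snapshot with made-up values.
snapshot = [
    {"agency": "A", "occupation": "0301", "education": "13"},
    {"agency": "A", "occupation": "0301", "education": "17"},
    {"agency": "B", "occupation": "2210", "education": "13"},
]
relaxed_ids = assign_surrogate_ids(snapshot, ["agency", "occupation"])
stringent_ids = assign_surrogate_ids(snapshot, ["agency", "occupation", "education"])
```

Because the ID is built deterministically from the key values, the same function applied to a later snapshot yields the same ID for a record whose key fields are unchanged, which is one simple way records could be linked across quarters under this assumption.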

We use the average percentage of events captured in the top 3 deciles, taken across all 7 target variable implementations, as the metric to determine the champion and runner-up models. On the test set where the PseudoID field is suppressed, and for both the regular and refined versions of the models, the champion model under the stringent uniqueness criterion, Rim2 logreg, captures close to 80% on average in the top 3 deciles, while the champion model under the relaxed uniqueness criterion, Top 5% gbm, captures 78% to 82% on average in the top 3 deciles.
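The top-3-decile capture metric can be sketched as follows: rank records by predicted score, keep the top 30%, and divide the number of events found there by the total number of events. This is a minimal illustration under the assumption that deciles are equal-sized slices of the score-ranked list; the function name `top_decile_capture` and the toy data are hypothetical, and the study's exact decile construction may differ.

```python
def top_decile_capture(scores, labels, deciles=3):
    """Fraction of all events (label == 1) that fall in the top score deciles.

    scores  -- predicted probabilities or scores, one per record
    labels  -- 1 for an event (e.g. a "shiny star"), 0 otherwise
    deciles -- how many of the highest-scoring deciles to include
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    total_events = sum(labels)
    if total_events == 0:
        raise ValueError("at least one event is required")
    # Rank records from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Number of records covered by the requested deciles.
    cutoff = (len(ranked) * deciles) // 10
    captured = sum(label for _, label in ranked[:cutoff])
    return captured / total_events


# Toy example: 10 records, 4 events, 2 of them among the 3 highest scores.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0]
toy_capture = top_decile_capture(toy_scores, toy_labels)
```

Averaging this value over the target variable implementations gives a single number per model, which is how a champion and runner-up could then be compared.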

Finally, we employ unsupervised learning techniques (association rule mining and clustering) over different time spans to explore the characteristics of the employees that the champion model (Top 5% gbm) correctly identified as shiny stars in the top 3 deciles. Contributions of this work include a promotion and firing model for employees in the U.S. public sector. The champion models can be used to predict top-notch performers over the years of a given time span. In addition, this work is a detailed, systematic investigation of data science and big-data techniques applied to the area of talent analytics.

Degree Date

Fall 2019

Document Type


Degree Name



Engineering Management, Information, and Systems


Aurélie Thiele

Number of Pages




Creative Commons License

Creative Commons Attribution-Noncommercial 4.0 License