HR Analytics Should Recognize Potential and Pitfalls in Use of “Big Data”
Jan 12, 2015
A finding (or even an allegation) of discriminatory conduct can be costly to a firm, both in the financial settlement of claims and in potential damage to the firm’s reputation. Claims regarding discriminatory outcomes for certain subsets of individuals may arise not only from intentionally disparate treatment of existing or prospective employees, but also from unintentional disparate impact on different groups because of a particular hiring or compensation policy. As a result, firms that are developing and implementing data-driven approaches to their HR functions should understand the implications of their methods and ensure that these methods not only are based on sound business practices but also avoid unintended effects on particular groups.
At a workshop on the implications of “Big Data” for employment issues last autumn, the Equal Employment Opportunity Commission (“EEOC”) outlined its current position on applying anti-discrimination statutes as a check on “discriminatory uses of big data in employee recruitment and screening.” The EEOC made two particular points that HR analytics practitioners should keep in mind:
- Firms should be “keeping detailed records of how they are using data [to] inform their recruitment strategies and hiring decisions;” and
- Predictive models used for making hiring, promotion, and compensation decisions may be problematic if they have “a disparate impact that is not offset by business necessities or an applicant’s ability to perform a job-related task.”
A sound modeling approach should address these types of issues at the outset of model development, rather than at the time of a government audit or when a discrimination lawsuit has been filed.
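One way to make the disparate impact question concrete early in model development is a simple screening calculation. The sketch below uses the “four-fifths” rule of thumb from the federal Uniform Guidelines on Employee Selection Procedures, which this article does not itself discuss: a group whose selection rate falls below 80 percent of the highest group’s rate is flagged for closer review. The function names, group labels, and applicant counts are hypothetical, and a flag is only a prompt for further analysis, not a legal conclusion.

```python
from collections import Counter


def selection_rates(records):
    """Return {group: selected / applicants} for (group, was_selected) pairs."""
    applicants = Counter()
    selected = Counter()
    for group, was_selected in records:
        applicants[group] += 1
        if was_selected:
            selected[group] += 1
    return {g: selected[g] / applicants[g] for g in applicants}


def impact_ratios(records):
    """Ratio of each group's selection rate to the highest group's rate."""
    rates = selection_rates(records)
    top = max(rates.values())
    if top == 0:
        # Nobody was selected, so the ratio is undefined.
        return {g: None for g in rates}
    return {g: rate / top for g, rate in rates.items()}


# Hypothetical applicant flow data: 100 applicants in each of two groups.
records = (
    [("A", True)] * 48 + [("A", False)] * 52
    + [("B", True)] * 30 + [("B", False)] * 70
)

ratios = impact_ratios(records)

# Assumed screening threshold: the four-fifths (0.8) rule of thumb.
flagged = sorted(g for g, r in ratios.items() if r is not None and r < 0.8)
print(ratios, flagged)
```

In practice a check like this would be run on the model’s recommended selections before they are acted on, and the inputs and results would be retained as part of the detailed records the EEOC describes.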
Predictive data-driven models used in the HR function consist of three primary components: (i) data; (ii) business factors, implemented in the form of algorithms that process the data; and (iii) results produced by the algorithms. A thorough assessment of these components should take place before the model is run and its results are implemented in the firm’s decision making.
HR analytics models may rely on a variety of data sources that must be integrated. Some of these data sources may contain information specific to individual employees at the firm (e.g., from personnel and compensation systems). Others may contain characteristics of individual prospective candidates (e.g., from applicant flow logs), while still others may contain metrics that are not specific to an individual or even the firm (e.g., benchmarks on population characteristics from public sources). These data are often imperfect, because employee and candidate characteristics sometimes cannot be measured easily or accurately, and it is important to evaluate them thoroughly before they are analyzed. However, HR analytics practitioners should avoid simply assuming, without the support of appropriate testing, that a data source is inherently biased in a way that will necessarily lead to (or obscure) disparate outcomes for protected classes.
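As one illustration of testing a data source rather than assuming it is biased, the sketch below compares how often a field is missing across groups. The field name `rating`, the group labels, and the rows are hypothetical; a real evaluation would cover many fields and would use a formal statistical test before drawing conclusions from a gap.

```python
from collections import defaultdict


def missing_rate_by_group(rows, group_key, field):
    """Share of rows in each group where `field` is absent or None."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        total[group] += 1
        if row.get(field) is None:
            missing[group] += 1
    return {g: missing[g] / total[g] for g in total}


# Hypothetical extract from a personnel system.
rows = [
    {"group": "A", "rating": 4.0},
    {"group": "A", "rating": 3.5},
    {"group": "A", "rating": None},
    {"group": "A", "rating": 4.5},
    {"group": "B", "rating": None},
    {"group": "B", "rating": 3.0},
    {"group": "B"},  # field absent entirely
    {"group": "B", "rating": 2.5},
]

rates = missing_rate_by_group(rows, "group", "rating")
print(rates)
```

A large difference in missing rates between groups does not by itself show bias, but it identifies where the data deserve closer examination before modeling begins.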
The model’s algorithms should be based on a sound understanding of the business and a theory of what factors drive the variable of interest (e.g., the likelihood of success in a particular role, or the probability of retention). To model these processes appropriately, that theory should be specified when the model is constructed, not after results have been reviewed. It makes no sense, for example, to make decisions about which employee characteristics lead to higher performance without having a theory of what factors are likely to influence performance prior to executing the analysis. Rather, managers and business people with institutional knowledge should inform the ex ante theory, which can then be tested with the data-driven model.
In general, data mining in HR analytics without a sound a priori theory may be a risky practice. If a firm is found to have relied on a practice that had disparate impacts on protected classes, that firm could be required to provide valid business justifications for the disparate results. In many cases, these justifications may be seen as controversial (or even pretextual), which can be an unnecessary complication if the justification is indeed legitimate. However, if the model development is guided by a clear theory of what factors matter to the business (before results of the model are observed), potentially disparate results may be easier to explain and justify to auditors and courts. Additionally, predictive models should be “back-tested” to evaluate their validity. This process entails testing how well the model predicts already-known results, and it allows an assessment of the possible biases generated by the input data or the modeling process.
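A back-test of this kind can be sketched in a few lines. The example below scores historical records whose outcomes are already known and reports how often the model’s prediction matched the actual outcome, both overall and within each group. The scoring function `toy_score`, the `tenure_years` feature, the 0.5 threshold, and the data are all hypothetical stand-ins, and accuracy is only one of several measures a practitioner might compare across groups.

```python
from collections import defaultdict


def backtest(history, predict, threshold=0.5):
    """Compare predictions with known outcomes, overall and by group.

    `history` holds (features, group, actual_outcome) tuples. The key
    "overall" is reserved, so no group may use that label.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for features, group, actual in history:
        predicted = predict(features) >= threshold
        for key in ("overall", group):
            counts[key] += 1
            hits[key] += int(predicted == actual)
    return {key: hits[key] / counts[key] for key in counts}


def toy_score(features):
    """Hypothetical model: the score rises with years of tenure."""
    return min(1.0, features["tenure_years"] / 10)


# Hypothetical historical records with already-known outcomes.
history = [
    ({"tenure_years": 8}, "A", True),
    ({"tenure_years": 2}, "A", False),
    ({"tenure_years": 6}, "A", True),
    ({"tenure_years": 7}, "A", False),
    ({"tenure_years": 3}, "B", True),
    ({"tenure_years": 4}, "B", True),
    ({"tenure_years": 9}, "B", True),
    ({"tenure_years": 1}, "B", False),
]

accuracy = backtest(history, toy_score)
print(accuracy)
```

If the model is noticeably less accurate for one group than for another on historical data, that gap points to the input data or the modeling process as a possible source of bias and is worth investigating before the model informs decisions.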
Practitioners and regulators must also be able to appropriately interpret their models’ results to derive value from them. For example, the EEOC’s position that an analytic tool may be “illegal if it doesn’t accurately predict the success of an individual at a job” is misguided. While analytic methods may rely on large and complex data systems to determine statistical relationships between various factors, they are not likely to accurately predict the success of each particular individual. Individual-level accuracy is a false target because models are simplified representations of real-life phenomena and do not (and are not intended to) explain every nuance of real-world events. Rather, practitioners and regulators should focus on broader outcomes to determine whether a model provides useful and appropriate results.
Analytics of large and complex data can provide a useful framework for the application of data-driven methods to a firm’s HR function and ground business decision-making in defensible statistical analyses. However, HR analytics practitioners must apply these techniques in a thoughtful and rigorous manner, just as it is incumbent on regulators to understand what “Big Data” models can and cannot contribute to an analysis of discrimination.