By Santiago Iñiguez de Onzoño, Executive President of IE University
Traditionally, the paradigm used to measure the impact of training on the individual is the Four Level Evaluation Model, created by Donald Kirkpatrick.[i] Here I will deal with the pros and cons of this method. The four levels proposed by Kirkpatrick are distinct areas for evaluating a program's effectiveness, ordered on the basis of their complexity and how easily they can be implemented.
The first level measures participants' reaction and satisfaction. Normally, this is measured through surveys after the session or course, registering participants' opinion of the teacher's abilities (e.g., communication, knowledge, attention paid to participants) and how interesting or useful the material was. This is the most commonly used approach, and possibly the one most valued by CLOs, who are under pressure to offer courses that capture participants' attention.
Perhaps the biggest risk in giving such importance to evaluation surveys is that it may overrate the performance of professors: "star academics" and gurus tend to give very entertaining and interesting courses, even if there is little evidence of any improved learning in terms of effort or the development of certain skills. But the simple truth is that satisfaction surveys are necessary: teachers on executive training programs are expected to score at least 4.5 out of 5, or 8.5 out of 10. We should remember that directors attend these types of programs while continuing to work, sometimes giving up their own leisure or family time, meaning that CLOs expect courses to be both entertaining and informative. Poor results in satisfaction surveys do not just affect the teachers or the institutions delivering the program; they also reflect directly on the credibility of the CLO.
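These thresholds make the first level the easiest one to operationalize. A minimal sketch of such a check might look like this (the function name and survey data are hypothetical, not taken from any real evaluation system):

```python
# Minimal sketch of a Level One (reaction) check: average post-course
# survey scores and compare them against the satisfaction thresholds
# mentioned above (4.5 out of 5, or 8.5 out of 10). All names and
# figures here are hypothetical.

def meets_threshold(scores, threshold):
    """Return True if the average survey score reaches the threshold."""
    return sum(scores) / len(scores) >= threshold

# Six participants rating a course on a 5-point scale
five_point = [5, 4, 5, 4.5, 5, 4.5]      # average: ~4.67
print(meets_threshold(five_point, 4.5))   # True

# The same course rated on a 10-point scale
ten_point = [9, 8, 9, 8.5, 10, 9]         # average: ~8.92
print(meets_threshold(ten_point, 8.5))    # True
```

The interesting design question is not the arithmetic but what the threshold hides: a single average says nothing about why a course scored well, which is precisely the "star academic" distortion described above.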
The second level focuses on what participants have learned on a course and looks at two main areas: knowledge and skills. Knowledge tends to divide into concepts, models, and ideas, which can be measured through role-playing exercises, exams, quizzes, or tests, as well as through the application of that knowledge in practice, for example in projects or problem solving.
The second area of learning, skill acquisition, is harder to measure, particularly in the case of abilities bound up with our personalities, such as leadership. That said, there are abilities that improve over time and can be measured: spoken and written communication, leading groups, time management, delegating, and others.
The third level of evaluation assesses the impact of training on participants' behavior. The first difficulty with this approach is that changes in behavior only manifest themselves over time, whereas CLOs and other department heads are under pressure to obtain results in the short term.
As we'll see elsewhere in this book, companies are under pressure to shorten the length of in-company training programs, which makes it harder for them to impact positively on participants' behavior, given that behavioral change depends on repeated exercises and messages, on internalizing them, and on comparing situations before and after. One way of prolonging the educational moment so as to bring about changes in behavior is to trace a continuum between training and work, for example by running sessions or modules periodically over a period of, say, a year after a course.
The fourth and last level refers to the combined results of a training program, which is to say its results for both the individual and the collective. These results tend to be qualitative in nature: improved morale, a better understanding of the company's mission and values, or a greater willingness to focus on clients' or stakeholders' needs. The effectiveness of training programs can also be measured through increased sales in a department that has attended a program, or through the launch of a new business unit discussed or conceived during a program.
One thing is clear: there is no sense in expecting immediate results from training or development programs. I remember the case of a medium-sized medical products company, a leader in its region, that commissioned my business school to come up with a year-long management skills program for directors considered to have potential, many of them in sales.
The CLO was expecting an increase in sales within a few months of the program starting, but this wasn’t going to happen, in large part due to the participants’ dedication to the program and its intensity. Nevertheless, the year after the program ended, sales rose by 25 percent on the previous year.
Added to which, more of those directors with potential stayed on; they identified with the company's aims, and morale rose. Logically, given that this was only a medium-sized company with fewer than 500 directors (although it had a growth and diversification plan), it was recommended that the program be repeated every two years, so as to avoid any conflict between the time dedicated to the program and the sales work that drives the company's results.
There are other ways to measure the impact of a training program on participants and companies. For example, a multinational software developer measured the success of a program delivered by two well-known business schools on the basis of the number of participants who were promoted to vice president.[ii] But this approach raises a number of doubts. Was the program competing with others of a similar nature offered by other business schools? If so, such an approach would at least allow several programs to be measured against each other at the same time. And were participants selected a priori for their potential to occupy vice presidential posts? If so, then the program could be a good platform for testing their abilities and leadership, and even for boosting them.
That said, converting the program into a selection process for a limited number of people could distort the learning process, increase competitiveness, and undervalue the results of those who weren't promoted but who could still make a valuable contribution to the company.
As I mentioned earlier, it is possible to create relatively sophisticated evaluation mechanisms within Kirkpatrick's method. This is easier at levels one and two, as well as for some of the individual or collective results at level four, although there remains the question of whether a direct causal link can be drawn between the collective objectives reached and a specific training program. Level three, the impact on behavior, seems to raise the most problems: either programs need to be of significant duration, or participants need to be monitored for several years after they have completed the course. Longer programs also need to include mentoring that supports the candidates with the best potential for senior positions.
A credible evaluation also requires the inclusion of other stakeholders in the design of a program, particularly senior management. This means clearly establishing the objectives of the program and how they fit in with the company's strategy; what the desired learning outcomes are; which methodologies to use; and, of course, which monitoring mechanisms will analyze the impact of training over time, and not just in the immediate aftermath of delivery.
Although Kirkpatrick's model made a good start in trying to measure the cost-effectiveness of training, and is still widely used, its limitations have also become clear. The main criticism is that it doesn't answer such key questions as "Are we doing the right thing, and are we doing it well?"[iii] Still, it is a simple model to use, and one that provides a systematic tool to show the rest of the firm's business units. What's more, Level Four, which helps measure whether the company is achieving its objectives, is attractive to HR directors because it links their initiatives directly to the company's bottom line.
Perhaps Kirkpatrick's main contribution was to pave the way for other evaluation systems that delved deeper into each of the four levels, along with other approaches to measuring the impact of training on both the individual and the company. A McKinsey survey of HR managers on measuring the impact of training showed that this question weighs heavily when deciding which supplier to go with. That said, only 25 percent of those surveyed said that their programs measurably improved performance, and only 8 percent said they already had methods for analyzing the return on investment in training. The survey concluded that for a program to have any real impact on the company's activities, the curriculum should reflect its key business performance metrics.[iv]
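The return-on-investment analysis that the minority of respondents reported having is, at its core, a simple ratio. The sketch below uses the standard net-benefits-over-costs formula with hypothetical figures; it is not McKinsey's own method, and the hard part in practice is the attribution, not the arithmetic:

```python
# Illustrative ROI-on-training calculation with hypothetical figures.
# Real analyses would also need to isolate the program's effect from
# other factors (seasonality, market growth, staff turnover).

def training_roi_percent(attributable_benefits, program_costs):
    """Standard ROI formula: net benefits over costs, as a percentage."""
    return (attributable_benefits - program_costs) / program_costs * 100

# Hypothetical program: $120,000 in benefits attributed to training,
# against $80,000 in total program costs
print(training_roi_percent(120_000, 80_000))  # 50.0
```

Everything hinges on how "attributable benefits" are estimated, which is exactly why the survey found so few companies doing this credibly.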
[i] D. Kirkpatrick, Evaluating Training Programs: The Four Levels, 3rd ed. (San Francisco, CA: Berrett-Koehler Publishers).
[iii] R. Bates, "A critical analysis of evaluation practice: the Kirkpatrick model and the principle of beneficence", Evaluation and Program Planning 27 (2004), 341–347.
[iv] J. Cermak and M. McGurk, "Putting a value on training", McKinsey Quarterly, July 2010.