In comment #9 on this post, Susan makes a kind of canonical case I've heard from lots of assessment people.

First, I should say that I agree with 95% of the intended answers to Susan's rhetorical questions. We should be much clearer about what we want our students to get out of their degrees, and we should put in the hard work of assessing the extent to which we are successful.

But "assessment" in contemporary American bureaucracies almost always accomplishes exactly the opposite of the laudable goals that Susan and I share. And there are deep systematic reasons for this. Below, I will first explain three fallacies and then explain why everyone involved in assessment faces enormous pressure to go along with these fallacies. Along the way I hope to make it clear how this results in "assessment" making things demonstrably worse.**


(1) THE STACK FALLACY

Anyone involved in any way with assessment must read Will Oremus' canonical Slate article on "stack ranking" as a method of assessment and how it would have completely ruined Microsoft if not for their ability to monopolistically rent-seek. Then you need to ask yourself if your own administrative unit subtly involves stack ranking.

It is clear to me that the way student evaluation numbers are typically used involves stack ranking. At LSU (and many universities) your numbers are compared to the departmental average. This makes it a priori that half the department can be subject to criticism. The best teaching department in the universe will still be such that half will be judged below par. This is just fallacious. At Microsoft it's poisonous and harmful to the company in all of the ways Oremus adumbrates. The primary institutional pressure is to game your stack ranking, and this completely undermines the kind of teamwork environment necessary for good software design (the best coders are rarely the best human-computer interface people, and each of these skills subdivides in ways where teamwork is necessary). At Microsoft, if you put yourself on a team with someone likely to be stack-ranked higher than you by an administrator, you could easily be putting yourself out of a job. So the best coder has every incentive to work with the worst human-computer interface person! And this is just one example of the kind of destructive, perverse incentives that stack ranking causes.
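To make the arithmetic of the fallacy vivid, here is a minimal sketch in Python (with invented numbers, not anyone's actual evaluation data): even when every instructor in a department scores in a narrow band near the top of the scale, comparison to the departmental mean still puts roughly half of them "below average."

```python
import random

# Hypothetical department of uniformly excellent teachers: every score
# falls in a narrow band near the top of a 0-5 evaluation scale.
random.seed(0)
scores = [round(random.uniform(4.6, 5.0), 2) for _ in range(20)]

dept_mean = sum(scores) / len(scores)
below_par = [s for s in scores if s < dept_mean]

print(f"departmental mean: {dept_mean:.2f}")
print(f"'below average' instructors: {len(below_par)} of {len(scores)}")
# With finely grained scores, roughly half the department lands below the
# mean no matter how good everyone is: the comparison is purely relative,
# so it can never certify that an entire unit is doing well.
```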

Why would anyone think stack ranking is good? It's because of the pervasive view that something can always be made better. But here is my argument:

  1. We are finite beings; at some point reflective equilibrium kicks in. This is the point where our endeavors with respect to some task are as good as they are going to get.
  2. If (as at Microsoft) there is an institutional demand to always change something to make it "better," the end result will be that units at their reflective equilibrium either waste time changing things in ways that don't improve anything while dishonestly pretending that an improvement has been made, or (and this is overwhelmingly likely at some point) actually make changes that make things worse.
  3. Therefore, any system of assessment must be such that an entire unit can be determined to be in optimal reflective equilibrium.

Again, stack ranking completely prohibits this, but in my experience almost all mandated assessment forces you to waste time pretending to improve things because of the fallacious reasoning (which I'll call "the stack fallacy") explicated above.

This is actually a huge problem for software development. Microsoft now gives us progressively worse versions of Office every three to four years. Their monopolistic position ends up forcing us to purchase them, so it works out really well for their executives. But we can now see why the Microsoft suits pushed stack ranking: the very justification for putting out increasingly crappy versions of the same program (that everything can be improved, so something has to be changed) is the justification for stack ranking (two-thirds of the unit can always be improved).

(2) THE MANAGEMENT MYTH FALLACY

The last fifty or so years of studying the mind have demonstrably shown that almost all basic human cognition, not to mention expertise, involves mastery of specific domains. The era of GPS (the "General Problem Solver") has passed. This is why getting an MBA in "management" makes absolutely no change to your lifetime income prospects. This is why there are more science majors and more liberal arts majors than business majors among the CEOs of American companies. The myth of "management" is the myth that one can be an expert without any specific expertise. But any interesting human accomplishment is the result of actual expertise, either within the field in question or within some other field that can be seen to be analogous to the task at hand.

But late capitalism absolutely requires the management myth as justification for the predatory behavior of the 1% towards the rest of us.**** The Mitt Romneys supposedly deserve what they get because they can take any organization and make it better. This is of course hogwash.

But what happens in assessment? The reports are written for people with no competence in your field! What does someone in your local assessment office making you jump through all these hoops know about philosophy? Almost always far, far less than the undergraduates in your department. This is just crazy.

If the management myth were true, this would be no problem, but it is a myth. The management class thus has to maneuver its way around it via the next fallacy.

(3) THE QUANTITATIVE FALLACY

For creatures like us at least,***** assessment that works well is largely qualitative. People whom we have independent reasons for thinking are discerning about the relevant goods look at your syllabi and tests, sit in on your classes, and have conversations about what you are doing right, making suggestions about how to improve things. Often it is helpful to write up a short narrative yourself about what you are trying, what you might try differently, and how it went.

None of this is algorithmic or at all susceptible to quantitative measures.

But the problem is that the management myth requires quantitative measures. How else would it be possible for someone with a degree in "Educational Leadership" to determine whether physics and philosophy professors are doing a good job assessing themselves?

So instead of doing assessment that actually could improve things, we spend I don't know how many hours each semester getting our numbers in, collating them, and entering them into whatever new interface has been foisted upon us. We spend many more hours justifying our quantitative system of assessment to the relevant suits and learning progressively less usable interfaces (because the assessment team buys heavily into the stack fallacy with respect to its own work, and in fact grows its office using it).

(4) ADMINISTRATIVE RENT SEEKING

Here's another secret. The stack-fallacy, management-myth, quantitative-fallacy style of assessment mania being forced upon colleges by accreditation agencies sits atop a lot of other assessment work that is much more helpful.

  1. At LSU we already have external review every five years, where a philosopher from a Carnegie-ranked institution comes in and meets with a campus-wide group of academics and assesses everything we do. This is a tremendous amount of work; we basically have to write a book for the committee about what we do, and they write a book in response. We meet with the Provost to talk about which suggested changes are practicable.
  2. At LSU we already have annual review for faculty which includes faculty review of everyone's teaching.
  3. At LSU tenure-track faculty are intensively assessed in the manner described above both at their third-year review and at their tenure review.
  4. (Much less helpfully than the above three) at LSU we already have "strategic planning" documents that every unit has to fill out.

But our "assessment office" adds vastly more make-work on top of the work we're already doing, to no discernible effect other than mollifying SAACS (our accreditation agency).

Why is this kind of thing normal? If "assessment" were such a panacea, why not give us a break on the other stuff, which already consumes an incredible amount of time?

The decision theory is very complicated, but social scientists Bruce Bueno de Mesquita and Alastair Smith have a pretty good explanation, popularized recently in The Dictator's Handbook: Why Bad Behavior is Almost Always Good Politics. In the book they use their account of how and why dictators thrive and survive to examine undemocratic aspects of non-dictatorships, such as the way corporate boards are able to do things so at odds with the interests of their shareholders, the tremendous administrative bloat in higher education, and how and why our representatives are so unresponsive to public desire (for example, with respect to the minimum wage and banking oversight).

It's pretty complicated (please read the book!), but their theory predicts that there is strong pressure for leaders to surround themselves with people who are easily replaceable, and that this is the main reason so much service that used to be done by faculty is now being done by staff, and why these very services are growing exponentially while faculty numbers and pay aren't.

This is a really important point for tenured faculty. We can behave like bulls in a china shop when dealing with staff, because our tenure robs us of insight into the overwhelming American culture of fear that results from the fact that people can be fired at will (combined with the reserve army of the unemployed). We want to engage in spirited dialogue about the best solution, and they take it as threatening. And it is. They can be put out on the street in two weeks. While a particularly nefarious administrator can fire tenured people simply by closing down a whole program, that still takes a lot more work than letting obstreperous staff go.

In any case, Bueno de Mesquita and Smith show that there is nothing unique about the current Academy here. At every point in history and at every place on Earth there has been tremendous pressure for leaders to surround themselves with replaceable people who owe their sustenance and advancement entirely to the leader. So of course it would take enormous countervailing democratic pressure to get leaders to want to grow the number of tenure-track positions.

So here the decision-theoretic interests of administrators (and I include accreditation agencies and state boards) really push them to accept the stack fallacy, the management myth fallacy, and the quantitative fallacy.

But every form of dictatorship is a bad thing for most people (Bueno de Mesquita and Smith show this clearly), whether the dictatorial norms are instantiated in a country, a business, or an educational institution. To fight against "assessment" as it is currently practiced (where people have to pretend that the above three fallacies are not fallacious) is to fight for democracy in our workplace. Unfortunately, like my former student described in footnote ** below, there are many, many staff at our institutions whose livelihoods absolutely depend upon a lot of nonsense. As Bueno de Mesquita and Smith realize, this creates a very effective kind of blackmail with respect to anyone trying to improve anything.

But I can't bring myself to go along with transparent nonsense. I just can't. The point of education is to learn the truth. This should be a noble vocation and we have to be willing to say something when our institutions systematically undermine it.

[Notes:

*Please see PART I, in which I discourse generally about how the rhetoric of "excellence" works to increase administrative power in an undemocratic way that damages the institution. Please see PART II, where I investigate the rise of "TPS reports" type make-work in the academy and praise passive aggression as a response to it. PART III contains a story of the stack fallacy in action at LSU.

**I should note that in my experience arguing with people whose institutional position depends upon a lot of people accepting fallacious reasoning is almost always a waste of time. You will patiently explain some fallacy, and then they will respond just by using the very same fallacy.

For example, in one class I was explaining the statistical fallacy of concluding, from the fact that some high percentage of X's are also Y's, that X causes Y, and I gave as an example the "gateway theory" that drug warriors put forward. According to this theory, since some huge percentage of heroin addicts did pot first, pot should be illegal, because using pot makes you more likely to use heroin.

But the percentage of heroin users who earlier used pot is irrelevant. Note that 100% of full-on heroin addicts did water first.*** The army person responsible for the military's role in the war on drugs for the state of Ohio (it's frightening that, at least in the 90s, every state had such an office) was actually in the room, and took extremely strong exception to the idea that the fallacy was a fallacy. But all of her responses used the very fallacy in question. She even brought a bunch of flyers with pictures of dead junkies and their organs post-autopsy; the flyers all justified the war on drugs against pot in terms of the "gateway theory." I ended up just letting her show everyone in the class the pictures. She kind of scared me.
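For anyone who wants the fallacy in numbers, here is a toy calculation in Python (every figure invented purely for illustration) showing how the two conditional probabilities come apart:

```python
# Invented population figures, chosen only to illustrate the gateway fallacy.
population = 1_000_000
pot_users = int(0.30 * population)        # hypothetical: 30% have tried pot
heroin_addicts = int(0.001 * population)  # hypothetical: 0.1% are addicts
addicts_who_tried_pot = int(0.95 * heroin_addicts)  # 95% tried pot first

p_pot_given_heroin = addicts_who_tried_pot / heroin_addicts
p_heroin_given_pot = addicts_who_tried_pot / pot_users

print(f"P(tried pot | heroin addict) = {p_pot_given_heroin:.0%}")  # 95%
print(f"P(heroin addict | tried pot) = {p_heroin_given_pot:.3%}")  # ~0.317%
# The "gateway" rhetoric quotes the first number; the causal claim would need
# something like the second, plus a story ruling out common causes. (And 100%
# of addicts "did water first.")
```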

***Does that mean that water is actually a drug, as alleged by Spinal Tap bass player Derek Smalls? Feh.

****Consider Louisiana Governor Bobby Jindal, whose only competence is convincing otherwise sensible people that he is competent. I must say that he is very competent at this. Look for him soon at a presidential election near you.

*****Maybe God can understand everything in the language of mathematics. This isn't relevant to the best way for creatures like us to muddle our ways through.]


10 responses to “Excellence, Schmexcellence Part IV*: an attempt to be clearer about the main problems with assessment.”

  1. Jeff Bell

    Nice post, Jon. I particularly like your third footnote about Jindal’s only competence. One data point is that Microsoft has abandoned, as of last November, their stack ranking system. Whether this will be the beginning of a sea change that leads to fewer assessments at the university level or not, who knows (I have my doubts), but at least in the case of MS they clearly saw the problems with it. Here’s one story on it:
    http://www.theverge.com/2013/11/12/5094864/microsoft-kills-stack-ranking-internal-structure


  2. ben w

    “At LSU (and many universities) your numbers are compared to the departmental average. This makes it a priori that half the department can be subject to criticism. The best teaching department in the universe will still be such that half will be judged below par.”
    I think comparison against department average is silly, but it’s just not true, whether “average” means median or mean (or mode!), that half of the department will be below average.


  3. Jon Cogburn

    Isn’t “median” defined as that which half are above and half below?
    And I don’t think this is a case where median and mean are going to diverge hugely (as for example they do with respect to income levels in Louisiana).


  4. ben w

    The median of all of the following is 5:
    – 1, 2, 3, 4, 5, 6, 7, 8, 9
    – 1, 2, 5, 5, 5, 5, 5, 8, 9
    – 5, 5, 5, 5, 5, 5, 5, 5, 5
    In none of those cases are half above and half below.


  5. Jon Cogburn

    Yes, but student evaluation scores are so finely grained (typically from zero to five with decimal points) that the distribution is always like the first one (with less spread). And that’s close enough for horseshoes.
    Just read what I wrote above and put a “nearly” in front of wherever I say that one half are above and one half below, and note that you never, never, never get the second and third cases, given that the numbers have to be specific enough for administrators to be able to commit the fallacy of false precision when thinking about them (necessitating decimal points). They do this by using the mean. But as I noted, there is absolutely no reason to think that mean and median diverge here.
    Importantly, my general claim is unaffected by any of this. It’s still a nefarious instance of stack ranking whenever someone is judged badly simply because they are below the departmental mean. It’s still the case that this makes it a priori in the overwhelming majority of cases that somewhere around half of the department will receive an unacceptable review as a result of this.
    Finally, “median” is defined differently in different statistical approaches. The key idea is to try to capture the midpoint of a distribution, which is what “median” means in normal English, as I used it above.


  6. Tony Chemero

    Thanks for this, Jon. Assessment culture is a ubiquitous and awful feature of higher ed these days. We should all fight back as much as we can. Every one of us would do well to read Ginsberg’s The Fall of the Faculty.
    Also, FIDLAR is awesome.


  7. Jon Cogburn

    Yes! FIDLAR is beyond awesome, like potentially AC/DC level of awesomeness if (big if) they can keep it up. I think I would take a leave of absence if they needed a guitar tech or even just another guy to unload things from the truck.


  8. Aaron Lercher

    This problem is difficult to discuss. For most people inside or outside academia, objections to assessment sound like special pleading for privilege.
    Even if we follow the reasoning behind assessment, it will be based on a lot of jumping to conclusions.
    Let’s agree on something to measure. Consider the ranking of a set of papers by number of citations at some time. (Or if you prefer Google PageRank, think of that.) Everyone who studies citations knows, and says repeatedly, that citation counts are not a substitute for informed judgment. But we’ll try it anyway.
    The distribution of citations is different for different fields. In economics, a large portion of papers are uncited. In biochemistry, only a few papers are uncited. The reasons are complicated and not well understood. In other words, Jon’s worries about averages are correct.
    Suppose we find a clever way to compare sets with different distributions: By inference to the best explanation, we conclude there’s a “family” of distributions with a parameter along which the different distributions are all in the same family. Then we try to get inductive conclusions by finding a measure of how “close” an observed distribution needs to be in order to infer that it is described by one of the distributions in the family.
    Even after that, we still have the problem of how to divide our sets of papers into different fields. That’s a qualitative question: Different fields have different ideas about what they are trying to do.
    Also, maybe I can suggest a neutral way of understanding the management myth. Once a group of people are appointed to a task, they will be shirking if they don’t do their job. But it can easily be true that by doing their job, they will be obstructing some other activity, even when the activity they obstruct is valuable.
    Maybe in some cases (Madison’s theory of US government) that’s what we want. But not always.
    It is not easy for me to write about this coolly. Also, it seems pointless to raise objections.


  9. Susan

    Jon, thank you for taking my comment so seriously! I share many of your misgivings about assessment as it is often practiced in academia. If the goal is to determine how well faculty are doing at educating students, then artificially-constructed quantitative metrics won’t provide any useful answers. Judgments issued by those who lack expertise in the relevant discipline can be useful insofar as they address common concerns rather than specific content in the discipline, but they shouldn’t extend beyond that point. I believe that assessments designed and evaluated by the appropriate experts are valuable, however. The problem to me is not assessment per se, then, but the way it is often pursued. Since the assessment culture is here to stay at least for a while, I favor trying to reform it. Philosophers might be in an excellent position to point out flaws of currently-favored procedures and offer better models. I take Jon’s arguments here to be an example of attempting to change or influence the discourse on this subject for the better. I start from a different assumption, however: suppose we want to engage in assessment. What sort of assessment would we create, if nobody was telling us how it had to be done or judging the results?


  10. Robert McCall

    When I left school, I thought: why should I go to university to learn what someone else thinks? I couldn’t rationalize it. I thought I must learn to stand alone. Stack ranking and assessment are not for the student, in my mind, but entirely for academia. They need to have a curriculum to churn out certificates and degrees; otherwise we’re able to learn alone. The multinationals are in on it, because they profit, as do the banks and government. It makes sense – the labyrinth of power. Fascinating subject – great insight. Brilliant blog!

