The perils of dirty data and “overpriced” ham

So, we paid a government contractor $1.1 million for 2 pounds of sliced ham? That seemed to be the story as the Drudge Report started linking items from the federal government’s Recovery.gov site this morning.


Not so fast, said the Agriculture Department, as it swatted down Drudge’s reports with a rare rebuttal.

Now, all this back-and-forth might seem a bit excessive for a few pounds of sliced ham, but it illustrates one of the perils of transparency without context. Our government spends billions each year to collect, maintain and analyze all kinds of data. But it’s collected by humans, who make mistakes.

Take the ham fiasco. Everything appears aboveboard in the original description on Recovery.gov. But because of what appears in the “Description of Work/Service Performed” field, it looks like we paid a bunch of money for some pork. That’s not necessarily an error, but it’s definitely not clear to most readers. (I probably would have drawn the same conclusion, although I would have checked it out before writing a story.)

The people who deal with government data on a regular basis know all too well the problems associated with collecting and disseminating data. In the field of computer-assisted reporting, we call it “dirty data,” and we’re on guard for it all the time. (A chunk of my time as Database Editor is spent cleaning up data we get from various local, state and federal sources.)

Here’s how the folks at the Institute for Analytic Journalism put it in 2006:

An uncountable number of public agency databases have been created in the past 30 years. More and more, public and private decision-makers draw on this collected, digital data to make decisions about everything from disciplining doctors to zoning decisions to law enforcement to deciding who gets to vote. The often-unquestioned assumption is that the data, as found, analyzed and presented by a government or quasi-government agency, is valid. Increasingly, anecdotal evidence indicates that data is riddled with serious errors. Often, if initial investigations indicate the data is too suspect — and the cost to clean the data by hand or automatically too high — then good and important analysis and investigations are put aside.

The Government Accountability Office recently put out its own report on the subject of government data. The report is mainly a guide for government auditors, but it recognizes the problems posed by all these disparate sources of data, and by the public’s appetite to see it all put online.

While this guide focuses only on the reliability of data in terms of completeness and accuracy, other data quality considerations are just as important. In particular, consider validity. Validity (as used here) refers to whether the data actually represent what you think is being measured. For example, if we are interested in analyzing job performance and a field in the database is labeled “annual evaluation score,” we need to know whether that field seems like a reasonable way to gain information on a person’s job performance or whether it represents another kind of evaluation score.
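For the technically inclined, here is a rough sketch of what the first pass at completeness and accuracy can look like in practice. It is only an illustration: the records, the field name and the 1-to-5 scale are all made up for this example, and `check_field` is a hypothetical helper, not anything the GAO guide prescribes.

```python
# Hypothetical records; the field name and values are invented for illustration.
records = [
    {"employee": "A", "annual_evaluation_score": "4"},
    {"employee": "B", "annual_evaluation_score": ""},
    {"employee": "C", "annual_evaluation_score": "47"},
    {"employee": "D", "annual_evaluation_score": "n/a"},
]


def check_field(rows, field, low, high):
    """Sort each row's index into missing, not_numeric, out_of_range or ok."""
    report = {"missing": [], "not_numeric": [], "out_of_range": [], "ok": []}
    for i, row in enumerate(rows):
        raw = (row.get(field) or "").strip()
        if not raw:
            # Completeness problem: nothing was entered.
            report["missing"].append(i)
            continue
        try:
            value = float(raw)
        except ValueError:
            # Accuracy problem: the entry is not a number at all.
            report["not_numeric"].append(i)
            continue
        if low <= value <= high:
            report["ok"].append(i)
        else:
            # Accuracy problem: a number, but outside the assumed scale.
            report["out_of_range"].append(i)
    return report


# Assumes scores are supposed to fall between 1 and 5.
report = check_field(records, "annual_evaluation_score", 1, 5)
print(report)
```

Note what a check like this cannot do. It can flag a blank or a score of 47, but it cannot tell you whether “annual evaluation score” measures what you think it measures. That validity question still takes a phone call to the agency that keeps the data.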

In journalism, we try to follow the age-old advice: “If your mother says she loves you, check it out.” Maybe Drudge should do the same thing?

–Paul


