STEPHEN F. HAYES has written extensively in these pages about a large cache of documents and digital media captured in the course of Operation Iraqi Freedom and Operation Enduring Freedom. As a former intelligence officer who dealt with digital media exploitation and analysis issues at the Defense Intelligence Agency for nearly four years (2001 to 2005), I am prohibited from speaking publicly about what these documents may contain. What I can do is share my professional opinion on how one might solve some of the major problems associated with media exploitation.
Let us assume hypothetically that the United States has overthrown a hostile regime, and a vast amount of paper and digital media has been looted or otherwise removed from the regime's ministries, industrial centers, and other facilities. A great deal of this material has been obtained by the U.S. military and eventually the U.S. intelligence services.
Because of the lack of context--reliable information about where each item was obtained, who it belonged to, and so on--U.S. intelligence is faced with trying to make sense of a massive, amorphous heap of paper and digital data.
The demands are tremendous. Combat commanders need actionable intelligence so they can turn around and capture or kill more of the enemy (and obtain still more media to exploit). But technical expertise and high-end equipment are hard to come by. So is good, trustworthy linguistic support. Subject matter experts are by and large still back in Washington. Given the problems, how does U.S. intelligence perform deep analysis on data that clearly need it?
The process of exploitation begins with the recognition that neither human intelligence nor signals intelligence is the be-all and end-all. Human sources can lie. They can hide parts of the truth. Unwitting dupes in a deception scheme can honestly tell you what they think is the truth. Intercepted signals generally reveal only part of the intelligence picture. In a complex web of bad guys, tapping the phones of one or two leaves a lot of gaps, especially when your adversary is a whole network of webs.
Digital media, on the other hand, are less prone to be a means of deception, and even one node of a network can reveal a significant amount about the entire network. Think about the data that you keep on your computers at work and at home. Unless you write fiction for a living, these are the most accurate and factual data that can be obtained about you (short of reading your mind). The memos and letters you write, the financial information you calculate, the websites you visit, and the people you email or instant-message--all this is a gold mine for anyone looking to know who you are, what you do, and with whom you cavort. Now imagine having access to the same data about your adversary.
Enter "computer forensics." Exploiting paper documents is a relatively simple matter of reading and, if necessary, translating. Exploiting digital media is another story. Before you can read the data, you have to find it.
Outside the intelligence field, computer forensics is the process by which data are extracted, preserved, and analyzed for pertinence and meaning. The computer forensics community has worked very hard to bring its practices up to the level portrayed on TV shows like CSI, and digital evidence is now accepted in court as readily as fingerprints or blood spatter.
It stands to reason that the same people, tools, and methods used in computer crime labs are also used in intelligence efforts. However, the courtroom-centric, linear, law-enforcement mindset is actually a hindrance to effective exploitation for purposes of intelligence. A military intelligence unit is not interested in going to court; it is interested in helping soldiers put steel on target. This is not to say that a law enforcement approach has no use in the larger intelligence business (for example, in counterintelligence investigations), but if the goal is good data fast, then what is good for cops is not good for soldiers.
ASSUME OUR HYPOTHETICAL hostile regime was a fairly large country with a population around 25 million. It was not the most technically advanced nation in the world, but it had ministries and industries and was believed to have advanced weapons capabilities. All these needed computers to function. How much data does this translate into? Consider some rough calculations.
One floor of an average-sized university library full of academic journals contains about 100 gigabytes of data, the size of a large but not uncommon hard drive. The data in 100 such hard drives are comparable to the print holdings of the Library of Congress. Care to guess whether our formerly hostile regime had more than 100 computers?
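The arithmetic behind that comparison is easy to check. The sketch below uses the 100-gigabyte figure from the paragraph above; the decimal conversion (1,000 gigabytes to the terabyte) and the larger drive counts are assumptions chosen only for illustration, not estimates of any real regime's holdings.

```python
GB_PER_DRIVE = 100   # figure used in the text above
GB_PER_TB = 1000     # decimal units; an assumption for this sketch


def estimated_terabytes(num_drives, gb_per_drive=GB_PER_DRIVE):
    """Rough total volume, in terabytes, for a given number of captured drives."""
    return num_drives * gb_per_drive / GB_PER_TB


# 100 drives is the Library of Congress comparison from the text;
# the larger counts are hypothetical.
for drives in (100, 1_000, 10_000):
    print(f"{drives:>6} drives ~ {estimated_terabytes(drives):,.0f} TB")
```

On these assumptions, 100 drives come to roughly 10 terabytes, and every tenfold increase in captured machines adds another zero.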
As if sheer quantity of data were not problem enough, remember that the materials have almost no supporting contextual information. A computer forensics examiner in a crime lab generally has access to the investigators, knows the nature of the crime, and knows the most common places to look for evidence. A piece of evidence comes to him in a plastic bag with a tag on it saying where it was found, what kind of computer it came out of, and so on.
On the battlefield there is no time to "bag-and-tag" evidence. You find something that looks useful; you grab it, secure it, and move on. When the mission is over, you head to the tent where the Military Intelligence guys hang out and drop off your goods, covered in dust and a lot worse for wear. Under such conditions, context beyond a label reading "hard drive found on Monday" is scarce.
You have a huge store of data, only the slightest idea where it came from, and only a vague idea of what to look for. And you must do the job to a standard of proof mindlessly imported from law enforcement and far exceeding what is necessary for your work. Is it any wonder that some consider the job hopeless? How can we hope to make any real sense of this mass of stuff?
Technology can help. First, when data come without any meaningful context, we have to re-create it after the fact. We begin to do this by building lists of keywords, phrases, personalities, and other data that pertain to the topics of interest to our intelligence services. These lists can easily include tens of thousands of terms, names, figures, and data formats.
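To make the idea concrete, here is a minimal sketch of how such a list might be run against a piece of extracted text. The function names and the sample terms are hypothetical, and a real list of tens of thousands of entries in several languages would probably call for a purpose-built multi-pattern matcher rather than one large regular expression.

```python
import re
from collections import Counter


def build_matcher(watchlist):
    """Compile a case-insensitive pattern from a list of words and phrases."""
    terms = sorted({t.strip().lower() for t in watchlist if t.strip()},
                   key=len, reverse=True)  # longest first, so phrases win
    if not terms:
        raise ValueError("watchlist is empty")
    # \b assumes every term starts and ends with a letter or digit.
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def find_hits(text, matcher):
    """Count how often each watchlist term appears in the text."""
    return Counter(m.group(0).lower() for m in matcher.finditer(text))


# Hypothetical terms and text, for illustration only.
matcher = build_matcher(["convoy", "supply depot", "depot"])
hits = find_hits("The convoy reached the supply depot. Convoy delayed.", matcher)
print(dict(hits))
```

Because matches do not overlap, a phrase such as "supply depot" is counted once as the phrase and not again as "depot".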
The next step is to create a forensically sound process to spin off the more meaningful pieces of data (user-created documents, emails, spreadsheets, etc.) while leaving behind data that have less utility (files associated with the operating system and software applications). Let's call this our forensic centrifuge.
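One common way to do this kind of separation is to compare each file's cryptographic hash against a reference set of known operating system and application files, and to keep only the unknown files of document-like types. The sketch below assumes that approach; the extension list, the function names, and the in-memory sample files are all hypothetical stand-ins for whatever a real system would read from a forensic image.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical list of document-like file types worth keeping.
USER_EXTENSIONS = {".doc", ".xls", ".txt", ".eml", ".pdf"}


def sha256_hex(data: bytes) -> str:
    """Fingerprint of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def spin(files, known_hashes):
    """Split {path: bytes} into (keep, leave_behind) lists of paths.

    A file is kept when its type looks user-created and its hash is not
    in the reference set of known system and application files.
    """
    keep, leave_behind = [], []
    for path, data in files.items():
        is_known = sha256_hex(data) in known_hashes
        is_user_type = PurePosixPath(path).suffix.lower() in USER_EXTENSIONS
        (keep if is_user_type and not is_known else leave_behind).append(path)
    return sorted(keep), sorted(leave_behind)


# Illustrative in-memory "drive"; a real run would walk a disk image.
known = {sha256_hex(b"stock readme shipped with the OS")}
drive = {
    "system/kernel.sys": b"binary blob",
    "system/readme.txt": b"stock readme shipped with the OS",
    "users/a/memo.doc": b"draft memo text",
}
print(spin(drive, known))
```

A cautious design would route the left-behind files of unknown origin to a slower review queue rather than discarding them outright.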
Ideally our centrifuge will be built out of a cluster of computers: dozens of cheap processors networked together and scaled to rival a supercomputer in power. Cluster computers have been used by academia and the government for years, notably in places like NASA and the Department of Energy.
Computer programs written to take advantage of the multiprocessor capabilities of the centrifuge will extract the easy-to-obtain data files, recover deleted files and those that have been obfuscated by various means, and find the data stored in web browsers, email software, and other programs. There are commercial applications that do this, but our applications will have to be custom-made.
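Recovering deleted files often relies on a technique known as file carving: scanning the raw bytes of a drive for the signatures that mark the start and end of a known file type. The toy sketch below carves JPEG images out of a byte buffer using the format's standard start and end markers. It is an illustration under simplifying assumptions (unfragmented files, no embedded thumbnails), not a description of any fielded tool, and the function name and size limit are hypothetical.

```python
JPEG_HEADER = b"\xff\xd8\xff"  # start-of-image marker plus next marker byte
JPEG_FOOTER = b"\xff\xd9"      # end-of-image marker


def carve_jpegs(raw: bytes, max_size: int = 10_000_000):
    """Return (offset, data) pairs for candidate JPEGs found in raw bytes."""
    found = []
    pos = 0
    while True:
        start = raw.find(JPEG_HEADER, pos)
        if start == -1:
            break
        end = raw.find(JPEG_FOOTER, start + len(JPEG_HEADER))
        if end == -1:
            break
        end += len(JPEG_FOOTER)
        if end - start <= max_size:
            found.append((start, raw[start:end]))
            pos = end
        else:
            # Implausibly large candidate: skip this header and keep looking.
            pos = start + len(JPEG_HEADER)
    return found


# Fabricated buffer standing in for unallocated disk space.
sample = b"junk" + JPEG_HEADER + b"\xe0imagedata" + JPEG_FOOTER + b"more junk"
for offset, data in carve_jpegs(sample):
    print(offset, len(data))
```

Each candidate can then be handed to a separate processor for validation, which is where a cluster earns its keep.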
Once we have this notional system, we can aim it at our amorphous heap of captured data. The result should be large but much more meaningful subsets of data that we can be reasonably assured were created by members of the former regime. The problem of authenticity that sometimes complicates the exploitation of paper documents virtually does not arise.
While we now have all the meaningful data we can obtain, there is one more step to take before we can overlay what is called our "contextual appliqué." Our extracted data files must be compared with files of the same type--another computer process easily crafted--for both physical and content similarities. Through this process we should be able to determine things like:
* the names of people who drafted, edited, and were expected to receive memorandums, letters, and orders, and sometimes which computers they worked on;
* which computers were likely networked together, within the same ministry or between trusted associates;
* discussions between former regime elements in the form of both memorandums and email exchanges, as well as the personal thoughts revealed in private letters between confidants; and
* the foreign contacts of former regime elements in the form of email addresses and website data.
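The content comparison can be sketched simply. One widely used measure, assumed here for illustration, is Jaccard similarity over overlapping word sequences ("shingles"): two drafts of the same memo share most of their shingles, while unrelated documents share almost none. The function names and sample sentences are hypothetical.

```python
def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text, case-folded."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Shared shingles divided by total distinct shingles (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Two hypothetical drafts that differ by one word.
draft_1 = "the shipment arrives on monday morning"
draft_2 = "the shipment arrives on tuesday morning"
print(round(jaccard(draft_1, draft_2, k=2), 3))
```

Pairs of files scoring above some threshold could be grouped as versions of one document, which in turn hints at which machines and authors handled it.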
This information and more can be used to reconstruct both the physical and social networks of our former hostile regime. It can show who was talking to whom and who was working on what prior to the war. Our contextual appliqué is now complete, and many gaps left by insufficient prewar human and signals intelligence can be filled in.
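As a rough illustration of the "who was talking to whom" step, the sketch below reads the headers of recovered email messages with Python's standard `email` module and counts messages between each sender and recipient pair. The addresses are invented, the function name is hypothetical, and a real effort would also have to cope with aliases, shared accounts, and forged headers.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def contact_edges(raw_messages):
    """Count messages per (sender, recipient) pair from raw email text."""
    edges = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [addr.lower()
                   for _, addr in getaddresses(msg.get_all("From", [])) if addr]
        recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
        recipients = [addr.lower()
                      for _, addr in getaddresses(recipient_headers) if addr]
        for sender in senders:
            for recipient in recipients:
                edges[(sender, recipient)] += 1
    return edges


# Invented messages for illustration only.
messages = [
    "From: A <a@example.org>\nTo: b@example.org, C <c@example.org>\n"
    "Subject: schedule\n\nbody",
    "From: a@example.org\nTo: b@example.org\n\nbody",
]
for (sender, recipient), count in sorted(contact_edges(messages).items()):
    print(sender, "->", recipient, count)
```

The resulting pair counts are the raw material for a social network diagram: heavily trafficked pairs suggest working relationships, and addresses outside the regime's own domains suggest foreign contacts.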
THE SYSTEM JUST DESCRIBED for sorting and organizing data is notional, but not fanciful. The technology exists, the mental wherewithal exists, and the contract vehicles exist. The problem of finding enough qualified, trusted Arabic speakers and translators is great, but familiar. If we want to do this, we know how. If we want to do it fast, and provide sufficient resources, we can see significant results this year.
Adapting widely accepted technical methodologies to the unique challenges our intelligence services face is merely good sense. Modern technologies could be put to good use by the intelligence community to solve data extraction, processing, analysis, and display problems, if only certain elements in the community could get over the "not-invented-here" syndrome. There are signs of progress, but it is slow. Let's face it: You've probably got more powerful software on your computer at home than the average intelligence analyst has on the job.
There is of course a strong political aspect to media exploitation. Which end of the political spectrum will come out ahead is not clear going in. We could very well have in our possession ample material to support all the reasons the public was told justified going to war--or we could find the opposite, or find there are no clear conclusions to be drawn. But unless we look, we will always be faced--in the immortal words of Donald Rumsfeld--with a huge cache of "unknown unknowns."
After all the detainees have been interrogated, and all of the sand at suspected facilities has been sifted and tested, the only way finally to close the book on what our hypothetical former hostile regime was up to is to analyze every last reliable source of data available to us. That is, if we are really interested in the truth.
Michael Tanji is an associate of the Terrorism Research Center. He opines on intelligence and security issues at groupintel.com.