Various people have made a terrific effort to provide text-searchable versions of the 3000 pages recently issued by DOJ (that have been posted by the House in the form of 61 separate non-searchable pdfs), but in my opinion all the prior efforts have fallen short.
This includes my own effort, here, which was early but incomplete. Also, I was using an OCR program with a somewhat high error rate.
ePluribus Media posted files (announced here). But for some reason their files are unnecessarily large. By the time they're done, I think the whole batch will be over 250mb (this is bigger than the original non-searchable files, which add up to about 172mb). [Update and correction: the original files add up to 142mb. The ePM files add up to 204mb.] I've created files 85% smaller, containing the same information, including graphics. Also, their approach requires searching across 61 separate files (Adobe Reader allows this, but it's cumbersome when the files are so big). I provide two approaches where 100% of the text is aggregated in a single file.
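As a rough check on the "85% smaller" figure, here is the arithmetic in Python, using only the rounded totals quoted in this post (204mb for the ePM files, 30mb for my consolidated pdf). The variable and function names are mine, and the result is only as precise as those rounded sizes.

```python
# Rounded sizes in megabytes, as quoted in this post.
epm_total_mb = 204.0     # ePM files, per the correction in this post
consolidated_mb = 30.0   # my single consolidated pdf


def percent_smaller(new_size, old_size):
    """Return how much smaller new_size is than old_size, as a percentage."""
    return (1 - new_size / old_size) * 100


reduction = percent_smaller(consolidated_mb, epm_total_mb)
print(f"{reduction:.0f}% smaller")  # prints "85% smaller"
```

Using the 30.8mb folder of individual pdfs instead of the 30mb consolidated pdf gives essentially the same answer.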
WaPo has posted files that are not text-searchable. McClatchy has posted files that are text-searchable, but I've discovered they are not complete (I knew it didn't make sense that they add up to less than 11mb).
Here are text-searchable files that are complete and small. They are in 3 different forms.
A) DOJ consolidated.txt.zip is a 1.5mb zip file. When unzipped, this yields a text file of 5mb. This text file contains all the text from the original 61 files but no graphics, which is why the file is small.
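Because A is plain text, any tool that reads text can search it. Here is a minimal Python sketch of one way to do that. It assumes the unzipped file is named "DOJ consolidated.txt" and sits in the current folder. The function name `find_lines` and the search term are only illustrations, and the encoding handling is a guess on my part.

```python
from pathlib import Path


def find_lines(path, term):
    """Return (line_number, line) pairs whose text contains term, ignoring case."""
    needle = term.lower()
    hits = []
    # errors="replace" keeps the search going if the OCR output has odd bytes.
    with open(path, encoding="utf-8", errors="replace") as f:
        for number, line in enumerate(f, start=1):
            if needle in line.lower():
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    text_file = Path("DOJ consolidated.txt")  # assumed name of the unzipped file
    if text_file.exists():
        for number, line in find_lines(text_file, "Congress"):  # example term
            print(f"{number}: {line}")
```

Keep in mind that OCR errors mean an exact-match search like this can miss text that was scanned badly.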
B) DOJ consolidated.pdf is 30mb. This is a single text-searchable pdf that contains a compilation of all 61 files (including text and graphics). This pdf has 2994 pages [see update below; new version has 3003 pages]. On my machine navigating through it is quite manageable.
C) DOJ individual files.zip is a 28mb zip file. When unzipped, this yields a folder containing 61 text-searchable pdfs. This approach corresponds to the way the material was originally packaged. The folder size is 30.8mb.
I think that most people will find B to be the place to start (and I see it's the file that's been downloaded most frequently since I first posted this diary, over 300 times in the first two days). However, C will be useful for cross-referencing excerpts back to the original batch. When you communicate with reporters, for example, they'll want to know which original, official pdf contains the text you're citing. Using Adobe Reader to search in folder C will let you answer that question easily and give them an exact filename and page number. (Keep in mind that for C, I slightly reformatted the filenames to make them more readable, and to make them sort properly when presented in a list. However, it's easy enough to see how they correspond to the original, official filenames.)
Update, 3/22: Silly me! I only just noticed that the consolidated file (B) has bookmarks, added automatically by Adobe Acrobat when I first created the file. That means folder C probably has no purpose. The bookmarks in B can be used to navigate by file and page numbers that correspond to the original batch of files.
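What the bookmarks give you amounts to simple arithmetic: if you know how many pages each original file has, a page number in the consolidated pdf maps to a filename and a page within that file. The Python sketch below shows the idea. The three filenames and page counts are made up for illustration (they are not the real figures), and `locate_page` is a hypothetical helper of my own, not something Acrobat provides.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical files and page counts, in consolidated order. NOT the real figures.
page_counts = [("01-01.pdf", 40), ("01-02.pdf", 25), ("01-03.pdf", 60)]


def locate_page(consolidated_page, counts):
    """Map a 1-based page in the consolidated pdf to (filename, 1-based page)."""
    ends = list(accumulate(n for _, n in counts))  # last consolidated page of each file
    if not 1 <= consolidated_page <= ends[-1]:
        raise ValueError("page number is outside the consolidated pdf")
    # Index of the first file whose last page is >= consolidated_page.
    index = bisect_right(ends, consolidated_page - 1)
    start = ends[index - 1] if index else 0  # pages that come before this file
    return counts[index][0], consolidated_page - start


print(locate_page(41, page_counts))  # ('01-02.pdf', 1)
```

With these made-up counts, page 41 of the consolidated pdf is page 1 of the second file, because the first file takes up pages 1 through 40.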
Update, 3/24: I just noticed that, due to a clerical mistake on my part, file 3-9 was not included in the files I originally provided above. What was bookmarked as "03-09.pdf" (inside B, the consolidated pdf) was actually an exact duplicate of the next file, 3-10. In other words, 3-10 was included twice, and 3-9 wasn't included at all. Sorry!
For what it's worth, 3-9 contains no emails, and no information that was not previously available to Congress. It consists exclusively of letters exchanged between DOJ and Congress in the period 1/11/07 through 3/5/07.
All the above downloads (A, B and C) have now been replaced with updated files that address this problem. If you want to make sure that your searches include the material in file 3-9, you should download an updated version of A, B and/or C.
The 61 original files provided by the House contain, in aggregate, 3005 pages. The new version of the consolidated pdf contains 3003 pages. That's because two pages were skipped (file 2-7, page 50, and file 11-3, page 35). These pages are poorly printed tables that are hard to scan, and my OCR program decided to skip them. Other pages of tables were scanned, but not well. There are several dozen pages of tables (in file 11-3, for example), and they should be examined in the original files.
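The page count can be reconciled with a couple of lines of Python, using only the numbers given above. The variable names are mine.

```python
original_pages = 3005                  # total across the 61 House files
skipped = [("2-7", 50), ("11-3", 35)]  # (file, page) pairs the OCR program skipped

consolidated_pages = original_pages - len(skipped)
print(consolidated_pages)  # 3003
```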