Structured Data Analytics for the Win
“When life gives you lemonade, make lemons. Life will be all like ‘Whaaat?!’” ~Phil Dunphy
Litigation Support Up to Bat; Structured Data Analytics in Play
In a former life long before Evolver, I was a bit of a clean-up batter – not in the sense of bottom-of-the-9th grand slams, mind you, but some pretty solid homers nonetheless. One tricky project where I had some success at the plate involved everyone’s favorite: the botched received production.
The opposing party mixed up TIFF images during their processing, which resulted in random wrong images being mated to the produced extracted text.
Here’s what we knew:
- The images’ Bates numbers matched extracted text but nothing else.
- The extent of the problem was unknown, which compromised the timely review of the production.
Finding mismatched documents in a 120,000-page production
Our client wanted to know the extent of the problem and asked if there was any way to find all the mismatched documents in the 120,000-page production without manually looking at each document. I felt like we were at the bottom of the 3rd inning and already five runs behind.
After some thought, I realized that kCura’s Relativity structured data analytics – specifically, textual near duplicate identification – could be applied to the problem: compare the produced extracted text against a set of OCR text we could generate from the images. This is not a standard use for the technology.
What is Textual Near Duplicate Identification?
Textual Near Duplicate Identification analyzes and compares the text of documents, assigning each document to a text group and scoring its similarity to that group’s principal document. This technology is ordinarily used to ensure textually similar documents are similarly coded or redacted. We creatively used this identification process to compare one set of text (produced extracted text) with another set of text (OCR derived from the images) to solve our problem.
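Relativity’s actual near-duplicate algorithm is proprietary, but the core idea of scoring textual similarity between two versions of a document can be sketched with Python’s standard-library `difflib`. This is purely an illustrative stand-in, not Relativity’s method; the `similarity` function and its whitespace normalization are my own assumptions:

```python
from difflib import SequenceMatcher

def similarity(text_a: str, text_b: str) -> float:
    """Return a 0.0-1.0 similarity score between two document texts.

    Whitespace is collapsed and case is folded first, so that OCR
    line-break and capitalization differences don't dominate the score.
    """
    a = " ".join(text_a.split()).lower()
    b = " ".join(text_b.split()).lower()
    return SequenceMatcher(None, a, b).ratio()

# Near-identical texts score close to 1.0; unrelated texts score low.
print(similarity("The quick brown fox.", "The quick brown fox!"))  # → 0.95
```

A correctly mated extracted-text/OCR pair should score near the top of the scale, with only OCR noise pulling it down, while a mismatched pair should fall well below it.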
Text similarity settings help us bring it home
We loaded all 35,000 source documents and the produced extracted text into Relativity. We then created a second copy of the 35,000 documents by running OCR on the produced images. In a perfect world, this process would produce textually similar groups, with each produced extracted text document serving as a group’s “principal.” The derived OCR text would slot into these groups, leaving no outliers.
But this is not a perfect world! We ended up with a suspect set of 6,000 of the 35,000 documents that did not group as expected. By tweaking the text similarity settings over a few more runs, we narrowed that down to a review set of ~1,100 documents needing manual review.
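The winnowing step above amounts to pairing each produced extracted text with its OCR counterpart by Bates number and flagging pairs whose similarity falls below a tunable cutoff. A minimal sketch, again using stdlib `difflib` in place of Relativity’s engine – the `corpus` data, Bates numbers, and 0.80 threshold are all hypothetical illustrations:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Score two texts 0.0-1.0, normalizing whitespace and case."""
    a = " ".join(a.split()).lower()
    b = " ".join(b.split()).lower()
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical corpus keyed by Bates number: each entry pairs the produced
# extracted text with OCR text derived from the produced image.
corpus = {
    "ABC000001": ("Quarterly revenue report for fiscal 2014.",
                  "Quarterly revenue reportt for fiscal 2014."),   # OCR noise only
    "ABC000002": ("Quarterly revenue report for fiscal 2014.",
                  "Meeting minutes, board of directors, March."),  # true mismatch
}

# Assumed cutoff; in practice this is what got tuned over several runs.
THRESHOLD = 0.80

# Documents whose two text versions disagree too much go to manual review.
suspects = [bates for bates, (extracted, ocr) in corpus.items()
            if similarity(extracted, ocr) < THRESHOLD]
print(suspects)  # → ['ABC000002']
```

Loosening the threshold shrinks the suspect pile but risks letting a real mismatch through; tightening it sweeps in documents that differ only by OCR noise, which mirrors the trade-off we worked through across those runs.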
Ultimately, our eDiscovery team found fewer than 10 total mismatched documents. The balance of the ~1,090 was attributed to OCR noise, since not all the produced text received was purely extracted text. The client was happy with the relatively low-cost tech solution to a sticky problem. They could now review and search these documents with peace of mind and without the delay of waiting for a full re-production. The crack of the bat is such a sweet sound.
About the Author: Terry Lundy is Evolver’s Associate Director of eDiscovery Consulting. Terry provides support, counsel, and litigation services to Evolver’s diverse clientele across all stages of the EDRM life cycle. When not unweaving the web of a complicated data set, Terry roots for the Padres, although he cannot explain why.