“Big data” is a buzzword used to refer to a collection of structured and unstructured data sets so large and complex they are difficult to process using traditional database management software techniques. Some have also suggested big data is identifiable by the “three Vs”: size (volume), the speed with which it is created (velocity), and the type of information the sets contain (variety).
There doesn’t appear to be a fixed point at which data becomes “big data,” but volumes of this size don’t sit on a single desktop machine. Big data sets might run to terabytes (1,024 GB), petabytes (1,024 TB), or exabytes (1,024 PB), and might consist of billions to trillions of records about millions of people, usually from different sources and possibly in a variety of formats. Marketing people, for example, get excited about this kind of customer data because they can develop search techniques to link those disparate bits together and form a clearer picture of a group’s (or person’s) shopping patterns.
Perhaps because of the sheer magnitude of its volume, big data can also refer to the technology, processes, and storage facilities required to handle it. Big data sets can generally be distinguished because they push existing technology infrastructure, particularly storage and processing, to its limits. The reason big data is a big deal becomes apparent as data sets get larger and require ever-increasing system sophistication. Most relational database management systems and desktop applications are too inefficient or lack the capacity to manage big data, which can require massively parallel software running on hundreds or thousands of servers.
Big data is an industrial volume of data, made possible within large organizations and also through increasing access to and use of the cloud. The cloud can be analogized simplistically to an off-site electronic storage facility. The technologies that have made daily use of parallel-server-based off-site computing possible have contributed to big data.
What I’ve learned about big data is this: although size matters, much depends on how you handle it. Which brings us to what big data might mean to e-discovery.
If we approach big data sets with existing tools, techniques, and workflows, we in the e-discovery world are simply not ready for truly big data, at least not if we want to retain credibility with our clients. For this reason, I think big data might mean we’re at the end of one way of doing things and at the beginning of new methods and ways of thinking about e-discovery.
Let’s start with tools: Many e-discovery systems and applications are still built on relational databases, and their performance degrades, sometimes significantly, with extremely large data sets. So trying to bring a big data set into most traditional e-discovery tools will often be an exercise in frustration.
Then, there will be challenges with traditional approaches to search. Keyword and Boolean searches will be insufficient to understand a data set of this size. Suppose your keyword searches return 3,120,373 responsive word hits in a data set of 25,034,863,845 documents. What does this really mean? Have you, as a lawyer, done your diligence? Can you say you understand the data set well enough to make the certifications and attestations the modern discovery effort requires? And the economics of document review with data sets this size are, well, uneconomic, even with outsourcing.
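To put those figures in proportion, here is a minimal Python sketch of the arithmetic. The hit and document counts are the ones quoted above. Everything else is an assumption made only for illustration: the review rate of 50 documents per reviewer-hour is hypothetical, and treating each word hit as one document to review is a simplification, since several hits can fall in the same document.

```python
# Figures quoted in the paragraph above.
hits = 3_120_373
documents = 25_034_863_845

# Share of the data set touched by the keyword searches, treating each
# word hit as one document (a simplification: hits can share a document).
hit_rate = hits / documents
docs_per_hit = documents / hits

# HYPOTHETICAL review throughput, not taken from the article.
ASSUMED_DOCS_PER_REVIEWER_HOUR = 50


def review_hours(doc_count, docs_per_hour=ASSUMED_DOCS_PER_REVIEWER_HOUR):
    """Reviewer-hours needed to look at doc_count documents once."""
    return doc_count / docs_per_hour


print(f"Hit rate: {hit_rate:.4%}")
print(f"Roughly one hit per {docs_per_hit:,.0f} documents")
print(f"Reviewing the hits alone: {review_hours(hits):,.0f} reviewer-hours")
print(f"Reviewing everything: {review_hours(documents):,.0f} reviewer-hours")
```

Under these assumptions the searches touch only about one document in eight thousand, and what they say about the rest of the data set is exactly the diligence question. Changing the assumed review rate rescales the hours but not the shape of the problem.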
There is a current trend of choking off the volume of data at the source, so a smaller amount of data will flow into the e-discovery work stream. Clients, weary of expanding e-discovery bills related only to data volume and not underlying litigation risk economics, are bringing some e-discovery functions in-house, developing and investing in tools and processes for managing their data so they can make efforts to parse out only what is required for litigation. Less data into the traditional e-discovery workflow reduces the work and cost downstream. However, when it comes to big data, this approach may just shift these challenges to the client, rather than a law firm or e-discovery vendor.
So the answer likely lies in new ways of doing things: brand new technologies and processes, involving new skill sets and, likely, yet-to-be-developed rules. The advent of big data has introduced new techniques for managing and mining data sets, as well as new processing and storage capabilities. What we see with the introduction and rise of computer-assisted review (or predictive coding) is just the beginning. The introduction of analytic and data-mining tools, and the greater use of parallel and grid processing and searching, will have major implications for e-discovery practitioners. Indeed, in the near future, an e-discovery practitioner may not design document review strategies at all, but instead design processes for certifying search analytics, working closely with linguists and data analysts to do so. (This assumes the Rules of Civil Procedure still require production of all records potentially responsive to material issues in dispute in the same way, if at all.)
But there’s more: Applied in the context of litigation, these innovations may help with case analysis or proactive risk management by helping to identify correlations between disparate case elements and gain earlier intelligence about the probabilistic outcome of cases. But these promises may only be realized by new ways of thinking about discovery efforts.
E-discovery has tended to become a goal unto itself, with practitioners and vendors ticking off all the boxes for each phase, striving for completeness so opponents cannot criticize process. At some point, we risk forgetting that in the litigation contest, e-discovery serves the fact-finding process for trial, and few documents are ever tendered into evidence. The new technologies permit critical records to be surfaced early in the litigation process, which should permit the earlier discovery of material facts and, possibly, a speeding up of the litigation cycle. Wouldn’t that in itself have some significant impacts?
Dera J. Nevin is managing counsel, e-discovery, TD Bank Group. She can be reached at email@example.com. The opinions expressed in this article are her own.