A database of electronic documents to review presents a number of challenges, including how to find relevant and privileged documents in a quick and cost-effective manner. By taking simple steps to improve the search context, you could improve your review efficiency and accuracy.
By search context, I do not mean understanding what the legal issues are (although this is important, too), but rather understanding how the database containing the electronic evidence will respond to your proposed search. Search context can be improved with attention to three things: the state of the index; the searchability of the documents; and the bias of the search engine or search feature.
First, determine the state of the document database index. Understanding what is in the index and how it is built is necessary, because the quality of the index is critical to effective search. In almost every document database, the computer is not actually searching the documents in a literal sense. Instead, the computer is searching the index. The index can be analogized to a table of contents, or a detailed index in the back of a textbook; when a user types a keyword, the computer looks first to the index and then uses the index to locate and highlight the indexed words inside the documents. If a word is not in the index, the computer may not be able to find it in the documents. That is why a well-built index is essential to search.
Indexes are built using index engines. Different document review tools contain different index engines, and not all index engines behave the same way or include the same words or characters in their lists. It is therefore important to check how the program builds its index if you want to understand whether your searches will generate the results you want. Start by asking what the index engine is in your tool, and what it can index. Are there some words or figures it will not index? Can parts of words or numbers be indexed? What about punctuation, special characters, or elements of foreign languages?
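To make the point concrete, here is a minimal sketch in Python of a toy index. It is not how any particular review tool works; the stop-word list, the minimum token length, the rule that punctuation splits words, and the document identifiers are all assumptions chosen for illustration. It shows how choices made when the index is built decide which searches can succeed.

```python
import re
from collections import defaultdict

# Hypothetical indexing rules; real index engines differ and are often configurable.
STOP_WORDS = {"the", "a", "of", "and", "to"}
MIN_TOKEN_LENGTH = 2


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) >= MIN_TOKEN_LENGTH and t not in STOP_WORDS]


def build_index(documents):
    """Map each indexed token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, term):
    """Look the term up in the index only; the documents are never scanned."""
    return sorted(index.get(term.lower(), set()))


documents = {
    "DOC-001": "Invoice #4471-B sent to the client re: R&D budget.",
    "DOC-002": "The budget of the R&D group was approved.",
}
index = build_index(documents)

print(search(index, "budget"))  # both documents
print(search(index, "R&D"))     # nothing: "&" split the term and "r", "d" were too short
print(search(index, "4471-B"))  # nothing: the hyphen split the number from the letter
print(search(index, "4471"))    # DOC-001
```

Under these assumed rules, a search for "R&D" or "4471-B" returns nothing even though both strings plainly appear in the documents. That is the kind of behaviour the questions above are meant to uncover.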
Second, determine whether all the documents in your database that contain text are searchable. A document may show text in its image without that text being available to the index engine, in which case the text is not searchable. For example, PDF documents that were created as images, or converted to images, are not necessarily searchable. I encounter this often with PDFs that are attachments to e-mails. Often, these PDFs have been created on an office scanner whose settings were not configured to create searchable text within the scanned document. When there are unsearchable files in your document population, you may wish to run a secondary process across those documents to make them searchable. Two related processes, optical character recognition and optical word recognition, can make the letters and words in those images available to the index engine, and therefore searchable.
Most litigation support systems that process files can produce a report that identifies files and file types that have not been indexed. Where a file type that could contain or that does contain text exists, but is un-indexed, you can send those files for additional processing, and rerun the index engine across those documents (or across the extracted text from those documents). It might also be useful to give any service providers you work with standing instructions to make text searchable where text is indicated or available.
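As an illustration of what such a report does, the following Python sketch flags records whose file type could contain text but for which no text was extracted. The record layout (doc_id, extension, extracted_text) and the list of text-capable extensions are assumptions for this example, not the export format of any particular system.

```python
# Hypothetical set of file types that could hold text; adjust to your own population.
TEXT_CAPABLE_EXTENSIONS = {".pdf", ".tif", ".tiff", ".docx", ".msg"}


def find_unsearchable(records):
    """Return IDs of documents that could contain text but have none extracted."""
    flagged = []
    for record in records:
        extension = record["extension"].lower()
        text = record.get("extracted_text") or ""
        if extension in TEXT_CAPABLE_EXTENSIONS and not text.strip():
            flagged.append(record["doc_id"])
    return flagged


records = [
    {"doc_id": "DOC-010", "extension": ".pdf", "extracted_text": "Signed agreement dated ..."},
    {"doc_id": "DOC-011", "extension": ".PDF", "extracted_text": ""},    # scanned image only
    {"doc_id": "DOC-012", "extension": ".tif", "extracted_text": None},  # never processed
    {"doc_id": "DOC-013", "extension": ".wav", "extracted_text": ""},    # audio, not text
]

print(find_unsearchable(records))
```

The flagged documents are the candidates to send for optical character recognition before the index engine is rerun.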
Third, determine how the search engine within your document review platform works. Search engines are not neutral: each one has been programmed to work and return results in a certain way. This is a search bias, and it is advisable to understand how those biases operate and whether your search strategy needs to change as a result. For example, in many databases containing law, the newest results are presented first. Usually, that is not a problem, and we adjust our review of the results accordingly. However, one can imagine a situation in which the newest results (from a local court) are not as important as those from an older but more persuasive authority (for example, a Supreme Court of Canada decision). In litigation support databases, the search engine can affect the presentation of results, usually through the order in which results are returned. Understand how the search engine returns results and make any adjustments to your search strategy that may be necessary to offset the bias.
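The effect of ordering can be sketched in a few lines of Python. The result records, the "hits" count, and the two sort orders here are invented for illustration; your platform's ranking rules may be quite different. The point is only that a reviewer who looks at the first few results sees a different set depending on how the engine orders them.

```python
def top_n(results, n, order="date"):
    """Return the IDs of the first n results under the chosen ordering."""
    if order == "date":
        # Newest first; ISO-formatted dates sort correctly as plain strings.
        ranked = sorted(results, key=lambda r: r["date"], reverse=True)
    elif order == "hits":
        # Most keyword occurrences first.
        ranked = sorted(results, key=lambda r: r["hits"], reverse=True)
    else:
        raise ValueError(f"unknown order: {order}")
    return [r["doc_id"] for r in ranked[:n]]


# Hypothetical search results for a single query.
results = [
    {"doc_id": "DOC-101", "date": "2011-03-02", "hits": 1},
    {"doc_id": "DOC-102", "date": "2009-06-15", "hits": 9},
    {"doc_id": "DOC-103", "date": "2011-01-20", "hits": 2},
]

print(top_n(results, 2, order="date"))  # DOC-101, DOC-103
print(top_n(results, 2, order="hits"))  # DOC-102, DOC-103
```

In this example, the document with the most keyword hits never appears in the first two results when the engine sorts by date.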
Understanding the context of your search helps you design appropriate searches and validate your search results. Now, check your search tool and determine whether it offers a larger range of functionality than keyword and Boolean searching. Some search tools offer fuzzy searching and thesaurus-based searching. These features can enhance searches.
Determine as well whether you need a tool with advanced search analytics that can recognize related concepts. Consider this: using the keyword “dog” will turn up only items in which that specific word is used. However, in a case about a dog, relevant documents may also contain the terms poodle, dachshund, canine, puppy, and Rover, as well as the related words “walk” and “bone.” Such documents would not appear in any search using the keyword “dog” unless the tool has functionality, such as word- or concept-clustering or thesaurus-based features, that would also return those documents. I use this example both to show the limitations of relying solely on keywords and to prompt inquiry into what other search functionality may be available or required.
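The “dog” example can be sketched in Python as follows. The thesaurus entries and sample documents are made up for illustration, and real thesaurus or concept features are considerably more sophisticated than this simple term expansion.

```python
import re

# Hypothetical thesaurus; a real tool would ship or let you load its own.
THESAURUS = {"dog": {"poodle", "dachshund", "canine", "puppy"}}


def words(text):
    """Return the set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search(documents, keyword, use_thesaurus=False):
    """Return IDs of documents containing the keyword or, optionally, its synonyms."""
    terms = {keyword.lower()}
    if use_thesaurus:
        terms |= THESAURUS.get(keyword.lower(), set())
    return sorted(doc_id for doc_id, text in documents.items() if words(text) & terms)


documents = {
    "DOC-201": "The dog was off its leash.",
    "DOC-202": "Her poodle bit the courier.",
    "DOC-203": "Rover needs a walk.",
}

print(search(documents, "dog"))                      # DOC-201 only
print(search(documents, "dog", use_thesaurus=True))  # DOC-201 and DOC-202
```

Note that even the thesaurus search misses the third document: a pet's name such as Rover, or a related word such as “walk,” is the kind of connection that would require concept-based analytics rather than a synonym list.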
Finally, do a quality assurance check on your search results. Do both a qualitative (testing related concepts) and a quantitative (testing a certain proportion of the remaining records) check to validate your search. Check a sample of the records not returned in search to verify that nothing was inadvertently omitted. Take good notes throughout your search, or save system reports, so that you can rely on those if ever your search results are challenged.
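One way to approach the quantitative check is to draw a random sample from the records your search did not return and have a reviewer examine it. The Python sketch below assumes you can export document IDs for the full population and for the search hits; the function names, the sample size, and the fixed random seed (kept so that the sample can be reproduced and documented) are illustrative choices rather than a prescribed protocol.

```python
import random


def sample_unreturned(all_ids, hit_ids, sample_size, seed=0):
    """Draw a reproducible random sample from documents the search did not return."""
    unreturned = sorted(set(all_ids) - set(hit_ids))
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn later
    return rng.sample(unreturned, min(sample_size, len(unreturned)))


def missed_rate(sample_ids, relevant_ids):
    """Fraction of the sampled documents that a reviewer found to be relevant."""
    if not sample_ids:
        return 0.0
    relevant = set(relevant_ids)
    missed = [doc_id for doc_id in sample_ids if doc_id in relevant]
    return len(missed) / len(sample_ids)


# Hypothetical population of 1,000 documents, 150 of which the search returned.
all_ids = [f"DOC-{n:04d}" for n in range(1, 1001)]
hit_ids = all_ids[:150]

sample = sample_unreturned(all_ids, hit_ids, sample_size=50, seed=42)
print(len(sample), sample[:3])

# After review, suppose two of the sampled documents turned out to be relevant.
reviewer_found_relevant = sample[:2]
print(missed_rate(sample, reviewer_found_relevant))  # 2 of 50
```

A high proportion of relevant documents in the sample suggests the search terms need revisiting. Saving the seed, the sample, and the reviewer's findings gives you a record to rely on if the search is later challenged.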
Dera J. Nevin is the senior director, litigation support, and e-discovery counsel at McCarthy Tétrault LLP. A practising lawyer, she also oversees the firm’s e-discovery operations and can be reached at [email protected].