Managers in the legal industry face a data deluge. Metrics that bear on managing lawyers and legal issues are everywhere, commonly drawn from a vendor’s or another organization’s survey of law firms or corporate law departments. The plenitude of survey data is welcome, but it raises the stakes on the quality and reliability of the survey process.
The threshold question to be asked when reviewing a survey report is whether its underlying data and conclusions are reliable. One way to answer that question is to determine whether someone else, a neutral third party, could recreate the surveyor’s process (in a thought experiment, if not in practice) and reach similar conclusions. This article explores what data scientists call “reproducible data.”
Our legal industry would be better off if providers of survey data observed at least minimal standards of “data hygiene.” Survey data intended to guide legal managers will be more effective, more trusted, and will bring about management improvements more quickly if those involved observe the basic rules of data reproducibility.
Four Key Questions About Survey Data and Findings
There are four basic assessments of whether a survey has satisfied the essential requirements for reproducible data. If we feel comfortable that the answers are mostly yes, then the survey’s findings should be considered valid.
1. Did the surveyor summarize how they got their data and describe the respondents’ profiles?
If a report draws on survey data, it should explain clearly how the survey data was collected. It should summarize who was invited to take the survey, e.g., “In March 2015, we emailed a random sample of general counsel of U.S. companies from our client and prospect database,” and how it came upon that collection of people, e.g., “We obtained a list of email addresses from a reputable provider.” The report should state how many were invited and how many responded, e.g., “Our email invitation went to 2,500 general counsel, of which 250 responded (a 10 percent response rate).” It would also be instructive to say what the invitation said and whether there were any inducements offered to encourage participation.
It's unlikely that someone else could invite the same group of people, let alone have the same ones respond, but at least readers of the survey report could assess how representative the data is (or would be if they tried to duplicate the study).
Along with that background, the surveyor needs to give readers a sense of the "demographics" of the respondents. This means the numbers and strata of participants in the survey, including such variables as age, revenue, position, gender, or whatever is relevant, e.g., “About a third of the respondents were in companies with less than $1 billion in revenue, and nearly half of them were in the manufacturing industry.”
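To illustrate, here is a minimal sketch in Python of how a surveyor might tabulate such a demographic profile before reporting it; the figures and column names are hypothetical.

```python
# A minimal sketch of tabulating respondent demographics with pandas.
# The data and column names ("revenue_musd", "industry") are hypothetical.
import pandas as pd

responses = pd.DataFrame({
    "revenue_musd": [450, 2300, 800, 12000, 950, 610],   # company revenue in $ millions
    "industry": ["Manufacturing", "Tech", "Manufacturing",
                 "Finance", "Manufacturing", "Tech"],
})

# Share of respondents in companies with less than $1 billion in revenue
under_1b = (responses["revenue_musd"] < 1000).mean()
print(f"Under $1B in revenue: {under_1b:.0%} of respondents")

# Breakdown of respondents by industry
print(responses["industry"].value_counts(normalize=True))
```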
Readers need to know that the surveyor has gathered data from a reasonable number of people who are representative in the context of the conclusions. Survey results based on small numbers, especially those within a narrow demographic profile, such as “13 users of our software who are all in-house counsel, mostly with start-up tech companies,” should not be extrapolated to conclusions for all U.S. law departments.
2. Are the questions asked plausible and clear?
Academic researchers typically include in appendices the exact questions from surveys they have deployed. Wording can strongly influence the quality of the data obtained. For example, a leading question will distort answers, and multiple-choice questions raise many methodological risks. Simple yes/no questions are especially vulnerable to questionable interpretations, such as “Do you think law firms overcharge? Answer yes or no.” Ambiguity and conjunction in a question rob the data of usefulness, such as “How frequently did you use alternative fee arrangements or secondments?” That would be a very poor question because it doesn’t define “alternative fee arrangements” or “secondments”; it doesn’t limit the time period, such as “in the past 12 months”; it doesn’t distinguish between dollars spent and numbers of matters; and it conflates two different techniques.
3. Did the sponsor of the survey explain how they prepared the survey responses for analysis?
Another milestone in providing reproducible data is explaining the steps involved in preparing the survey data for analysis. Raw data is inevitably sloppy. Some people don't answer every question; some put in ambiguous text responses; others rank choices in reverse order, and so on. One particular challenge that deserves disclosure is what the surveyor did with unusually high or low values, known as “outliers,” such as a respondent who states a large company’s revenue as a mere $12,345 (surely a typo or mistake). The consequences of massaging and cleaning data can be considerable, so those crucial efforts need to be explained. Once again, the touchstone is whether someone else could follow your data-scrubbing process and reach the same final set of numbers. (Garbage in, garbage out is oh so true!)
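As an illustration, the following Python sketch flags implausible revenue values and sets them aside rather than silently deleting them, so the cleaning step can be retraced; the figures, column names, and threshold are hypothetical.

```python
# A minimal sketch of flagging implausible outliers during cleanup, assuming a
# hypothetical "revenue_musd" column (revenue in $ millions). Suspect rows are
# reported and set aside rather than silently deleted.
import pandas as pd

raw = pd.DataFrame({
    "respondent": ["A", "B", "C", "D"],
    "revenue_musd": [5200, 0.012345, 870, 14000],  # 0.012345 looks like a typo
})

# Treat revenues below $1 million as suspect for a sample of large companies
suspect = raw["revenue_musd"] < 1
outliers = raw[suspect]
clean = raw[~suspect]

print(f"Set aside {len(outliers)} suspect row(s):")
print(outliers)
print(f"Kept {len(clean)} rows for analysis")
```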
One of the decisions often made during cleanup is whether to use the full data set or a limited part of it. For instance, the surveyor might filter down to only law department respondents who are in companies of more than $100 million in revenue.
Here is where Excel and other spreadsheet packages harbor risks. Relying on them makes it much more difficult to keep an “audit trail” of changes made to the data, compared to scripting programs that track and store step-by-step data alterations.
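By way of illustration, a short script can apply each filter and record what it did, leaving the kind of audit trail a hand-edited spreadsheet would not. The following Python sketch uses hypothetical data and thresholds.

```python
# A minimal sketch of scripted, repeatable filtering with a simple audit trail.
# The data, column names, and filter thresholds are hypothetical.
import pandas as pd

audit_log = []

def filter_step(df, mask, reason):
    """Apply a filter and record how many rows it kept."""
    kept = df[mask]
    audit_log.append(f"{reason}: kept {len(kept)} of {len(df)} rows")
    return kept

survey = pd.DataFrame({
    "respondent": ["A", "B", "C", "D"],
    "revenue_musd": [5200, 60, 870, 14000],
    "role": ["law department", "law firm", "law department", "law department"],
})

survey = filter_step(survey, survey["role"] == "law department",
                     "Limit to law department respondents")
survey = filter_step(survey, survey["revenue_musd"] > 100,
                     "Limit to companies over $100 million in revenue")

print("\n".join(audit_log))
```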
4. Has the surveyor laid out the methods of data analysis?
Surveyors aggregate their data in different ways, calculate metrics about the results, such as means, medians and quartiles, and produce tables or graphics. It is vital for data reproducibility to lay out how the data has been analyzed. For example, did the surveyor convert revenue to a log scale, or center or normalize spending amounts? Conscientious surveyors should also specify any findings they chose not to include, to guard against selective presentation. Cherry-picking results destroys the value of the survey.
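For instance, the following Python sketch shows the kinds of transformations a report should disclose: converting revenue to a log scale and normalizing spending amounts. The figures and column names are hypothetical.

```python
# A minimal sketch of transformations that should be disclosed in a survey report.
# The data and column names are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "revenue_musd": [120, 950, 4800, 22000],        # company revenue in $ millions
    "outside_spend_musd": [1.2, 4.5, 18.0, 75.0],   # outside legal spend in $ millions
})

# A log scale compresses the wide range of company sizes
df["log_revenue"] = np.log10(df["revenue_musd"])

# Min-max normalization puts spending on a 0-1 scale for comparison
spend = df["outside_spend_musd"]
df["spend_normalized"] = (spend - spend.min()) / (spend.max() - spend.min())

print(df)
```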
If the findings include graphics, can a reader understand them? Principally, each chart should present one basic idea as clearly as possible so that a reader can look at the chart, understand the data it visualizes, and understand the takeaways.
If survey sponsors address the four questions above in a reasonably fair and complete way, they have earned the right to have their results evaluated seriously. In an ideal world, someone else would be able to follow the same path and obtain very similar findings. That is what it means to have conducted a survey with reproducible data.
With the key questions about provenance answered, the underlying data available, and a roadmap for how the data was massaged and presented, someone else can evaluate or test the methodology and corroborate the conclusions. This may be difficult with legal industry data because some of it is proprietary. No company wants the world to know how many dollars it spent on a particular law firm in the past year, to give but one example.
The ultimate compliance with the spirit of data reproducibility would be when the surveyor shares the actual data with a third party. If that were done, the data might be anonymized to avoid disclosure of sensitive data or to break any link between a specific respondent and specific data.
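A minimal Python sketch of that kind of anonymization might look like the following; the companies, columns, and salt value are hypothetical, and real de-identification usually requires more care (for example, coarsening rare values that could still identify a company).

```python
# A minimal sketch of anonymizing respondents before sharing survey data:
# drop direct identifiers and replace company names with salted hashes.
# All names, columns, and figures are hypothetical.
import hashlib
import pandas as pd

shared = pd.DataFrame({
    "company": ["Acme Corp", "Globex", "Initech"],
    "general_counsel_email": ["gc@acme.com", "gc@globex.com", "gc@initech.com"],
    "outside_spend_musd": [12.0, 44.5, 3.1],
})

SALT = "survey-2015"  # kept secret so hashes cannot be reversed by dictionary lookup

shared["respondent_id"] = shared["company"].apply(
    lambda name: hashlib.sha256((SALT + name).encode()).hexdigest()[:10])

# Remove the columns that link a specific respondent to specific data
anonymized = shared.drop(columns=["company", "general_counsel_email"])
print(anonymized)
```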
Adherents of reproducible data will sometimes post their data and code on publicly available websites such as GitHub. Others will provide at least the data in appendices, or they will send it to people who request it. The programming code used should have comments in it to explain to another person the function of each portion of the code.
Creating reproducible data is like revising a Word document using track changes, where the revisions are all shown and comments in the margin explain further. Combine that with the methodology of a legal research memorandum where all the cases are cited with an explanation of how each was found.
In the future, we may have third parties who hold survey data, at the very least, or certify that the data is as represented by the survey report. More than that, someone might run some simple calculations to see if the data is clean and has been analyzed reasonably (e.g., if data is missing, it is not a zero!). Data reproducibility should not be confused with an assessment of whether the quantitative analytics are smart or dumb, comprehensive or partial, or useful in the real world or not. It simply verifies that the collection and preparation of the data can be matched step for step so that the results can be corroborated.
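As a simple illustration, a third party might run a check like the following Python sketch to confirm that missing answers are recorded as missing rather than as zeros before any averages are computed; the figures and column name are hypothetical.

```python
# A minimal sketch of a sanity check: missing answers should be NaN, not zero,
# or averages will be understated. The data is hypothetical.
import numpy as np
import pandas as pd

answers = pd.Series([4.2, np.nan, 3.8, 5.1], name="outside_spend_musd")

print("Missing answers:", answers.isna().sum())        # counted, not averaged
print("Mean excluding missing:", answers.mean())        # pandas skips NaN by default
print("Mean if missing were treated as zero:",
      answers.fillna(0).mean())                          # would understate spending
```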
All of us who care about the contribution data can make to legal managers and their decisions should push for standards of data reproducibility in industry surveys.
Published July 17, 2015.