An experiment to test GitHub Copilot's legality

Preface

I am not a lawyer. This post is satirical commentary on:

The absurdity of Microsoft and OpenAI’s legal justification for GitHub Copilot.
The oversimplifications people use to argue against GitHub Copilot (I don’t like it when people agree with me for the wrong reasons).
The relationship between capital and legal outcomes.
How civil cases seem like sporting events where people “win” or “lose”, rather than opportunities to improve our understanding of law.

In the process, I intentionally misrepresent how the judicial system works: I portray the system the way people like to imagine it works. Please don’t make any important legal decisions based on anything I say.

The only section you should take seriously is “Context: the relevant technologies”.

Introduction

GitHub is enabling copyleft violation at scale with Copilot. GitHub Copilot encourages people to make derivative works of source code without complying with the original code’s license. This facilitates the creation of permissively-licensed or proprietary derivatives of copyleft code.

Unfortunately, challenging Microsoft (GitHub’s parent company) in court is a bad idea: their legal budget probably ensures their victory, and they likely already have a comprehensive defense planned. How can we determine Copilot’s legality on a level playing field? We can create legal precedent that they haven’t had a chance to study yet!

A chat with Matt Campbell about a speech synthesizer gave me a horrible idea. I think I know a way to find out if GitHub Copilot is legal: we could use its legal justification against another software project with a smaller legal budget. Specifically, against a speech synthesizer. The outcome of our actions could set a legal precedent to determine the legality of Copilot.

Context: the relevant technologies

Let’s cover the technologies and actors at play before I start my evil monologue.

Exhibit A: GitHub Copilot

GitHub Copilot is a predictive autocompletion service for writing software. It’s powered by OpenAI Codex, a language model based on GPT-3. It was trained using the source code of public repositories hosted on GitHub, regardless of their licensing. In response to a Request for Comments from the US Patent and Trademark Office, OpenAI claimed that “Artificial Intelligence Innovation”, such as code written by GitHub Copilot, should be considered “fair use”.^note 1

Many of the code snippets it suggests are exact copies of source code from various GitHub repositories. For an example, see this tweet by Armin Ronacher: “I don't want to say anything but that's not the right license Mr Copilot.” (Here’s an archive link that doesn’t require JavaScript, captured on 2022-07-01.) It contains a screen recording of Copilot suggesting this Quake code. When prompted to do so, it obediently fills in a permissive license. That permissive license violates the Quake code’s GPL-2.0 license. Copilot provides no indication that a license violation is taking place.

GitHub performed its own research into the matter.^note 2 You can read about it on their blog: GitHub Copilot research recitation, by Albert Ziegler. I’m not convinced that it accounts for the fact that suggested code might have mechanical alterations to match surrounding text, while still remaining close enough to the training data to be a license violation.
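To make the “mechanical alterations” concern concrete, here is a minimal sketch in Python. It is not GitHub’s methodology and says nothing about what counts as infringement. The function names (`normalized_tokens`, `similarity`) and both toy snippets are hypothetical, invented for illustration. The idea: replace every identifier with a placeholder and drop comments before comparing, so renamed variables no longer hide a copy.

```python
import difflib
import io
import keyword
import tokenize

# Token types that carry no code structure we care about for this comparison.
SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def normalized_tokens(source: str) -> list[str]:
    """Tokenize Python source, replacing non-keyword identifiers with "ID"."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")  # renaming a variable no longer changes the token
        else:
            result.append(tok.string)
    return result


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the normalized token sequences of a and b."""
    matcher = difflib.SequenceMatcher(None, normalized_tokens(a), normalized_tokens(b))
    return matcher.ratio()


# Hypothetical "training" snippet and a suggestion with renamed identifiers.
original = "def area(width, height):\n    return width * height\n"
suggested = "def size(w, h):\n    # computed area\n    return w * h\n"

raw_score = difflib.SequenceMatcher(None, original, suggested).ratio()
normalized_score = similarity(original, suggested)
print(raw_score, normalized_score)
```

The two snippets differ as raw text, but their normalized token sequences are identical, so `normalized_score` is 1.0. A check that only looks for verbatim matches would miss this kind of copy.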

Exhibit B: The Eloquence speech synthesizer

I recently had a chat with Matt on IRC about screen readers and different types of speech synthesizers. I mentioned that while I do like some variety, I always find myself returning to the underrated robotic voice of eSpeak NG. He shared some of my fondness, and also shared his preference for a similar speech synthesizer called Eloquence.

Downloads of Eloquence are easy to find (it’s even included with the JAWS screen reader), but I struggle to find any “official” pages about the original Eloquence. Nuance acquired Eloquent Technology, the developer of Eloquence. Microsoft later acquired Nuance.

Eloquence sample audio

Download audio file eloquence.mp3

Matt recorded this sample audio clip of Eloquence reading some text. The text is from the introduction of Best practices for inclusive textual websites.

Audio transcript

My primary focus is inclusive design. Specifically, I focus on supporting underrepresented ways to read a page. Not all users load a page in a common web-browser and navigate effortlessly with their eyes and hands. Authors often neglect people who read through accessibility tools, tiny viewports, machine translators, “reading mode” implementations, the Tor network, printouts, hostile networks, and uncommon browsers, to name a few. I list more niches in the conclusion. Compatibility with so many niches sounds far more daunting than it really is: if you only selectively override browser defaults and use plain-old, semantic HTML (POSH), you’ve done half of the work already.

I like the Eloquence speech synthesizer. It sounds similar to the robotic yet predictable voice of my beloved eSpeak NG, but with improved overall quality. Unfortunately, Eloquence is proprietary.

Exhibit C: Deep learning speech synthesis

Deep learning speech synthesis is a recent approach to speech synthesizer creation. It involves training a deep neural network on voice samples, and using the trained model to generate speech similar to a real human voice. One synthesizer using deep learning speech synthesis is Mozilla’s TTS.

Zero-shot approaches could allow a pre-trained model to generate multiple different voices. YourTTS is one such example. This could allow us to synthetically re-create a person’s voice more easily.

My horrible plan

My horrible plan revolves around going through two different lawsuits to set some judicial precedents; these precedents could improve the odds of succeeding in a lawsuit against Microsoft for Copilot’s licensing violations.

If this succeeds, we have new legal justification that GitHub Copilot is illegal; if it fails, we have still gained a means to legally re-create proprietary software. It’s a win-win situation.

Part One: set a precedent

Train a modern text-to-speech (TTS) engine using the voice of a proprietary one made by a company with a small legal budget. Keep the model’s internals hidden.
Then release the final TTS under a permissive license. Remember, we’re still keeping the machine-learning model hidden!
Wait for that company to file suit.^note 3
Win or lose the case.

Part Two: use that precedent against Microsoft’s Nuance

Our goal here is to get the same legal outcome as the low-stakes “trial run” of Part One.

Microsoft owns Nuance. Nuance previously bought Eloquent Technology, the developers of the Eloquence speech synthesizer.

Repeat Part One against Nuance speech synthesizers, including Eloquence. Go to court.
Have the ruling from Part One cited as legal precedent.
Achieve the same outcome as Part One, demonstrating that we have indeed set precedent that works against Microsoft’s legal department.

Implications of the outcomes

If we win both cases: Microsoft has the legal high ground. Making a derivative of a copyrighted work using a machine-learning algorithm allows us to bypass copyright licenses.

If we lose both cases: Microsoft does not have the legal high ground. We have good judicial precedent against Microsoft to use when filing suit for Copilot’s behavior.

Either way, it’s an absolute win for free software. Taking down Copilot protects copyleft from enabling proprietary derivatives (and by extension, protects software freedom). But if we accidentally win these two low-stakes “test” cases, we still gain something else: we can liberate huge swaths of proprietary software, starting with speech synthesizers.

Update: on satire

This post isn’t “satire through-and-through” like something from The Onion. Rather, my intent was to make some clear points, but extrapolate them to absurdity to highlight other problems. I don’t think I was clear enough when doing this. I’m sorry.

Copilot has been found to suggest significant amounts of code that is dangerously similar to existing works. It does this without disclosing obligations that come with those works’ licenses. Training a model on copyrighted works may not be wrong in and of itself; however, using that model to generate new works that are not sufficiently distinct from original works is where things get problematic. Copilot’s users could apply proprietary licenses to the generated works, defeating the point of copyleft.

When a tool almost exclusively encourages problematic behavior, the makers of that tool should have put thought into its implications. GitHub and OpenAI have not demonstrated a sufficiently careful approach.

I don’t think that “going after” a smaller player just to manipulate our legal system is a good thing to do. The fact that this idea seems plausible to some of my readers shows how warped our perception of the judicial system is. Even if it’s accurate (I doubt it’s accurate, but I’m not certain), it’s sad. Judicial systems incentivise too much predatory behavior.

Corrections

Updated on 2022-07-02: It has come to my attention that Eloquence may or may not still belong to Nuance. Further research is needed. Eloquent Technology was acquired by SpeechWorks in 2000.

Footnotes

See Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation (application/pdf) submitted by OpenAI to the USPTO
I doubt anybody worth their salt would count on a company to hold itself accountable, but at least they tried.
If the stars align, you could file an anticipatory suit against the company. This is common for declaratory judgements regarding intellectual property rights.