Like GitHub Copilot without Microsoft telemetry • The Register

Updated GitHub Copilot, one of several recent tools that use AI models to generate programming code suggestions, remains problematic for some users due to licensing concerns and the telemetry the software sends back to the Microsoft-owned company.

So Brendan Dolan-Gavitt, assistant professor in the computer science and engineering department at NYU Tandon in the United States, released FauxPilot, an alternative to Copilot that runs locally without phoning home to Microsoft’s mothership.

Copilot is based on OpenAI Codex, a GPT-3-based natural-language-to-code system that was trained on “billions of lines of public code” in GitHub repositories. This has made free and open source software (FOSS) advocates uncomfortable, because Microsoft and GitHub have failed to specify exactly which repositories informed Codex.

As Bradley Kuhn, a policy fellow at the Software Freedom Conservancy (SFC), wrote in a blog post earlier this year, “Copilot leaves copyleft compliance as an exercise for the user, a growing responsibility that only increases as Copilot improves. Users currently have no methods other than serendipity and guesswork to know whether Copilot’s output is copyrighted by someone else.”

Shortly after GitHub Copilot became commercially available, the SFC urged open source maintainers not to use GitHub in part due to its refusal to address concerns about Copilot.

Not a perfect world

FauxPilot does not use Codex. It is based on Salesforce’s CodeGen model. However, this is unlikely to satisfy FOSS advocates, because CodeGen was also trained on public open source code, regardless of the nuances of the various licenses involved.

“The models it’s using right now are ones that were trained by Salesforce, and were trained on basically all of the public GitHub code,” Dolan-Gavitt explained in a phone interview with The Register. “So there are still some problems, potentially, with licensing, that wouldn’t be solved by this.

“On the other hand, if someone with sufficient computing power comes along and says, ‘I’m going to train a model on GPL code only, or on code whose license allows me to reuse it without attribution,’ or something like that, then they could train their model, drop that model into FauxPilot, and use that model instead.”

For Dolan-Gavitt, FauxPilot’s main goal is to provide a way to run AI code-assistance software locally.

“There are people who have privacy concerns, or perhaps, in the case of work, company policies that prevent them from sending their code to a third party, and being able to run it locally definitely helps with that,” he explained.
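FauxPilot serves completions over an OpenAI-style HTTP API on the local machine, so code never leaves the box. The sketch below builds a request payload for such an endpoint; the URL, port, and engine name are assumptions for illustration, so check the FauxPilot documentation for the actual values your install uses.

```python
# Sketch of querying a locally hosted FauxPilot server, which exposes an
# OpenAI-compatible completions API. The host, port, and engine name below
# are assumptions -- consult your FauxPilot setup for the real ones.
import json

FAUXPILOT_URL = "http://localhost:5000/v1/engines/codegen/completions"

def build_completion_request(prompt: str, max_tokens: int = 64) -> dict:
    """Build the JSON payload for an OpenAI-style completion call."""
    return {
        "prompt": prompt,          # code context to complete
        "max_tokens": max_tokens,  # cap on generated tokens
        "temperature": 0.2,        # low temperature suits code completion
        "stop": ["\n\n"],          # stop generating at a blank line
    }

payload = build_completion_request("def fibonacci(n):")
print(json.dumps(payload))

# To actually send it (needs the `requests` package and a running server):
#   requests.post(FAUXPILOT_URL, json=payload).json()["choices"][0]["text"]
```

Because the request shape matches OpenAI's completions API, existing client tooling can often be pointed at the local server simply by overriding the base URL.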

GitHub, in its description of the data collected by Copilot, describes an option to disable the collection of code snippet data, which includes “source code you are editing, related files and other files open in the same IDE or editor, repository URLs and file paths.”

However, this does not appear to disable the collection of user engagement data: “user edit actions such as accepted and dismissed completions, and general usage and error data to identify metrics such as latency and feature engagement,” and potentially “personal data, such as pseudonymous identifiers.”

Dolan-Gavitt said he sees FauxPilot as a research platform.

“One thing we want to do is train code models that hopefully produce more secure code,” he explained. “And once that’s done, we’d like to be able to evaluate them, and maybe even test them with real users, using something like Copilot but with our own models. So that was kind of a motivation.”

However, this presents some challenges. “At the moment, it is somewhat impractical to try to create a dataset that does not have security vulnerabilities because the models are really hungry for data,” said Dolan-Gavitt.

“So they want a lot of code to train on. But we don’t have very good or foolproof ways of ensuring that code is bug-free. So it would be an immense amount of work to try to curate a dataset that was free of security vulnerabilities.”

That said, Dolan-Gavitt, who co-authored a paper on the insecurity of Copilot’s code suggestions, still finds AI assistance helpful enough to keep using it.

“My personal feeling about this is that I’ve had Copilot turned on pretty much since it came out last summer,” he explained. “I find it really useful. That said, I do have to double-check its output. But often it’s easier for me to at least start with something it gives me and then modify it into shape than to try to create it from scratch.” ®

Updated to add

Dolan-Gavitt warned us that if you use FauxPilot with the official Visual Studio Code Copilot extension, the latter will still send telemetry, though not the code completion requests, to GitHub and Microsoft.
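For those who still want the official extension, the usual workaround is to redirect its completion requests to the local server via VS Code’s settings.json. The snippet below is a sketch based on the keys FauxPilot’s documentation has described; these are undocumented debug options of the Copilot extension, and both the key names and the port are assumptions that may have changed, so verify against the project’s README.

```json
{
  "github.copilot.advanced": {
    "debug.overrideEngine": "codegen",
    "debug.testOverrideProxyUrl": "http://localhost:5000",
    "debug.overrideProxyUrl": "http://localhost:5000"
  }
}
```

Note that, per Dolan-Gavitt’s warning above, this reroutes only the completion traffic; the extension’s own telemetry still goes to Microsoft.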

“Once we have our VSCode extension working … the problem will be solved,” he said. This custom extension needs to be updated now that the InlineCompletion API has been finalized by the Windows giant.

So, essentially, basic FauxPilot doesn’t phone Redmond, though if you’re using FauxPilot with Visual Studio Code and want a completely Microsoft-free experience, you’ll need to grab the project’s own extension when it’s ready.