What 20 years of “stolen” snippets teaches about managing AI generated code


 

For many years I led a team whose job was to track down who actually wrote the software that major corporations were shipping in their products. As part of this review, we created Software Bills of Materials (SBOMs) that listed not only the large open source libraries most people associate with SBOMs, but also bundles of individual source code files and even snippets of cut-and-pasted code.

 

In the last two cases, developers had gone out to the internet, found a blog or forum post containing code that solved their problem, and pasted it into their code base. Often they had asked a search engine a question and found a block of human-written code that almost exactly answered it. With a quick cut and paste and some minor editing, the developer would be on their way, programming problem solved.

 

The first major issue with these types of cut and pastes is non-compliance with the open source licenses under which the third-party code was released. The original authors published this code with the expectation that those licenses would be read, respected, and complied with.

 

In the vast majority of cases, the code was brought over without any embedded licensing information. It was either copied in silently or preceded by a comment in which the developer jokingly disclosed something like “Stolen from blog xyz!” You might say the second case at least gave a little credit to the original author, though the lawyers involved would always get upset.

 

Additionally, the quality and security of this code could be questionable. Typically the code was created to demonstrate a point, or to provide simple guidance on how to do a task, but lacked security guardrails or error handling.
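
To make that concrete, here is a small, hypothetical Python sketch of the pattern we saw over and over: a snippet that demonstrates an idea perfectly well on the happy path, next to the kind of hardening a production codebase actually needs. The function names and the config-loading scenario are invented for illustration.

    import json

    def load_config(path):
        # Typical "demo" code: works on the happy path, crashes on a missing
        # file or malformed JSON, and never explicitly closes the file handle.
        return json.load(open(path))

    def load_config_safely(path, default=None):
        # Closer to production quality: explicit error handling and cleanup.
        try:
            with open(path, encoding="utf-8") as f:
                return json.load(f)
        except FileNotFoundError:
            return default
        except json.JSONDecodeError as err:
            raise ValueError(f"{path} is not valid JSON: {err}") from err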

 

Through a lot of scanning and human analysis, many of these source code snippets were brought into compliance and made safe, while others had to be removed and independently rewritten to clear an open source license violation or quality problem.

 

Why does this happen?

 

Developers are under intense time pressure and will use any tools available to complete the work in front of them. Combine that with a common lack of understanding of open source licensing, the absence of source code origin tagging, and the difficulty of discovering and managing this code, and the path of least resistance lets this third-party code sit silently until an event such as an acquisition, a lawsuit, or a contractual scanning requirement causes it to be discovered, often at great cost to fix.

 

Policies were often in place, but a paper process with no teeth and no tooling to enforce it is pretty much worthless.

 

Additionally, the original code often lacked inline copyright and license statements, making it difficult for the average software developer to know what was expected of them.

 

Examining the way many developers interact with code from the developer help forum “Stack Overflow” shows that licenses are confusing and that developers don’t know, or care enough, to look for them. When a developer cuts and pastes code from Stack Overflow, that code by default comes under a copyleft-style license, the Creative Commons Attribution-ShareAlike (CC BY-SA) license. See https://stackoverflow.com/help/licensing
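
For teams that do decide, after review, to keep such a snippet, the minimum sensible step is to record where it came from and what license applies right next to the code. A hypothetical comment header might look like the sketch below; the URL, author, and format are placeholders, and attribution alone does not settle the ShareAlike obligations, so legal review is still needed.

    # Hypothetical attribution header for a snippet kept after review; the URL,
    # author, and dates are placeholders, and your legal team may require a
    # different format, or a different decision entirely.
    #
    # Source:   https://stackoverflow.com/a/XXXXXXX  (placeholder)
    # Author:   <Stack Overflow display name>
    # License:  CC BY-SA 4.0  https://creativecommons.org/licenses/by-sa/4.0/
    # Copied:   <date the snippet was copied>
    # Changes:  renamed variables, added error handling
    def snippet_from_stack_overflow():
        ...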

 

This can cause headaches later as the team tries to understand how this license affects their use of the snippet, attempts to relicense the code by contacting the original author, or tries to establish that the code is “not protectable,” which can itself be a difficult and risky process.

 

This pattern of cut and paste followed by legal review can cause a lot of wrangling and debate: “Why did they post this if they didn’t want me to use it?” or “Are we REALLY going to get sued?” or “Everybody ELSE is doing it, why shouldn’t we?”

 

These questions are very much in line with the concerns teams are now raising about AI generated source code.

 

How is it similar to the AI Code Generation problem right now?

 

The vast majority of code generated by AI/LLM code generators is emitted with no tagging or information tracking its origin, and this is unlikely to change in the near future. The companies producing these tools also frequently claim that the emitted code is not under an open source license. Meanwhile, this code is being generated in large quantities at all levels of an organization with no tracking, and any policies that do exist are often at odds with reality. For example, a policy that says “You will not use AI/LLM tools to generate source code for this project” is unlikely to be respected unless a clear, believable reason is communicated to each developer personally. A policy slide in a PowerPoint on a shared drive someplace might as well not exist.

 

Is there an AI Code Generation Problem?

 

We don’t know, and that is a tough place to be. Companies in the business of selling AI code generation claim there is no open source licensing problem. Content creators in other industries, such as newspapers, books, and film, are filing lawsuits challenging similar claims in fields of use outside source code. The next few years will be interesting.

 

From a security perspective, there is concern, and some evidence, that AI generated code may be of lower quality than human-written code, especially in areas where the developer requesting help lacks expertise.

 

What can be done to manage this problem?

 

 

  • Policies don’t often work – but are important

 

Policies around open source compliance and security are important, but they are only given as much respect as they earn. A one-off policy that is buried, too harsh, or given no dedicated time on the roadmap is destined to be ignored, whether out of ignorance, the hope that the policy will change, or the expectation that the developer will be long gone before it becomes an issue. Your legal team might know about the policy, but developers almost never will. Think about the impact of outright bans! It is important to explain why a policy exists and how to ask for it to be changed or for a variance to be granted.

 

 

  • What doesn’t work?

 

Expecting developers to add comments that are not present in the original code is very difficult. It’s a great requirement to have in your policy and guidelines, but you should expect that most people will not follow it. This is especially hard when the developer’s IDE is generating the code inline. It may be worth pushing back on the development tool vendors to request tagging or inline comments for generated code, though this seems to run against current trends.
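
If you do ask for provenance comments, it helps to hand developers a concrete convention they can copy rather than a vague requirement. The sketch below is purely hypothetical; the tag fields, tool name, and ticket reference are placeholders for whatever your own policy specifies.

    # Hypothetical provenance tag for generated code; the fields and format
    # are placeholders, not an industry standard.
    #
    # AI-GENERATED: tool=<assistant name>, date=<generation date>
    # Prompt summary: "parse an ISO-8601 timestamp from the start of a log line"
    # Human review: edited and tested by <developer>, tracked in <ticket id>
    from datetime import datetime

    def parse_log_timestamp(line: str) -> datetime:
        # Assumes the first whitespace-separated token is an ISO-8601 timestamp.
        return datetime.fromisoformat(line.split()[0])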

 

 

  • If the concern is quality, require tools to enforce quality

 

Whether the vulnerable code is written by your developer, generated by an AI tool, or brought in by an open source project, scanners exist to help discover it. They require expertise and time to run and to analyze the results, but that time should be factored into the decision to introduce outside code in the first place.
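
As one small, hypothetical illustration of what such scanners look for, here is a classic injection pattern alongside its fix; most SAST tools will flag the first function whether it was written by a person, pasted from a forum, or generated by an AI assistant.

    import sqlite3

    def find_user_unsafe(conn: sqlite3.Connection, name: str):
        # Flagged by most SAST tools: user input formatted straight into SQL,
        # a classic injection pattern regardless of who (or what) wrote it.
        return conn.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

    def find_user_safe(conn: sqlite3.Connection, name: str):
        # Parameterized query: the database driver handles quoting and escaping.
        return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()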

 

 

  • Tools might help, if they are used promptly

 

Whether it’s SCA scanners that do snippet matching to find unlabeled licensed code, or SAST/DAST scanners that discover vulnerable code, tools exist to help solve these problems. But they are often left to the end, when it’s too late to do anything, or they overwhelm the user with perceived false positives and a large number of real issues to resolve. It takes a trained eye to understand these results and make the required security changes. There is no easy button.

 

 

  • Make it clear if there are areas where NO risks can be taken

 

You may have extremely important areas of your codebase in which you can take no risks. This might be your core engine, your forecasting model, or whatever “secret sauce” you wish to wall off from AI generated code. This is the perfect place for an explicit policy communicated to everyone with check-in privileges for that area. A file in the directory containing the policy and expectations can help as well.
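
What such a file says matters less than the fact that it sits next to the code and cannot be missed at check-in time. A hypothetical sketch, with a file name and contact points your own teams would replace:

    AI-POLICY.txt  (hypothetical contents)

    This directory contains the core forecasting engine.
      - Do not check in AI-generated or copy-pasted third-party code here.
      - All changes require review and sign-off by a module owner.
      - To request an exception, contact <module owner> and <legal contact>.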

 

 

  • Understand that some areas might be more likely to have problems

 

If you are using AI to generate code for Linux device drivers, or to create functionality that integrates with a well known open source project, you may find yourself in a situation where the model gives you EXACTLY what you asked for. These areas of functionality are highly likely to be licensed under the GNU General Public License (GPL) or other copyleft/viral licenses, so you may get back source code that looks very much like the GPL-licensed ecosystem you asked about. This is an area to perform deep scans on, and to be prepared to release the result to the community under a strong open source license.

 

 

  • It’s hard to slice and dice the law and ethics

 

It can be confusing to employees why you respect some open source licenses but not others, or why you do in certain areas but not in others. Keep abreast of changes in the industry, especially around licensing and indemnification. Some AI vendors may indemnify your use of their code generation tools, but there may be hoops to jump through to make sure you are covered. What are the limits of that coverage, if it exists?

 

Where do we go from here?

 

The genie is out of the bottle: AI generated code is going into organizations’ codebases every day. Ignoring its existence is unlikely to be the best long-term strategy. It will find its way into your codebase, in a form that is very difficult to track, and if problems arise it will be difficult to remove and replace. Through education, thoughtful and well communicated policies, and proper tooling, your organization can use the current crop of AI tools while staying prepared for changes in the licensing, regulatory, and community aspects of their use. Watch how parallel industries deal with this issue and be prepared to quickly communicate policy and legal changes to your developers. Communicate the risks so that developers can better use their judgment while building your products. The next few years will be a time of serious change in the software industry. Understanding how you use these new tools, and their potential impact on your bottom line, can help you best manage the effect they have on your software.