A Proposal for Comment Tagging AI-Generated Source Code
Source code generated by “AI” tools like GitHub Copilot or OpenAI ChatGPT should be prefixed with a language-appropriate comment block stating that the code was generated by a tool, along with metadata that lets code scanners such as SCA or SAST tools discover and manage that code.
End users would obviously have the ability to remove this comment block, though I believe we all would be well served by marking all generated code with comments detailing the tool that generated it, the version or date generated, as well as inputs that might be helpful in understanding the conditions that caused this code to be generated.
AI-generated code, while still a new universe, appears to have a series of potential defects that existing and future code scanners will need to be on the lookout for. These include code hallucinations, missing cases, confidently wrong constants/algorithms, concerns around the license of the generated code, and other issues we have not yet discovered.
Having a clear machine-readable key that indicates that this code was AI generated allows for appropriate scanning, filtering, and metrics generation.
Code or data generated by AI based tools may have different standards of trust than code or data created or curated by human authors.
Parallels to SCA Snippet Analysis
There are parallels in the Software Composition Analysis (SCA) snippet scanning world, where awareness of generated code is very helpful when scanning or clearing scan results.
In the snippet analysis world, generated code is extremely similar to massive amounts of other open source code generated by the same tool. Therefore, performing snippet matching is often slow and resource intensive due to the sheer amount of similar snippets. This causes user pain due to slow scanning as well as a perceived large amount of “false positive” matches. There is also a belief that this generated code is “fine” which means it is often incorrectly ignored when it comes to SCA/SAST scanning due to the above issues.
Traditional non-AI code generators like the .NET IDE, ANTLR, Apache MyBatis, protobuf, etc. often tag their generated code with special comment strings and tags.
This allows SCA tools or SCA tool users to either ignore snippet matching for these files before scans are performed, automatically bucket or filter results afterward, or allow the end user to manage the results quickly through string matching.
One issue with these code generators is that the tags used are not standardized and require multiple methods to discover. The identifying strings include XML fragments, plain strings, JavaDoc style tags, custom tags, etc.
Future SCA/SAST tools can be even more nimble as they become more aware of the possible code generators that exist and apply appropriate scanning methods to the generated code depending on what needs to be discovered.
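Because the markers differ per generator, a scanner today ends up carrying a table of patterns, one per tool. The following Python sketch illustrates that idea. The pattern table, the function names, and the decision to inspect only the first 2,000 characters of a file are assumptions made for illustration; the patterns are taken from the example comments at the end of this post, not from any tool's documentation.

```python
import re

# Hypothetical pattern table: one regex per known generator marker.
# The strings come from the example comments quoted later in this post.
LEGACY_MARKERS = {
    "dotnet": re.compile(r"<auto-generated>"),
    "mybatis": re.compile(r"@mbg\.generated"),
    "ibator": re.compile(r"@ibatorgenerated"),
    "protobuf": re.compile(r"Generated by the protocol buffer compiler"),
    "openapi-generator": re.compile(r"auto generated by OpenAPI Generator"),
    "telosys": re.compile(r"Generated by Telosys"),
}


def detect_generator(source_text, head_chars=2000):
    """Return the name of the first generator whose marker appears near
    the top of the file, or None if no known marker is found."""
    head = source_text[:head_chars]
    for name, pattern in LEGACY_MARKERS.items():
        if pattern.search(head):
            return name
    return None


def bucket_files(files):
    """Split a {path: text} mapping into generated and unmarked paths,
    so generated files can be skipped or filtered before snippet matching."""
    generated, unmarked = {}, []
    for path, text in files.items():
        tool = detect_generator(text)
        if tool is None:
            unmarked.append(path)
        else:
            generated[path] = tool
    return generated, unmarked
```

Every new generator means another entry in the table, which is the maintenance cost a standardized tag would avoid.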
Qualifications of a good “generated by” comment
- Easy to parse by machine (oh the irony!)
- Easy to read and understand by a human
- Not too wordy so it will be left in place by the end user
- Not too wordy so that code generators decide to use it
- Explains what tool generated the code using a unique name
- Provides a version number or generated date so that eras of similar code can be examined with appropriate tools
- Does not change too quickly so that code generated by the same tool can easily be found with simple pattern matches or even greps
Future extension
In the future, the user text prompt that caused the code to be generated should be embedded as well.
Current AI code outputs are typically single pages and should therefore have a single-line comment.
Future code generators will generate entire applications and should have a larger banner with more details explaining the user prompts that generated the application.
A tool URL or project home URL (e.g. @generatorURL ) could be optionally used to prevent naming confusion and/or provide easy branding or publicity for the various tools
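As a rough sketch of how a generator might emit such a banner, the Python function below builds the comment lines for a given target language. It uses the `@generatedNote` and `@generatedBy` tags from the current proposal and the optional `@generatorURL` tag mentioned above. The `@generatedPrompt` tag name, the `build_banner` function, and the comment-prefix table are hypothetical and are not part of the proposal.

```python
# Assumed line-comment prefixes per language; extend as needed.
LINE_COMMENT = {"python": "#", "java": "//", "csharp": "//", "sql": "--"}


def build_banner(tool, version, language, url=None, prompt=None):
    """Return a comment banner for the given language as one string."""
    prefix = LINE_COMMENT[language]
    lines = [
        f"{prefix} @generatedNote This code was generated by an AI code generator tool.",
        f"{prefix} @generatedBy {tool} v{version}",
    ]
    if url:
        lines.append(f"{prefix} @generatorURL {url}")
    if prompt:
        # @generatedPrompt is a hypothetical tag name, not part of the proposal.
        # Collapse whitespace so a multi-line prompt stays on one comment line.
        one_line = " ".join(prompt.split())
        lines.append(f"{prefix} @generatedPrompt {one_line}")
    return "\n".join(lines) + "\n"
```

Calling `build_banner("CoolAIGenerator", "1.2.3", "java")` yields the two-line form, while passing `url` and `prompt` produces the larger banner. Keeping each tag on its own line means a plain grep for the tag name still finds it.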
Current Proposal:
// @generatedNote This code was generated by an AI code generator tool.
// @generatedBy CoolAIGenerator v1.2.3
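To show how little a scanner would need in order to consume these tags, here is a minimal Python sketch that extracts them from source text and tallies files per tool. The function names are made up for this example, and treating the last whitespace-separated token of `@generatedBy` as the version is an assumption; the proposal itself does not define a grammar.

```python
import re
from collections import Counter

# Matches the two proposed tags regardless of the comment syntax in front
# of them, and tolerates a trailing "*/" from block comments.
TAG_RE = re.compile(r"@(generatedNote|generatedBy)\s+(.+?)\s*(?:\*/)?\s*$")


def parse_generated_tags(source_text):
    """Return {tag_name: value} for the first occurrence of each tag."""
    tags = {}
    for line in source_text.splitlines():
        match = TAG_RE.search(line)
        if match:
            tags.setdefault(match.group(1), match.group(2))
    return tags


def split_tool_and_version(generated_by):
    """Assumption: the version is the last whitespace-separated token."""
    name, _, version = generated_by.rpartition(" ")
    if not name:
        return generated_by, None
    return name, version


def tool_counts(sources):
    """Count how many source texts carry each @generatedBy value."""
    counts = Counter()
    for text in sources:
        generated_by = parse_generated_tags(text).get("generatedBy")
        if generated_by:
            counts[generated_by] += 1
    return counts
```

The same lookup works for `//`, `#`, or `/* ... */` comments because the pattern keys on the tag name rather than the comment syntax.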
Examples of current comments from non-AI Code Generators:
/*
 * Created on 2022-11-27 ( 18:26:59 )
 * Generated by Telosys ( http://www.telosys.org/ ) version 3.3.0
 */

/**
 * Alertmanager API
 * API of the Prometheus Alertmanager (https://github.com/prometheus/alertmanager)
 *
 * The version of the OpenAPI document: 0.0.1
 *
 *
 * NOTE: This class is auto generated by OpenAPI Generator (https://openapi-generator.tech).
 * https://openapi-generator.tech
 * Do not edit the class manually.
 *
 */

//------------------------------------------------------------------------------
// <auto-generated>
//     This code was generated by a tool.
//     Runtime Version:4.0.30319.42000
//
//     Changes to this file may cause incorrect behavior and will be lost if
//     the code is regenerated.
// </auto-generated>
//------------------------------------------------------------------------------

/**
 *
 * This class was generated by MyBatis Generator.
 * This class corresponds to the database table CargoLocation_Data
 *
 * @mbg.generated do_not_delete_during_merge
 */

/**
 * This field was generated by Apache iBATIS ibator. This field corresponds to the database column t_quotation_product_detail.id
 * @ibatorgenerated Wed Oct 14 14:13:27 CST 2009
 */

# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: google/cloud/audit/audit_log.proto

/*******************************************************************************
**NOTE** This code was generated by a tool and will occasionally be overwritten.
We welcome comments and issues regarding this code; they will be addressed in
the generation tool. If you wish to submit pull requests, please do so for the
templates in that tool.
This code was generated by Vipr (https://github.com/microsoft/vipr) using the
T4TemplateWriter (https://github.com/msopentech/vipr-t4templatewriter).
Copyright (c) Microsoft Corporation. All Rights Reserved.
Licensed under the Apache License 2.0; see LICENSE in the source repository
root for authoritative license information.
******************************************************************************/