Are you cleaning your sidewalks fast enough?

Companies all look the same when the weather is good.

Most days it doesn’t matter. Rain, sunshine, wind. These might affect other parts of your business but not the safety of the path out front. People don’t notice too much, maybe there’s a small crack, a little trash in a corner, but it’s passable, clear and not a danger to people walking by.

Then the snowstorm hits….

The next day, do people see a clear sidewalk or a dangerous icy patch?

Are you running business as usual, but not fixing a clear danger out front?

What does this tell your customers or potential customers about how much you care? Is it telling them that you are understaffed or failing? What other health and safety issues are you neglecting? Will you blame your users for not being safe if they get hurt?

This is the case with CVEs and other vulnerabilities. When the CVE of the day gets publicized, are your response plans nimble? Can you patch a library and get a new release out quickly? Can you publish a VEX document saying “we’re not affected”? Is it always an emergency, or do you have a playbook at the ready?

As a user, which of your vendors was a month late getting you an update? Which one never patched at all? Did you read about breach in the news but not get clear alerts from them in the first place?

Your business might still be open after the storm, but your users will see how you handle the icy patches in the meantime….

What 20 years of “stolen” snippets teaches about managing AI generated code

 

For many years I led a team whose job it was to track down who actually wrote the software that major corporations were shipping in their products. As part of this review, we created Software Bills of Materials (SBOMs) containing lists of the large open source libraries most people think of when they think of SBOMs, but also information concerning bundles of individual source code files and even snippets of cut and pasted code. 

 

In the last two cases, developers had gone out to the internet, found a blog or internet forum containing code that solved their problem and cut and pasted it into their code base. They often asked a search engine a question and found a block of code created by a human that almost exactly answered that question.  With a quick cut and paste and some minor editing the developer would quickly be on their way with the programming problem solved.

 

The first major issue with these types of cut and pastes is the lack of compliance with open source licensing that this third party code was released under. The original authors published this code with the expectation that these licenses would be read, respected and complied with.

 

In the vast majority of cases, the code was brought over without any embedded licensing information and in many cases either silently copied in or seen following a comment by the developer jokingly disclosing something like “Stolen from blog xyz!”. At least you might say the second case actually gave a little bit of credit to the original author, though the lawyers involved would always get upset.

 

Additionally, the quality and security of this code could be questionable. Typically the code was created to demonstrate a point, or to provide simple guidance on how to do a task, but lacked security guardrails or error handling.

 

Through a lot of scanning and human analysis, many of these source code snippets were brought into compliance and safety while others required removal and an independent re-write in order to clear an open source license violation or quality problem.

 

Why does this happen?

 

Developers are under intense time pressure and will use any tools available in order to complete the work they have in front of them. This, combined with a common lack of understanding of open source licensing, lack of source code origin tagging and the difficulty of discovering and managing this code means that the path of least resistance allows this third party code to sit silently until an event like a M&A or lawsuit or contractual scanning requirement occurs that causes it to be discovered, often at great cost to fix.

 

Policies were often in place, but a paper process with no teeth and without tooling to enforce is pretty much worthless. 

 

Additionally, the original code often lacked inline copyright and license statements making it difficult for the average software developer to know what the expectations on them were.

 

By examining the way that many developers interact with code from the developer help forum “Stack Overflow” shows us that licenses are confusing and developers don’t know or care enough to look for them. When a developer cuts and pastes code from Stack Overflow, the code by default comes under a Copyleft style open source license, the Creative Commons Share Alike License (CC-SA). See https://stackoverflow.com/help/licensing 

 

This can cause headaches later on as the team tries to understand how this license affects their use of the snippet, makes effort to relicense this code by contacting the original author, or to somehow decide the code is “not protectable” which can be a difficult and risky process itself. 

 

This type of cut and paste and then legal review can cause a lot of wrangling and debate. “Why did they post this if they didn’t want me to use it?” or “Are we REALLY going to get sued?” or “Everybody ELSE is doing it, why shouldn’t we?”

 

These questions are very much in line with the concerns and questions teams are having around AI generated source code.

 

How is it similar to the AI Code Generation problem right now?

 

The vast majority of code generated by AI/LLM code generators is emitted with no tagging or information tracking its origin. This is unlikely to change in the near future. Additionally, there is often a claim by the companies generating this code that the emitted code is not under an open source license. This code is being generated in large quantities at all levels of an organization with no tracking and policies that, even if in place, are at odds with reality. For example a policy that says “You will not use AI/LLM to generate source code for this project” is unlikely to be respected unless a clear believable reason is communicated to each developer personally. A policy slide on a Powerpoint on a shared drive someplace might as well not exist.

 

Is there an AI Code Generation Problem?

 

We don’t know and that is a tough place to be in. Companies who are in the business of selling AI Code generation are claiming there is no open source licensing problem. Content creators in other industries like Newspapers, Books and Cinema are filing lawsuits against similar claims in non-source code related fields of use. The next few years will be interesting. 

 

From a security perspective, there is concern and some evidence that AI generated code may be of lower quality than that human generated code especially in areas that the developer requesting help lacks expertise.

 

What can be done to manage this problem?

 

 

  • Policies don’t often work – but are important

 

Policies around Open Source compliance and security work are important but are only given as much respect as they earn. A one-off policy that is buried, is too harsh, or that has no dedicated time on the roadmap and timetable is destined to be ignored either out of ignorance or perhaps the hope that the policy will change in the future or that the developer will be long gone before it’s an issue. Your Legal team might know about the policy but developers will almost never know. Think about the impacts of Bans! It is important to explain why a policy exists and how to ask for permission to change it or get permission for a variance.

 

 

  • What doesn’t work?

 

Expecting developers to add comments that are not present in the original code is very difficult. It’s a great requirement to have in your policy and guidelines, but you should expect that most people will not follow the request. This is especially difficult if the developers IDE itself is generating the code inline. It may be helpful to push back on the dev tool manufacturers to request tagging or inline comments for generated code, but it does seem that this is against current trends.

 

 

  • If the concern is Quality, require tools to enforce quality

 

Whether the vulnerable code is written by your developer, generated by an AI tool, or brought in by an open source project, scanners exist to help discover vulnerable code. They do require expertise and time to run and analyze results, BUT this time should be brought into the calculation as part of the decision to introduce code from the outside in the first place.

 

 

  • Tools might help, if they are used promptly

 

Whether its SCA scanners that do Snippet Matching to find unlabeled licensed code, or SAST/DAST scanners to discover vulnerable code, tools exist to help solve these problems, but are often left to the end when it’s too late to do anything, or overwhelm the user with perceived false positives and large amount of real issues to resolve. It takes a trained eye to understand these results and make the required security changes. There is no easy button.

 

 

  • Make it clear if there are areas where NO risks can be taken

 

You may have extremely important areas of your codebase that you can take no risks in. This might be your core engine, core forecasting model or “secret sauce” that you may wish to wall off from AI generated code. This is a perfect place to have an explicit policy communicated to anyone with check in privileges to this area. A file in this directory containing policies and expectations can be helpful as well.

 

 

  • Understand that some areas might be more likely to have problems

 

If you are using AI to generate code for Linux Device Drivers or to create functionality to integrate with a well known open source project, you may find yourself in a situation when the model gives you EXACTLY what you are asking it for. If these areas of functionality are highly likely to be licensed under the terms of the General Public License (GPL) or other copyleft/viral licenses, you may get back source code that sure looks like the GPL licensed ecosystem you asked for.  This is an area to perform deep scans on and to be prepared to release to the community under a strong open source license itself.

 

 

  • It’s hard to slice and dice the law and ethics

 

It can be confusing to employees why you respect some open source licenses, but not others. Or why you do in certain areas but not others. Keep abreast of changes in the industry, especially around licensing and indemnification. Some AI vendors may indemnify your use of their code generation tools but there may be hoops to jump through in order to make sure you are covered. What are the limits of this coverage if it exists?

 

Where do we go from here?

 

The genie is out of the bottle, AI generated code is going into organizations’ codebases every day. Ignoring its existence is unlikely to be the best long term strategy. It will find its way into your codebase, and be in a form that will make it very difficult to track. If problems arise, it will be difficult to remove and replace. Through education, thoughtful and well communicated policies and proper tooling your organization can use the current crop of AI tools while being prepared for changes in the licensing, regulatory and community aspects of their use. Watch how parallel industries deal with this issue and be prepared to quickly communicate policy and legal changes to your developers. Communicate the risks so that developers can better use their judgment while building your products.  The next few years will be a time of serious change in the software industry. Understanding how you use these new tools, and the potential impact on your bottom line, can help you best manage the impact these tools have on your software.

 

Your first SBOM is going to stink. Don’t panic, get started fixing it!

Your first SBOM is going to stink, that means you need to get started now to fix it up enough to share it. 

 

It seems like everybody but you is showing off their shiny new SBOM. You know you have to get started but you’re worried about what you’re going to find. I’m here to tell you that your first SBOM is going to stink, everybody’s does. If they tell you that it came out perfectly they’re either lying or their SBOM is woefully incomplete,  So let’s rip off the Band-Aid, get our scanning tools warmed up and work on getting a SBOM produced that you can stand behind and that won’t embarrass you or get you in trouble.

There’s a few common areas that SBOMs will have problems in.  These include Completeness, Depth, Unremediated Vulnerabilities, Open Source License Violations, and Over Delivery.

Each of these areas can cause rework, missed deadlines, loss of sales and even legal problems. 

The last thing you want is to deliver a product or a SBOM to your customer and have a previously unknown set of vulnerability and license compliance issues be sent back over to you with a timetable for resolution not of your own setting.

 

Let’s first talk about completeness

What I mean by completeness is that you examined all the code bases that are part of your project, you used a scanning tool that was capable of generating SBOM information for the type of libraries and artifacts you depend on, and that your Software Composition Analysis (SCA)  tool is configured correctly in order to produce SBOM information from whatever repository manager you are using. It’s common to get a short SBOM since the SCA tooling is unable to discover the open source in use due to lack of scanning ability or misconfiguration of the tool.

 

What are some questions you can use to gauge completeness?

Do I see artifacts for both my front end and back end in the SBOM or SBOMs. For example, do you see software components written in JavaScript if that is what you were using for your web app’s front end? 

Are you seeing a good list of Java components if you are using Java and Maven for your back end?

Bear in mind, you may have open source components in use that are automatically put on your SBOM through the use of a repository manager, and also have artifacts that are not managed by a repository manager that have to be manually incorporated into a SBOM. For example, you might have explicitly copied the source code for a component into your codebase, or load the library from a remote web location. Both of these cases require manual effect in order to have an accurate SBOM.

Additionally, ask developers to list some of the large open source dependencies they are aware of. Do you see them in your SBOM? If not, this is a very helpful indication that underscanning of some type is occurring.

 

Over Delivery in your SBOM

The completeness issue is closely related to the Overdelivery issue.  Sometimes your team will generate an SBOM that contains far more information than is appropriate for your individual application. This may be because the projects or directories to scan have been over specified. This also may be because you were using a repository technique called a Mono Repo which may contain many unrelated sub projects to the bill of materials that you are expected to deliver. There may be a large directory of third-party artifacts that are required to run every single application in your company, but the project you are concerned about right now, only requires a small percentage of those artifacts. Getting a SBOM for a small part of your MonoRepo may require advanced scanning techniques like Runtime analysis, etc.. in order to best cut away un-related disclosures.

You may find that there are multiple old distributions of your software checked into scan directories wildly inflating the artifact count and including artifacts from long dead versions of the application you’re scanning. This may require pruning or excluding directories scanned by your SCA tooling. Indicators of this issue may occur when you see many multiple copies of the same set of open source libraries differing only in version numbers. The names of the directories that these articles are seen in can also prompt you that this is the issue (e.g. /OldReleases/ or /PreviousVersions)

You may have scanned test or customer data directories that are part of the QA process and are unrelated to the running of your application or may even be inappropriately disclosing customer relationships or data.

You may have scanned artifacts that are related to the building and development environment of your application, which also may be out of scope for your SBOM delivery (though bear in mind, in the future, build and test environments are likely going to be required as part of SBOM deliveries!)

 

Is your SBOM Deep Enough?

Another very common underdelivery in SBOMs is not scanning “deep enough” or ignoring Transitive Dependencies. Transitive Dependencies are the dependencies of the dependencies you explicitly request. For example, you might depend on Component A, which in turn depends on Component B, C and D. These 3 dependencies might not show up explicitly in your Repository Manager configuration files but are resolved at build time and downloaded silently and automatically in the background. Depending on what SCA tool you use, and what settings you have turned on in that tool, you may find yourself not getting a complete list of required third party dependencies. Transitive dependencies may double to 10X your visible use of open source!

 

Have you Resolved All(?) Your Vulnerabilities! 

Now that you have a complete SBOM you will need to examine it for security and compliance problems. Top of mind for many organizations is the vulnerability status of each of the third party dependencies in their SBOM. There are many philosophies and more and more legal requirements in terms of defining how to resolve these vulnerabilities. The simplest process is to update all components so that there are no known vulnerabilities visible in the SCA scan. This may be difficult or impossible to get done in a timely manner, or may be impossible due to lack of available fixes. That said, many customers are going to expect a CVE free SBOM even if it is not possible to do so.

Other philosophies of vulnerability management includes performing runtime or reachability analysis. This means a SCA or similar tool will attempt to see if vulnerable components or buggy subcomponents are actually used or reached during the running of the application. A successful resolution of a vulnerability can be a statement that this vulnerability is not valid for your use case since that code is never used or reached during runtime.

Delivering a Clean SBOM may be possible with additional information explaining why known vulnerabilities do not affect your application. This may be due to not being in reachable code, not valid due to your runtime environment, or due to not being valid vulnerabilities in the first place. This is often the beginning of a discussion with your customer who may have additional questions or even pushback on your opinion. A common way of delivering this information is through a manual spreadsheet or through the use of a VEX document. See https://cyclonedx.org/capabilities/vex/ for more information on VEX.

 

Have you Resolved Any Open Source License Violations?

It is very common to see a large number of Open Source License Violations when running a SCA scan for the first time on a codebase. Some distribution models are more affected by license issues than others. For example, if you are distributing an application to end users or delivering a piece of hardware, there are many open source licenses you need to comply with.

If you are running a piece of software as a Software as a Service (SaaS) model, there is not likely a classic distribution of software, so many of the open source licenses will have no compliance requirements (with some notable exceptions like the AGPL license!)

In the distribution model, are you paying attention to any embedded Operating Systems like Linux?  If you are an IoT or embedded device product, this is extremely important to get correct. 

The most serious license violations are typically issues like GPL violations, where your organization is not complying with the terms of the General Public License (e.g. not sharing your application source code when making a distribution) 

Your organization should create Open Source License Use Policies for each of your distribution models and use cases. In many cases your SCA tool can help with the enforcement of these policies and create reports of policy violations.

Other issues are not creating license notice files, not putting copyright statements in about boxes, and other required attributions.

There may be other legal requirements (that technically may not be open source requirements) but are discovered during this analysis phase. These may be restrictions on certain types of commercial or business use, quasi-commercial terms, or even advertising requirements!

Additionally, you may find Commercial components embedded in your product which contain their own SBOMs and open source usage that may not be discoverable through the use of SCA tools. You in turn may have an open source license compliance, vulnerability and SBOM management conversation with your upstream vendor in order to be compliant with your downstream vulnerability and open source license compliance needs.

Open Source and Commercial Legal compliance is a complex topic and is worth the time to understand what is appropriate for your business and distribution model. Explicit legal advice is often warranted!

 

Putting it All Together

Once you have started using SCA tools, reviewing your SBOMs and then delivering them, you will start exercising a business process that makes future SBOM delivery easier.  One of the biggest causes of stress around SBOM creation and delivery is the fear of the unknown and the lack of knowledge on how to deal with problems. This is a perfect time to create an Open Source Program Office (OSPO) or at least a working group with similar knowledge and responsibilities. Building institutional knowledge on tooling, vulnerability management, open source license compliance and SBOM requirements goes a long way to making your business able to deal with the current and future regulations and contractual obligations regarding SBOMS. Good luck, and get started!

Curl is seen everywhere except your SBOM, why is it missing even though you use it?

What is curl?

curl is an open source command line tool and embeddable library for transferring data over a network. It is one of the most popular and well known open source projects and has over 20 billion installations according to its author Daniel Stenberg. It’s licensed under the curl license which is similar to the MIT license. Its latest version is 8.4.0 as of October 10, 2023 and its hosted on the web at https://curl.se/ 

 

Why are we talking about it?

Recently a high severity vulnerability was reported in the project. This vulnerability is tracked in the National Vulnerability Database using the ID CVE-2023-38545. See https://curl.se/docs/CVE-2023-38545.html  

 

There was a lot of chatter in the lead up to the public release of the vulnerability details, but in the end it affected fewer configurations than the early buzz warranted. That said, much like the log4j vulnerability a few years ago, it could have been possible to have a very serious zero day vulnerability in a widely used open source component.

 

Let’s just look for curl in our SBOM and get on with our day!

Unfortunately, it’s not that easy for components like curl. There are many components that are easily found with scanners and Software Composition Analysis (SCA) tools, but curl is not one of them. It is not easily found due to the programming language it is written in and the languages that are often used to embed it into larger projects.

 

Why does SCA have trouble finding curl?

The most common SCA scanning products these days traditionally look at information provided by repository managers like Maven or NPM. Repository Managers are tools for automatically downloading and installing open source libraries as part of the build process.  SCA tools reformat this repository information into the traditional SBOM formats. Additionally, these SCA tools will add information, such as known vulnerabilities, project health information and updated license metadata. Some languages such as Java, JavaScript, Python and Go have popular repository managers and have SBOMs easily created using quick lightweight SCA scanning.

 

On the other hand, languages such as C and C++ do not commonly use repository managers to handle their third-party dependencies. This means it requires much deeper, slower and sometimes human based analysis in order to discover and manage third-party dependencies. Right now it is very difficult, if not impossible, to get SBOMS when scanning C and C++ applications, especially when being built from source code. 

 

curl and libcurl are very often compiled into C and C++ projects and unless a human explicitly puts them in a SBOM you will not know that they are in use. 

 

Where might curl be hiding?

 

As mentioned before curl is an immensely popular and successful open source project and is embedded in untold thousands of commercial and open source components. It’s also embedded multiple times in almost every Operating System! Let’s walk through some of the most common places you will find curl.

 

Operating Systems: curl is embedded in almost every operating system. Updates to fix the curl vulnerability will almost certainly be released for currently supported versions of these operating systems. Older versions of OSes will likely remain unpatched and potentially vulnerable.

Do you have dedicated devices with operating systems that do not get updated? Do any require manual intervention to upgrade?

 

IoT devices: It is extremely common for IoT devices to have dependencies like libcurl in order to download system updates or other network operations. If no upgrades are available, it may be worth a conversation with the IoT vendor to understand their current SBOM and patch process.

 

Virtual Machines (VMs): VMs are a way of packaging up an entire operating system and a set of applications in order to run multiple virtual computers on a single piece of hardware. A VM looks like a real computer running a standard operating system and will likely have multiple copies of curl and libcurl bundled with the OS, libraries and running applications. You will be unlikely to receive a SBOM for a VM and the applications inside of it. The OS will have one set of dependencies, the required system level services will have another and finally the application will have its own independent SBOM. All of which should be reviewed and updated as needed. If no SBOM is available, use this exercise as the push to make one. 

 

Containers: Containers are a special lightweight method of running applications bundled with all their dependencies. While they are different from a Virtual Machine, it may be helpful to think of them like a VM. It is very common to see curl or libcurl as dependencies in a container, and in fact, this is one of the places where we will see curl automatically discovered and put on a SBOM though container scanning using tools like Syft and Grype (https://github.com/anchore/syft and https://github.com/anchore/grype ). Just because you see one or more copies of curl mentioned in your container’s SBOM, there are likely many other undisclosed copies of curl hiding in the operating system and applications running in the container.  The curl seen in the SBOM is likely system level services explicitly requested by the person who put the container together, but these container scans may only be looking at top level components.

 

Command line tools and Scripts: It is very common for applications to make external calls to command line tools, like the curl command line, in order to perform updates or remote download functions. These dependencies are often overlooked when putting together a SBOM and are almost never found though SCA scanning.

 

Commercial Products and their OSS dependencies: A commercial product may or may not have a SBOM or open source license disclosure. If it does, take a look for curl, libcurl or daniel@haxx.se in the SBOM or open source license file. Again, any disclosed curl may only be one of many actual curl dependencies in a large project. 

 

Open Source Projects: It is very common to see other open source projects use curl for internal network communication and downloads. Sometimes these projects will disclose their use of curl in a SBOM or Open Source license file, but in many cases they will not let end users know.

 

Wrappers for curl in other ecosystems: It is very common for other program language ecosystems to create “wrappers” for curl in the native programming language and ship a compiled version of curl or libcurl to provide the actual functionality.  If you are using a language like Java, Python, Go, etc and you see curl mentioned as an open source project name this project is likely a wrapper from a different group that either depends on a local version of curl or bundles an independent version of curl. These might require separate upgrades for each wrapper, and each system level installation of curl. 

 

Strings that indicate that curl is being used in a product

If you see these strings in a SBOM or Open source License file these are great indicators that curl is being used in a product or project.

Curl

Libcurl

daniel@haxx.se

https://curl.se/libcurl/

http://curl.haxx.se/libcurl/

 

Questions to ask your team to help uncover usage of curl

  • Are we doing any automatic downloads in our product? What tools do we use?
  • Does the system patch or upgrade or update itself? What library is it using to do so?
  • Does the manual or installation instructions mention curl as a dependency or pre-condition for use of the project? 
  • Is the curl RPM (or equivalent) required to install or build the project?
  • Does the product do web scraping or downloading of web site resources? If so, what library is used to perform these functions?
  • Do we see curl or libcurl or any variation of that name in our SBOM or license disclosures?
  • Do we see the email daniel@haxx.se anywhere in our license disclosures?
  • What do we see if we grep for libcurl, daniel@haxx or other curl strings in our codebase?
  • Does our product require a container to run in? Have we run a SCA scan of the container?

 

Use this experience to understand what visibly you are getting with your current SBOMs and SCA scans

Every day we get a better understanding of our use of open source and third party software through the use of SBOMs and SCA scanning. There is still a long way to go before we get complete visibility of every product’s SBOM though. This is due to the newness of this process, the complexity of how software is packaged and delivered, and the limitations of current SCA products. curl is used everywhere, but due to how it is packaged and the programming language ecosystems it is used in, it (and other C/C++ dependencies) is not showing up in the SBOMs we review to keep our companies and projects safe and updated. Use the questions in this guide and the areas where tools like curl might be found to help understand the current weaknesses in SBOM completeness and to get ahead of the next vulnerability!

 

You are the dog that caught the car: Handling the SBOM you asked for!

 

 

We all think we have more time.

 

One way or another, you are going to soon receive an email telling you that the Software Bill of Materials (The SBOM!) that you asked for is ready for you. Maybe it’s coming from a Vendor, maybe it’s an internal project, maybe it’s from your own team. You suddenly have a very important document to review, and it’s hard to even know where to begin. 

 

A big cause of paralysis in the security world is not knowing what to do next, especially with a seemingly buzzword heavy issue like this. Certainly everyone else knows what they are doing, but where do I even start?

 

I’m going to walk you through the basics of SBOMs, how to view them, some good first questions and where to go next.

 

What did you just receive?

 

A Software Bill of Materials (SBOM) is a listing of the third party software components that a software project uses in order to function. Typically this list of components consists of open source packages and libraries, but may also contain Commercially licensed components and possibly components with licenses that are neither Open Source or Commercial and may require further review.

 

We use this list of components to understand the software dependencies of this project, use it to identify potential security vulnerabilities, find end of life and unsupported software components, discover components that can be supported with money or software contributions, as well as discover other architecture or support issues.

 

This listing should include at least the name of the software component, its version and possibly information about the license it is released under. Beyond these basics, you may find additional pieces of information for these components such as Project URL, description, known vulnerabilities, etc. Since exchange formats and OSS scanners are still being defined, you may find yourself with varying levels of disclosure with varying levels of data quality and completeness.

 

The first thing to get a handle on is what type of SBOM you have received. There are a few different file formats and mechanisms for SBOM sharing, and more appearing every day.

 

What type of formats might you have received?

 

There are three main formats for SBOM sharing right now. CycloneDX, SPDX and free text files of varying complexity and origin. Typically, these files are designed to be both human and machine readable though it seems like the machines often have an easier time of it!

 

CycloneDX (https://cyclonedx.org/) is a file format created by the OWASP Foundation. You will know you have a CycloneDX file if your partner tells you that that is the format they will be giving you or possibly if the filenames are bom.json, bom.xml or end with .cdx.json or .cdx.xml

 

SPDX (https://spdx.dev/) is a file format created through the Linux Foundation. You will know if you have a SPDX file if your partner tells you that is the format they will be giving you or if the file name is similar to the following .spdx, .spdx.json or .spdx.rdf.xml.  

 

Free Text, CSV or Excel Files are traditional text or spreadsheet files that contain SBOM information in one-off or less common SBOM formats. They may be created by a tool or a human and are often designed for human review instead of computer processing. 

 

All of these file formats will contain information about the software components in use, many of the files will be “self documenting” meaning they will have Field Names (like Component Name or Version) near the data you are reading, or in a traditional spreadsheet format will have Column Names for each piece of data. 

 

In a JSON, XML or free text file, component data often is spread out over multiple lines of text.

 

In a spreadsheet, each row is often a single component, where each column is the component’s metadata (e.g. name, version, etc…)

 

How to view and process the SBOM

 

The easiest way to get insights from the SBOM you just received is to run it through a SBOM scanner tool like Bomber. Bomber is a free and open source tool that can provide information about known vulnerabilities and license information for the open source components found in the supplied SBOM. Bomber can handle CycloneDX files in either JSON or XML format, SPDX SBOMS in JSON format, as well as Syft JSON SBOM files. If you have a file in a different format, you can use a free tool to convert it to one of these, or request your partner to resubmit it in a format you can handle.

 

See https://github.com/devops-kung-fu/bomber for installation and usage information. If using command line tools is new to you, this might be a perfect time to call one of your developers to work together.

 

Examining a SBOM file by hand (if not using a tool like Bomber)

While using a tool is much easier, it is possible to examine the SBOM files using a text editor and picking it apart by hand.

 

When looking at the JSON or XML files themselves in a text editor, you can find the component name, URL, version information and license information. For example, in CycloneDX the following tags are found near the information of interest:

 

“name”   (the component name)

“version”  (the component version)

“bom-ref” (the URL or similar locator for the component in question) 

“license”  (The license or license options for the component)

 

The license tag may be after a stretch of “hashes” or IDs used to describe the files that make up the component. 

 

By using this information a web search can be used to find out vulnerability information. For example if you found Struts 2.3.31, you could do a web search using the terms “Struts 2.3.31 cve” and find out that this version of the component is affected by the vulnerability known as CVE-2017-5638 ( See https://nvd.nist.gov/vuln/detail/CVE-2017-5638 )

 

CycloneDX

 

For a deeper description of the CycloneDX SBOM format see https://cyclonedx.org/guides/sbom/OWASP_CycloneDX-SBOM-Guide-en.pdf

 

SPDX

 

For a deeper description of the SPDX SBOM format see https://spdx.github.io/spdx-spec/v2.3/

 

Many of these SBOM documents can be read in a standard text file viewer, or in a worst case, a Word Processor application. If the file is jumbled together or is in one long line, you will want to explore finding a more powerful text viewer that can better handle line breaks or special characters. Free tools like Visual Studio Code ( https://code.visualstudio.com/ ) can view Text, JSON and XML files. You may need to reformat the text if it is all in one line or jumbled together. In Visual Studio Code, Go to the Command Palette and select Format Document. The file should be more readable to a human now.

 

There exist JSON and XML file viewers which can make these files prettier to see and more useful to search or view.

 

Additional utilities are being released to support the use and viewing of SBOM documents in CycloneDX and SPDX formats.

 

A CSV or .XLS document can be opened in a spreadsheet application like Excel or Open Office.

 

Now That You Can View the SBOM What are you looking for?

One thing you can do is put it in a drawer! The very act of asking for an SBOM does a lot to kick the vendor into managing their third party risk. While this process works best if you ask them questions or give some pushback, asking for an SBOM allows them to say internally “our customer is asking for this, we need to do SBOM generation, SCA scanning, OSS Patch management, etc…

 

That said, you have it, let’s go get some value out of it!

 

This might be the point to bring in a developer if you are not familiar with open source libraries. There are a few things you can do on your own or you might find it is helpful to work together to understand the SBOM you just received.

 

There are a few questions we use SBOMs to help answer when looking at a piece of software

 

  • Does the SBOM seem legitimate? Can you view it, read through it, see real data?
  • Is it recently created? When was it generated? What version of the project was scanned? (e.g. is the SBOM wildly out of date?)
  • Does the SBOM only contain “Top Level Dependencies” or does it include the dependencies of those dependencies, also known as Transitive Dependencies? This could be a difference of 3-10 times the number of actual dependencies seen!
  • Can you find a few well known open source components and check their versions against the National Vulnerability Database (NVD) ( https://nvd.nist.gov/vuln/search )
  • Are there well known vulnerabilities in the codebase (old versions of Log4j, OpenSSL, Apache HTTPServer, Apache Tomcat, Apache Struts) 
  • Does the list of component versions seem “too old”? Are all reported vulnerabilities from years ago (e.g. CVE-2017-5638 in Struts) 
  • Are there Open Source Licenses that might cause a problem for you? (Do you see licenses like the General Public License or Affero General Public License which might be contrary to your company’s license policy. This can be complicated since some parts of your company may happily use GPL software in Linux Operating Systems but may forbid it in distributed applications)  
  • Does it seem complete? Is it missing important information like version information? 
  • What software languages are seen? Do you see what you expect? Java libraries? NPM libraries? Is something missing?

 

 

 

Pushing back or Requesting More Information

After processing or examining the SBOM you may have some questions or feedback for the team that supplied it to you. Typically you might request more information about the highest security vulnerabilities or license issues found in the report. There’s a lot of discussion about how customers and suppliers can work together best to share and respond to SBOM and vulnerability questions. In general, especially if this is your first experience with SBOMS, you might find the most value in letting your supplier know you’ve run the SBOM through a vulnerability tool and you have some questions about what you are seeing. The idea is to gently (and perhaps later on, not so gently) work with your partner to reduce exposure to known vulnerabilities, as well as better provide their customers or end users with an explanation of why they are or are not affected. As you get more experience, you may find that providing 3 to 5 clear concerns can help your partner start to get a handle on your expectations, as well as chip away at the worst problems. For example, if you see that the software contains old high severity vulnerabilities in Log4J, Curl, OpenSSL etc,, this might be a sign that they have not been using SCA scanning or good vulnerability management practices.

 

Feedback from the Supplier

 

In general, throwing a list of 100s of problems back to a vendor will not be well received, especially if you are new to SBOM reviews. That said, getting feedback on 5-10 of the worst of the worst can give you a good feeling if they are managing their supply well or not.  Many vulnerabilities may be present in an open source library, but not affect the software as you use it. A company should be able to clearly explain why they think they are not affected. “Trust me Bro!” is usually not a satisfying answer though. There should be clear explanations. For example, a good answer might be something like “This reported CVE only affects this component when run under the Windows operating system, and in this case we are using Linux”. 

 

As time goes on the SBOM you receive from this supplier should contain fewer vulnerabilities, a more complete listing of third party dependencies, as well as explanations on why potential vulnerabilities seen in the codebase are not valid for their current usage.

 

 

Keep Requesting High Quality SBOMS

 

As mentioned before, one of the best side effects of requiring a SBOM to be delivered to you is that the team responsible for creating the SBOM will now put in place Software Composition Analysis (SCA) scanning tools, CVE/Vulnerability Patch Management, and processes in place to create/fix/deliver up to date SBOM information to you. A better understood product is a more secure product. The more SBOMS you see, the more that quality issues will pop out to you. Keep reviewing and keep giving and demanding strong feedback!



Open Source License Location Alignment Chart

 

Text Version:

 

Open Source License Location Alignment Chart

 

Where’s the open source license?

Lawful Good: SPDX Identifier at the top of each file
Lawful Neutral: in a LICENSE file at the top level of the source tree
Lawful Evil: at the bottom of each file

Neutral Good: on the project’s home page
Neutral Neutral: on the project’s Wikipedia page
Neutral Evil: as a reply to a GitHub issue asking for the license text

Chaotic Good: available as output of a python script
Chaotic Neutral: author states no license applies since code was written in a country with no copyright law
Chaotic Evil: in a scanned image in a TIFF file only found on the WayBack Machine

 

 

A proposal for Comment Tagging AI Generated Source Code

A proposal for Comment Tagging AI Generated Source Code

Source code generated by “AI” tools like GitHub CoPilot or OpenAI ChatGPT should prepend a language appropriate comment block explaining that the source code was generated by a tool as well as helpful metadata to allow discovery and management of that code by code scanners like SCA or SAST tools.

End users would obviously have the ability to remove this comment block, though I believe we all would be well served by marking all generated code with comments detailing the tool that generated it, the version or date generated, as well as inputs that might be helpful in understanding the conditions that caused this code to be generated.

AI Generated code, while still a new universe, appears to have a series of potential defects that existing and future

code scanners will need to be on the look out for. These include code hallucinations, missing cases, confidently wrong constants/algorithms, concerns around the license of the generated code, and other issues we still have not discovered.

Having a clear machine readable key that indicates that this code was AI generated allows for appropriate scanning, filtering, as well as metrics generation.

Code or data generated by AI based tools may have different standards of trust than code or data created or curated by human authors.

Parallels to SCA Snippet Analysis

There are parallels in Software Composition Analysis (SCA) snippet scanning world where awareness of generated code is very helpful when scanning or clearing scan results.

In the snippet analysis world, generated code is extremely similar to massive amounts of other open source code generated by the same tool. Therefore, performing snippet matching is often slow and resource intensive due to the sheer amount of similar snippets. This causes user pain due to slow scanning as well as a perceived large amount of “false positive” matches. There is also a belief that this generated code is “fine” which means it is often incorrectly ignored when it comes to SCA/SAST scanning due to the above issues.

Code generated by traditional non-AI code generators like the .NET IDE, Antlr, Apache MyBatis, protobuf, etc.. often tag their generated code with special comment strings and tags.

This allows SCA tools or SCA tool users to either ignore snippet matching for these files before scans are performed, automatically bucket or filter results afterward, or allow the end user to manage the results quickly through string matching.

One issue with these code generators is that the tags used are not standardized and require multiple methods to discover. The identifying strings include XML fragments, strings, JavaDoc style tags, custom tags, etc…

Future SCA/SAST tools can be even more nimble as they become more aware of the possible code generators that exist and perform appropriate scanning methods to the generated code depending on what needs to be discovered.

 

Qualifications of a good “generated by” comment

  • Easy to parse by machine (oh the irony!)
  • Easy to read and understand by a human
  • Not too wordy so it will be left in place by the end user
  • Not too wordy so that code generators decide to use it
  • Explains what tool generated the code using a unique name
  • Provides a version number or generated date so that eras of similar code can be examined with appropriate tools
  • Does not change too quickly so that code generated by the same tool can easily be found with simple pattern matches or even greps

Future extension

In the future, the user text prompt that caused the code to be generated should be embedded as well

Current AI code outputs are typically single pages and should therefor have a single line comments.

Future code generators will generate entire applications and should have a larger banner with more details explaining the user prompts that generated the application.

A tool URL or project home URL (e.g. @generatorURL ) could be optionally used to prevent naming confusion and/or provide easy branding or publicity for the various tools

Current Proposal:

// @generatedNote This code was generated by a AI code generator tool.
// @generatedBy CoolAIGenerator v1.2.3

 

Examples of current comments from non-AI Code Generators:

https://github.com/VarathaRamanujam/EEE_LOGIN_VALIDATION/blob/fc3787e3acfd302cce5abfd6bbb5b8abf2bef72c/src/main/java/com/hider/eee_students_login/Filemodel.java

/*
* Created on 2022-11-27 ( 18:26:59 )
* Generated by Telosys ( http://www.telosys.org/ ) version 3.3.0
*/

https://github.com/koo5/alertmanager_api_js/blob/7227cc6e1854fe29f45d011584ba563d385a5fdb/src/model/Alert.js

/**
* Alertmanager API
* API of the Prometheus Alertmanager (https://github.com/prometheus/alertmanager)
*
* The version of the OpenAPI document: 0.0.1
* 
*
* NOTE: This class is auto generated by OpenAPI Generator (https://openapi-generator.tech).
* https://openapi-generator.tech
* Do not edit the class manually.
*
*/

https://github.com/Brayds-Dev/Merchandiser-tool/blob/381d9e83698d35bfc10312ec195788cb06c31e39/MerchandisersTool/MerchandisersTool/obj/Release/netstandard2.0/Views/UpdateClientInfo.xaml.g.cs

//------------------------------------------------------------------------------
// <auto-generated>
// This code was generated by a tool.
// Runtime Version:4.0.30319.42000
//
// Changes to this file may cause incorrect behavior and will be lost if
// the code is regenerated.
// </auto-generated>
//------------------------------------------------------------------------------

https://github.com/wq3426/study/blob/e9830a44a3c9e7314fd1f3974b7d68c0eeafc1a6/spring_boot_project/bwmTools2/src/main/java/com/dhl/tools/domain/CargoLocationData.java

/**
*
* This class was generated by MyBatis Generator.
* This class corresponds to the database table CargoLocation_Data
*
* @mbg.generated do_not_delete_during_merge
*/

https://github.com/cqym/cut_tools2/blob/64a447205ae8a9a5c6089097f966c9c3b786357c/development/src/com/tl/resource/dao/pojo/TQuotationProductDetail.java

/**
* This field was generated by Apache iBATIS ibator. This field corresponds to the database column t_quotation_product_detail.id
* @ibatorgenerated Wed Oct 14 14:13:27 CST 2009
*/

https://github.com/thaovy2902/Web_KTMT/blob/ffcc022506b4b5635e67bc046f01a3ed80bb3605/vendor/google/cloud/CommonProtos/src/Audit/RequestMetadata.php

# Generated by the protocol buffer compiler. DO NOT EDIT!
# source: google/cloud/audit/audit_log.proto

https://github.com/mehmetsen80/xtalk/blob/cf88a0025b44a9e4ee0055457c565dbf30c05e1e/MSOutlookRecurrencePatternType.h

/*******************************************************************************
**NOTE** This code was generated by a tool and will occasionally be
overwritten. We welcome comments and issues regarding this code; they will be
addressed in the generation tool. If you wish to submit pull requests, please
do so for the templates in that tool.
This code was generated by Vipr (https://github.com/microsoft/vipr) using
the T4TemplateWriter (https://github.com/msopentech/vipr-t4templatewriter).
Copyright (c) Microsoft Corporation. All Rights Reserved.
Licensed under the Apache License 2.0; see LICENSE in the source repository
root for authoritative license information.
******************************************************************************/

 

 

I’m speaking at Open Source 101 on 3/30 at 2:45pm-4:30pm ET

Are you thinking about selling your company? Are you building a software product? Do you lead an engineering team? Want to jumpstart your knowledge of Open Source and its impacts on security, compliance and valuation?

 

Join my workshop “Just Enough Open Source: A Kick start on security, license compliance and business models” Tuesday March 30th, 2021 from 2:45pm – 4:30pm ET.

 

 

Talk description:

“Open Source powers the world, but you need to do more than download it

In this talk we will provide background on the most common types of open source licenses, business models, security issues and most importantly the processes required to help you remain secure and in compliance. We will discuss best practices, scanning tools, remediation, customer and partner expectations around OSS compliance and how to manage OSS during events such as a product release or M&A.”

FREE registration link: https://opensource101.com/register-now/

My Open Source Talk at All Things Open 2020

Recently I was asked to put together a Workshop on Open Source Licensing and Security for All Things Open 2020. I always love these types of talks since it gives you a chance to both kick start people new to the topic and also give a current overview to people who have been doing this role for a while. 

 

The video below has two sections, the first gives an onramp and introduction to open source licensing, and the second half discusses more of the day to to day operations of open source compliance and security. 

 

I’d love to hear your feedback!

Your code will outlive you! Will the Future be able to use it?

Every decade or so the technology world gets punched in the face by a problem requiring poring through massive amounts of code written well before many of us were born.

In the late 1990s it was the Y2K problem when two digit years were no longer sufficient.

 

In the 2008 financial crisis COBOL based systems required hand editing in order to change state employee salaries en mass.

 

In 2020 we had the Pandemic related economic impact requiring Bank and Employment system’s code to be modified, many of which again were written in COBOL!

 

In 2032 we’ll have the Y2k38 problem when the Unix time will overflow causing software issues akin to Y2K.

 

While many of these systems were closed source and proprietary, open source systems are starting to dominate the software landscape. Much like your grandparents’ hammer, these examples show us that useful tools will almost always outlive the person who first selected or created them.

 

Besides the question of maintainability and programmer experience with very old languages like COBOL, questions of intellectual property and software licensing will complicate the usability of open source software over the next 50 years and beyond.

Think about Licensing!

As part of software due diligence (when a company purchases another company and confirms that the source code they are buying is correctly owned, licensed and documented) I have had the experience of trying to track down the ownership and licensing of code decades old. This type of software archeology requires access to archives of source code, books, magazines, blogs and other places that programmers have published software over the last 60 years! In many cases the true origin of some source code is lost to time, or can be only partially known.

 

Questions such as “Who wrote this?”, “Did they expect others to freely use this source?”, or “Does a commercial company own this?” are common.

 

It is important to make these types of answers clear for those who come after us.

 

The most important of these is to specify an open source license for the code you are publishing, even if it’s just a “single page” or block of code. If it’s worth putting on the Internet, it’s worth telling people what its license is.  There are many suggestions on how to label code to make the copyright and license clear, but I strongly suggest that each file contains a copyright statement and at least a SPDX license identifier (See https://spdx.org/licenses/)

 

This allows someone in the distant future to know who wrote the source code and what the obligations are even if only a single file remains.

Who do you depend on?

Document all your third party dependencies, including dependencies of dependences. There is no guarantee that any of our current repository managers will still be working decades in the future, but your code may be. By listing these dependencies, you help the Future build and run your code.

 

In a similar vein, your build system and running environment should be documented as well. For example, if you depend on a certain make system or database to be installed, call these out in separate build and running environment documentation. 

Till Death Do Us Part

In the short term, understand that source code is considered property. What happens after you die should be clearly specified. In most places the ownership of your code will pass on to your heirs, but possibly with complicated and divided ownership.  Do you want anyone in particular to be the new code owner (or someone outside of your family)? In this case your will (or related documents) should make this clear.

 

While everyone should have the permissions available under your open source license for the duration of your copyright, you may wish your heir to have the ability to “own” the code just like you do.

 

By making the ownership clear, they will then have the ability to change the license for your code, just like you likely do now. This means they may have the permission to also sell commercial licenses to this code, or change the open source license of the project. An open source project with multiple contributors has additional concerns about ownership. You may need to make clear dividing lines between projects you own outright as opposed to projects you contribute to, or have others contribute to.

 

Do you want to change the license after your death, or after a certain amount of time?  Make these changes clear as well. Some may want to open previously closed source, or change to a Creative Commons Zero (CC0) license or Public domain declaration.

 

Similar questions may come up in the case of divorce. While it’s often clear who owns code you write when “on the clock at work”, the code you write at home may have complex ownership issues.

Go beyond the code!

An additional thing to consider is account access, logins, domain names and payments.

Typically we tell everyone to keep their account information private and secure. This may be at odds with your desire to keep your project going even after your death.

Keeping a list of domain names, third party services and other account information related to your project can help the heirs to your code keep the project going.

Bear in mind that, after your death, certain accounts may be locked, go away or may be controlled by people other than the code owner.

 

As with most things involving intellectual property and life events, it is best to consult a lawyer to understand your best options.

Rest in Peace!

A little care and effort in the present can save the community a significant amount of time in the future. By specifying a license, documenting project dependencies, and clearly transferring ownership you can make sure your code stands the test of time.