Industry Data Is Essential and Yet It’s Often Inaccurate. Here’s Why.

By Pam Wu

It’s hard to predict a business’s risk without knowing its industry. Suppose a business called Self-Sufficient Coffee Co. applied for group accident insurance. If you were the insurer’s risk officer, you would naturally want to know whether the applicant mass-produces coffee, serves coffee, or both.

Industry data provides insight into how a business makes money: for example, whether it sells goods or services rather than manufacturing goods. This data helps financial institutions make smarter decisions about whether to engage with a small business, as well as how to market to, underwrite, and onboard the small businesses they serve.

Although industry classification is essential, accurately defining a business’s industry remains challenging. Why? In general, industry data is complex, and the granularity of available industry data varies significantly. In this post I’ll break down why accurate industry classification has been hard to achieve, and why the result has been years of inaccurate or vague industry data.

Breaking down the challenges of industry classification

Selecting a business’s industry is surprisingly difficult.

When we ask for industry data, we’re really looking for insight into how the business operates and how it makes money.

Consider our hypothetical coffee shop, Self-Sufficient Coffee Co., a local hangout that sources beans, roasts in-house, and sells coffee both by the cup and in take-home bags. In the morning, the shop bakes its own pastries to accompany the coffee it sells. Depending on the activity you focus on, Self-Sufficient could be classified as a specialty drinks business (NAICS 722515, Snack and Nonalcoholic Beverage Bars) or as a coffee manufacturer (NAICS 311920, Coffee and Tea Manufacturing). Which is correct?

One way to define a business’s primary industry is to use a quantitative standard: select the activity that generates the most revenue or the most profit. The two measures can be at odds with each other. The coffee bar could bring in the most revenue, but because coffee roasting can produce large volumes of beans that are sold to other stores, roasting could actually earn a higher profit than the bar.
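To make the disagreement concrete, here is a minimal sketch. The dollar figures are invented purely for illustration, and `Activity` and `primary_activity` are hypothetical names, not part of any real classification tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    naics: str
    revenue: float  # invented figure, for illustration only
    costs: float    # invented figure, for illustration only

    @property
    def profit(self) -> float:
        return self.revenue - self.costs


def primary_activity(activities, key):
    """Return the activity that scores highest on the chosen measure."""
    return max(activities, key=key)


activities = [
    Activity("coffee bar", "722515", revenue=400_000, costs=360_000),
    Activity("coffee roasting", "311920", revenue=250_000, costs=150_000),
]

by_revenue = primary_activity(activities, key=lambda a: a.revenue)
by_profit = primary_activity(activities, key=lambda a: a.profit)

print(by_revenue.naics)  # 722515: the bar brings in the most revenue
print(by_profit.naics)   # 311920: roasting earns the most profit
```

With these made-up numbers, the same business lands in two different industries depending on which measure you pick.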

These figures also change over time: if the manufacturing side expands, the coffee shop’s industry may change to manufacturing, even if it’s still better known as a coffee bar. Under a purely quantitative standard, intangible factors such as what a business is known for, or even what the owner intends the business’s industry to be, are disregarded. All of this can make it difficult to validate industries from the outside. Even if you were a regular customer of the establishment itself, how would you know that this coffee shop had quietly switched industries unless you had access to its financial statements?

Classification systems require tough decisions.

NAICS is an industry classification system that uses six-digit codes (each digit 0 through 9) to convey taxonomic information about an industry. The first two digits identify the sector, such as manufacturing, retail, wholesale, or construction, which also tells you broadly whether the business produces goods or provides services. Each of the third through sixth digits breaks the previous level’s definition into finer sub-categories. Often the sub-category numbered 9 ends up being miscellaneous (as an example, NAICS 713990, All Other Amusement and Recreation Industries, simultaneously encompasses laser tag, dance halls, and riding stables). This system makes sense on paper, but reality often throws a curveball.
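As a rough sketch of how the digits nest, the snippet below splits a code into the five levels of the published NAICS structure (sector, subsector, industry group, NAICS industry, national industry). The helper name `naics_levels` is hypothetical; it only slices the string and does not check that the code actually exists.

```python
# Level names follow the published NAICS structure; each level is a
# prefix of the full six-digit code.
LEVELS = (
    ("sector", 2),
    ("subsector", 3),
    ("industry_group", 4),
    ("naics_industry", 5),
    ("national_industry", 6),
)


def naics_levels(code: str) -> dict:
    """Split a six-digit NAICS code into its nested prefixes."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError(f"expected a 6-digit NAICS code, got {code!r}")
    return {name: code[:length] for name, length in LEVELS}


# Note: a few sectors span a range of two-digit prefixes
# (manufacturing covers 31 through 33), which this sketch ignores.
print(naics_levels("722515"))
```

For 722515 this yields the prefixes 72, 722, 7225, 72251, and 722515, one per level.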

You may have noticed something in the coffee shop example above: one of the example NAICS codes began with 311 and the other with 722. In the taxonomy, NAICS 311920 (Coffee and Tea Manufacturing) sits closer to NAICS 311811 (Retail Bakeries) than to NAICS 722515 (Snack and Nonalcoholic Beverage Bars), even though “snack” is an umbrella concept that covers baked goods and “nonalcoholic beverage” is one that covers coffee.
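One simple way to express “closer in the taxonomy” is the number of leading digits two codes share. This is only an illustrative measure, and `shared_prefix_len` is a hypothetical helper, but it shows why the roaster ends up next to the bakery rather than the coffee bar.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Count how many leading digits two NAICS codes have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


roaster = "311920"     # Coffee and Tea Manufacturing
bakery = "311811"      # Retail Bakeries
coffee_bar = "722515"  # Snack and Nonalcoholic Beverage Bars

print(shared_prefix_len(roaster, bakery))      # 3: same subsector (311)
print(shared_prefix_len(roaster, coffee_bar))  # 0: different sectors
```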

Depending on how you think about industry, this may or may not sound intuitive. That’s because an industry is defined by a number of factors, including: its inputs and outputs (e.g., a bakery takes flour, water, yeast, etc. and converts them to bread and pastries), how the input gets converted to the output (e.g., by machine or by hand), how the output gets delivered to the customer (sold over the counter, transported to another location, etc.), and what kind of customer the business serves (retail establishments, direct consumers, banks, etc.). There are even factors that most industry classifications don’t account for, such as the target market segment. Each of these factors can connect companies together, but if you want to make a taxonomy, you have to decide which factors will split the categories earlier and which ones will split later.

Data granularity is highly variable.

Every use case for industry data requires a different amount of detail, also known as “granularity.” It’s challenging to supply industry data that offers enough insight for most users without becoming overwhelming. There are two ways to think about granularity: the detail of industry-level data available and the detail of industry-level data needed for a specific use case.

Again, let’s consider Self-Sufficient Coffee Co. If you’re determining whether to lend to the coffee shop, it may be enough to know that they are in the food services industry because you believe that food services businesses perform well enough to qualify for a loan. However, if you’re determining whether to insure the coffee shop, Self-Sufficient’s in-house roasting activities will likely affect your risk evaluation.

This desired level of industry granularity still has to contend with what data is available, presenting potential trade-offs between accuracy and granularity. If you require a more detailed level of granularity about a business’s industry than a data provider can offer, the provider’s accuracy becomes irrelevant. Likewise, the specificity of a provider’s industry data doesn’t mean anything if the provider can’t deliver accuracy at that level of granularity as well.
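One way to see the trade-off is to score the same set of predictions at several levels of detail. The sketch below uses a handful of invented predictions (not results from any real provider) and a hypothetical helper, `accuracy_at`, which counts a prediction as correct when its first few digits match the true code.

```python
def accuracy_at(digits: int, predicted: list, actual: list) -> float:
    """Share of predictions whose first `digits` digits match the true code."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    hits = sum(p[:digits] == a[:digits] for p, a in zip(predicted, actual))
    return hits / len(actual)


# Invented example data, for illustration only.
actual = ["722515", "311920", "311811", "713990"]
predicted = ["722513", "311920", "311812", "713210"]

for digits in (2, 4, 6):
    print(digits, accuracy_at(digits, predicted, actual))
# 2 1.0
# 4 0.75
# 6 0.25
```

In this toy example the predictions are perfect at the sector level but right only a quarter of the time at the full six digits, which is why accuracy figures only mean something alongside the level of granularity they were measured at.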

A new option for industry classification data

We want to change how industry codes are being predicted for small and medium businesses across the U.S. Our classifiers deliver 2–3X better precision than incumbent data providers. Combining this accuracy with high levels of granularity when needed means we’re able to provide more powerful industry data than what’s been available to date.

We’ve detailed our approach to industry classification data in this earlier blog post, but you’re also welcome to explore Enigma’s industry classification data for yourself via our API.