What is data mining? Finding patterns and trends in data

Data mining, sometimes called knowledge discovery, is the process of sifting large volumes of data for correlations, patterns, and trends.

Data mining definition

Data mining, sometimes used synonymously with “knowledge discovery,” is the process of sifting large volumes of data for correlations, patterns, and trends. It is a subset of data science that uses statistical and mathematical techniques along with machine learning and database systems. The Association for Computing Machinery’s Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) defines it as the science of extracting useful knowledge from the huge repositories of digital data created by computing technologies.

The idea of extracting patterns from data is not new, but the modern concept of data mining began taking shape in the 1980s and 1990s with the use of database management and machine learning techniques to augment manual processes.

Data mining vs. data analytics

The terms data analytics and data mining are often conflated, but data analytics can be understood as a subset of data mining.

Data mining focuses on cleaning raw data, finding patterns, creating models, and then testing those models, according to analytics vendor Tableau. Data analytics, on the other hand, is the part of data mining focused on extracting insights from data. Its aim is to apply statistical analysis and technologies to data to find trends and solve problems.

The business value of data mining

Data mining is used at companies across a broad swathe of industries to sift through their data to understand trends and make better business decisions. Media and telecom companies mine their customer data to better understand customer behavior. Insurance companies use data mining to price their products more effectively and to create new products. Educators are now using data mining to discover patterns in student performance and identify problem areas where students might need special attention. Retailers are using data mining to better understand their customers and create highly targeted campaigns.

Data mining use cases include the following:

Catholic Relief Services (CRS) is using data collection and machine learning to help it provide humanitarian relief around the world. It has developed Measurement Indicators for Resilience Analysis (MIRA), a high-frequency data collection protocol that gathers information about weather-related “shocks” to communities in southeastern Africa. It feeds the data into machine learning algorithms to determine which households will be at risk of food shortages because of those shocks.
Bank of America is using data mining, machine learning, and AI to more accurately identify investors for initial public offerings (IPOs). It has created Predictive Intelligence Analytics Machine (PRIAM), an AI deal prediction system that uses a network of supervised machine learning algorithms to understand relationship trends between equity capital markets (ECM) bankers and investors.
Mortgage processor Ellie Mae is using data mining on ransomware attacks to help it identify indicators of compromise (IOC). Those IOCs are combined with threat intelligence, predictive analytics, and AI to power the company’s Autonomous Threat Hunting for Advanced Persistent Threats project.

Data mining techniques

Data mining uses an array of tools and techniques. According to data integration and integrity specialist Talend, the most commonly used functions include:

Data cleansing and preparation. Before data can be analyzed and processed, you need to find and remove errors and identify missing data.
Artificial intelligence (AI). Data mining frequently leverages AI for tasks associated with planning, learning, reasoning, and problem solving.
Association rule learning. Also known as market basket analysis, these tools are used to search for relationships among variables in a dataset. A retailer might use them to determine which products are typically purchased together.
Clustering. Clustering is used to partition a dataset into meaningful subclasses to understand the structure of the data.
Data analytics. Data analytics is the process of extracting insight from data.
Data warehousing. A data warehouse is a collection of business data. It’s the foundation of most data mining.
Machine learning. Machine learning helps automate the process of finding patterns in your data.
Regression. This technique is used with a particular data set to predict numeric values like sales, temperatures, or stock prices.
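
As a rough illustration of association rule learning, the sketch below counts how often pairs of items appear together and reports support (the share of transactions containing both items) and confidence (the share of transactions containing the first item that also contain the second). The transactions, the pair_rules helper, and the 0.4 support threshold are all invented for this example. Production tools typically rely on algorithms such as Apriori or FP-Growth and handle itemsets larger than pairs.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]


def pair_rules(transactions, min_support=0.4):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # skip pairs that appear together too rarely
        # Each frequent pair yields one rule in each direction.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Highest confidence first; ties broken alphabetically.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


rules = pair_rules(transactions)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, every basket containing butter also contains bread, so the rule "butter -> bread" has a confidence of 1.0, while "bread -> butter" is weaker because bread is also bought without butter.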

Data mining process

The Cross Industry Standard Process for Data Mining (CRISP-DM) is a six-step process model that was published in 1999 to standardize data mining processes across industries. The six phases under CRISP-DM are: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Business understanding

This phase is about understanding the objectives, requirements, and scope of the project. It consists of four tasks: determining business objectives by understanding what the business stakeholders want to accomplish; assessing the situation to determine resource availability, project requirements, risks, and contingencies; determining what success looks like from a technical perspective; and defining detailed plans for each project phase along with selecting technologies and tools.

Data understanding

The next phase involves identifying, collecting, and analyzing the data sets necessary to accomplish project goals. It also comprises four tasks: collecting initial data, describing the data, exploring the data, and verifying data quality.
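
The describing and quality-verification tasks can be approximated with a small profiling pass. The following is a minimal sketch, assuming records arrive as a list of dictionaries with None marking a missing value. The sample records and the profile and count_duplicate_rows helpers are hypothetical and not part of CRISP-DM itself.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"customer_id": 1, "age": 34, "region": "north"},
    {"customer_id": 2, "age": None, "region": "south"},
    {"customer_id": 3, "age": 51, "region": None},
    {"customer_id": 3, "age": 51, "region": None},  # exact duplicate row
]


def profile(records):
    """Describe each column: number of missing and distinct values."""
    columns = sorted({key for row in records for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in records]
        present = [v for v in values if v is not None]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


def count_duplicate_rows(records):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in records:
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


report = profile(records)
duplicates = count_duplicate_rows(records)
for column, stats in report.items():
    print(column, stats)
print("duplicate rows:", duplicates)
```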

Data preparation

This is often the biggest part of any project, and it consists of five tasks: selecting the data sets and documenting the reason for inclusion/exclusion, cleaning the data, constructing data by deriving new attributes from the existing data, integrating data from multiple sources, and formatting the data.
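
A compact sketch of the cleaning, constructing, integrating, and formatting tasks follows. It assumes two hypothetical sources (customers and orders) joined on a customer_id field. The field names, the prepare function, and the rule of dropping rows without a signup date are illustrative choices, not a prescribed method.

```python
from datetime import date

# Hypothetical source data, invented for illustration.
customers = [
    {"customer_id": 1, "name": " Ada ", "signup": "2021-03-01"},
    {"customer_id": 2, "name": "Grace", "signup": None},
    {"customer_id": 3, "name": "Linus", "signup": "2023-07-15"},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 3, "amount": 45.5},
]


def prepare(customers, orders, as_of):
    """Build one modeling-ready row per customer with a known signup date."""
    # Integrate: total order amount per customer from the second source.
    totals = {}
    for order in orders:
        cid = order["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + order["amount"]

    prepared = []
    for customer in customers:
        # Clean: drop rows that lack a required field.
        if customer["signup"] is None:
            continue
        signup = date.fromisoformat(customer["signup"])
        prepared.append({
            "customer_id": customer["customer_id"],
            # Format: normalize stray whitespace in text fields.
            "name": customer["name"].strip(),
            # Construct: derive a new attribute from an existing one.
            "tenure_days": (as_of - signup).days,
            "total_spend": round(totals.get(customer["customer_id"], 0.0), 2),
        })
    return prepared


rows = prepare(customers, orders, as_of=date(2024, 1, 1))
for row in rows:
    print(row)
```

Whether to drop incomplete rows or impute the missing values depends on the project. Either way, CRISP-DM calls for documenting the reason for each inclusion or exclusion.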

Modeling

Building models from data has four tasks: selecting modeling techniques, generating test designs, building models, and assessing models.
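
To make these tasks concrete, the sketch below uses simple linear regression as the chosen technique, a held-out test set as the test design, an ordinary least-squares fit as the model build, and mean absolute error as the assessment. The data is synthetic (generated from y = 3x + 10 plus noise) and the function names are hypothetical. A real project would choose techniques and metrics to suit its own data.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(xs, ys, slope, intercept):
    """Average absolute gap between predictions and actual values."""
    errors = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Synthetic data: y = 3x + 10 plus uniform noise in [-1, 1].
rng = random.Random(0)
data = [(x, 3.0 * x + 10.0 + rng.uniform(-1.0, 1.0)) for x in range(40)]

# Test design: shuffle, then hold out the last 10 points for assessment.
rng.shuffle(data)
train, test = data[:30], data[30:]

# Build the model on the training set only.
slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])

# Assess the model on data it has not seen.
mae = mean_absolute_error(
    [x for x, _ in test], [y for _, y in test], slope, intercept
)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, test MAE={mae:.2f}")
```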

Evaluation

While the modeling phase includes technical model assessment, this phase is about determining which model best meets business needs. It involves three tasks: evaluating results, reviewing the process, and determining next steps.

Deployment

The final phase is about putting the model to work. It includes four tasks: developing and documenting a plan for deploying the model, developing a monitoring and maintenance plan, producing a final report, and reviewing the project.

ASUM-DM

In 2015, IBM published an extension to CRISP-DM called the Analytics Solutions Unified Method for Data Mining (ASUM-DM). It takes CRISP-DM as a baseline but builds out the deployment phase to include collaboration, version control, security, and compliance.

Data mining software and tools

Companies use a variety of data mining software and tools to support their efforts. Some of the more popular software and tools include:

H2O. This open source machine learning platform can be integrated through an API and uses distributed in-memory computing for analyzing massive datasets.
IBM SPSS Modeler. IBM’s visual data science and machine learning solution can be used for data preparation, discovery, predictive analytics, model management, and deployment.
KNIME. Open source platform KNIME is aimed at data analytics, reporting, and integration.
Oracle Data Mining (ODM). ODM is part of Oracle Database Enterprise Edition, offering data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature extraction, and specialized analytics.
Orange Data Mining. Orange is an open source data visualization, machine learning, and data mining toolkit.
R. This open source programming language and free software environment is widely used by data miners. R also has commercial support and extensions, notably from Revolution Analytics, which Microsoft acquired in 2015. Microsoft has integrated R with its SQL Server offerings, Power BI, Azure SQL Managed Instance, Azure Cortana Intelligence, Microsoft ML Server, and Visual Studio 2017. Oracle, IBM, and Tibco also support R in their offerings.
RapidMiner. Geared for teams, the RapidMiner data science platform supports data prep, machine learning, and predictive model deployment.
SAS Enterprise Miner. SAS Enterprise Miner is aimed at creating predictive and descriptive models on large volumes of data from sources across the organization.
Sisense. Sisense’s BI stack covers everything from the database through ETL and analytics to visualization.

Data mining jobs

Data mining is most often conducted by data scientists or data analysts. Here are some of the most popular job titles related to data mining and the salary range for each position, according to data from PayScale:

Business intelligence analyst: $52K-$90K
Business intelligence architect: $72K-$140K
Business intelligence developer: $62K-$109K
Data analyst: $43K-$90K
Data engineer: $44K-$141K
Data scientist: $66K-$130K
Senior data analyst: $63K-$108K
Statistician: $44K-$159K
