Machine Learning for Product Managers

Understanding the Fundamentals of Machine Learning

We're taking a deep dive into the fascinating world of Machine Learning (ML). While it might sound complex, understanding the basics can give you a significant edge in your career. Let's break it down step-by-step:

The Essence of Machine Learning

At its core, machine learning is about teaching computers to learn from data. Imagine you're teaching a child to identify cats and dogs. You show them many pictures, and they start noticing patterns like fur, tails, and ears. Eventually, they can identify new cats and dogs. ML works similarly, but with mathematical models instead of a child's brain. These models look for patterns in data to make predictions or decisions.

The Learning Process: Empirical Risk Minimization

Here's where it gets interesting. ML models learn through a process called Empirical Risk Minimization. Don't let the term scare you – it's simpler than it sounds. The model makes a guess, checks how far off it was (this is the "loss"), and then adjusts to do better next time. This process repeats many times. The goal is to minimize the average error (or "risk") across all guesses. It's like playing a game where you're trying to get as close to the target as possible, adjusting your aim with each throw.
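The guess-check-adjust loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production training loop; the toy dataset, single parameter, and learning rate are all invented for the example.

```python
# A minimal sketch of empirical risk minimization: fitting a single
# parameter w so that predictions w * x match the targets y.
# The (input, target) pairs and learning rate are illustrative.

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

def average_loss(w):
    """Empirical risk: the mean squared error over all examples."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

w = 0.0               # the model's initial guess
learning_rate = 0.05
for step in range(200):
    # Gradient of the average squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad  # adjust the guess to reduce the loss

print(round(w, 2))  # w settles near 2, the slope that best fits this data
```

Each pass through the loop is one "throw" in the dart-game analogy: measure how far off the guess was, then nudge the aim.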

Types of Machine Learning

Machine learning is broadly divided into supervised and unsupervised learning.

  1. Supervised machine learning is when the correct outputs are explicitly given; these outputs are often called labels.

  2. Unsupervised machine learning is when these labels are not given.

This distinction shapes what can be done with the data. Returning to the earlier learning analogy: supervised ML is like studying with the correct answers provided, whereas unsupervised ML gives you no answer key, and you have to make sense of the problem on your own. Think of solving a numerical problem versus writing an essay based on your lessons.

The kinds of problems common in ML are classification, regression, and clustering problems. Classification problems are where you have to classify data (numerical data, images, text, sound, etc.) into different classes like positive/negative, matching/not matching, or dog/cat, given labelled data of the same. We often treat the labels as "correct" probabilities, and the model uses them to learn how to make predictions. Regression problems, by contrast, ask for a continuous numeric output (such as a price), while clustering problems group similar data points together without any labels at all.
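To make the classification idea concrete, here is a toy supervised classifier: a 1-nearest-neighbour rule that labels a new animal "cat" or "dog" by finding the most similar labelled example. The measurements (height in cm, weight in kg) are made up for illustration.

```python
# Toy supervised classification: label a new animal from labelled examples
# by picking the label of the closest known animal. All numbers are invented.

labelled = [
    ((25, 4.0), "cat"), ((23, 3.5), "cat"),
    ((60, 30.0), "dog"), ((55, 25.0), "dog"),
]

def classify(animal):
    """Return the label of the nearest labelled example."""
    def distance(example):
        (h, w), _ = example
        return (h - animal[0]) ** 2 + (w - animal[1]) ** 2
    _, label = min(labelled, key=distance)
    return label

print(classify((24, 3.8)))   # close to the cats -> "cat"
print(classify((58, 28.0)))  # close to the dogs -> "dog"
```

Real classifiers learn a decision rule rather than comparing against every example, but the essence is the same: labelled data defines what "correct" looks like.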

What Are Models, and How Exactly Do They Learn?

In simple terms, a machine learning model is a tool that makes decisions based on data. Imagine you want to create a model that can tell whether an animal is a cat or a dog. You feed the model different characteristics of the animals, like their height, weight, and the length of their nose. The model uses these inputs to calculate a value, which helps it decide whether the animal is a cat or a dog.

To improve the model's accuracy, it goes through a learning process where it adjusts its internal settings based on the data it sees, trying to minimize errors. However, if the model learns too much from the specific data it's given, it might struggle with new data – this is known as overfitting. On the other hand, if it doesn't learn enough, it might not be able to make accurate predictions, which is called underfitting.
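Overfitting can be illustrated with an extreme case: a "model" that simply memorises its training labels. It scores perfectly on data it has already seen but cannot handle anything new. The examples below are invented for illustration.

```python
# An extreme overfitting sketch: memorising the training data gives perfect
# training accuracy but no ability to generalise. Data is invented.

train = {(25, 4.0): "cat", (60, 30.0): "dog"}  # (height, weight) -> label

def memorising_model(animal):
    # Returns the stored label, or gives up on anything it has not seen.
    return train.get(animal, "unknown")

print(memorising_model((25, 4.0)))   # "cat" -- perfect on training data
print(memorising_model((26, 4.1)))   # "unknown" -- fails on new data
```

A well-fit model would instead capture the general pattern (small animals tend to be cats) so that slightly different inputs still get sensible answers.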

How are models evaluated?

To evaluate a machine learning model's performance, data is split into two sets: a train set for learning and a test set for evaluation. The model learns patterns from the train set and is then tested on the unseen test set to see how well it predicts new data. For example, in a cat vs. dog classification, the model might be trained on 8 animals and tested on 2, allowing us to measure its accuracy—the fraction of correct predictions.
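The 8-train / 2-test split described above can be sketched directly. The weight-threshold "model" and all the numbers are invented for illustration; the point is only the mechanics of splitting, training, and measuring accuracy on held-out data.

```python
# A sketch of a train/test split: 10 labelled animals, 8 for training a
# simple weight-threshold rule, 2 held out for testing. Numbers are invented.

animals = [  # (weight in kg, label)
    (3.5, "cat"), (4.0, "cat"), (4.5, "cat"), (3.8, "cat"),
    (25.0, "dog"), (30.0, "dog"), (22.0, "dog"), (28.0, "dog"),  # train
    (4.2, "cat"), (26.0, "dog"),                                 # test
]
train, test = animals[:8], animals[8:]

# "Training": pick the midpoint between the heaviest cat and lightest dog.
heaviest_cat = max(w for w, label in train if label == "cat")
lightest_dog = min(w for w, label in train if label == "dog")
threshold = (heaviest_cat + lightest_dog) / 2

def predict(weight):
    return "dog" if weight > threshold else "cat"

correct = sum(predict(w) == label for w, label in test)
accuracy = correct / len(test)
print(accuracy)  # 1.0 -- both held-out animals are classified correctly
```

Because the test animals were never used to choose the threshold, the accuracy here estimates how the rule would behave on genuinely new data.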

Beyond accuracy, a confusion matrix helps break down the model's performance into four key categories: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). From this, we can calculate Precision (how many predicted positives were correct), Recall (how many actual positives were correctly identified), and the F1 score, which balances Precision and Recall.
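These metrics follow directly from the four confusion-matrix counts. The counts below are invented for a hypothetical cat-vs-dog classifier where "cat" is the positive class; the formulas themselves are standard.

```python
# Standard metrics computed from confusion-matrix counts.
# The counts are made up for illustration; "cat" is the positive class.

tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)           # how many predicted positives were right
recall    = tp / (tp + fn)           # how many actual positives were found
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

Notice that accuracy alone can hide problems: a model can have high accuracy but poor recall if positives are rare, which is exactly why the confusion matrix breakdown matters.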

In some cases, a validation set is used during training to monitor the model's performance before final testing. This helps ensure the model generalizes well to new data. Finally, inference is when the model is applied to real-world data to make predictions, showing how well it performs in practical scenarios.

Business Considerations and Best Practices for Implementing Machine Learning

Adopting machine learning in a business comes with its own set of challenges and benefits. To keep services reliable and results consistent, businesses must follow best practices for managing machine learning models and the data used to train them.

First, there's the ETL (Extract, Transform, Load) process: extracting data from various sources, applying necessary transformations like data cleaning and image preprocessing, and loading the prepared data where the training pipeline can use it. While this technical step might not require much business attention, it's crucial for ensuring that the data is prepared correctly for training.
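A minimal ETL step might look like the following sketch, which extracts rows from a CSV source, transforms them by dropping records with missing weights, and loads the cleaned result into memory. The field names and data are invented for illustration; real pipelines would read from databases or files and load into a warehouse.

```python
# A toy ETL step: extract rows from CSV, transform (drop missing weights,
# parse numbers), and load the result for training. Data is invented.

import csv
import io

raw = "name,weight\nwhiskers,4.0\nrex,\nbuddy,28.0\n"  # pretend data source

rows = list(csv.DictReader(io.StringIO(raw)))          # Extract
cleaned = [                                            # Transform
    {"name": r["name"], "weight": float(r["weight"])}
    for r in rows if r["weight"]                       # drop blank weights
]
training_data = cleaned                                # Load (here: in memory)
print(training_data)
```

Even in this tiny example, the transform step silently discards a record ("rex"), which is exactly the kind of detail that makes data preparation worth auditing.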

Managing the data used in machine learning is another key aspect. Since training data is constantly evolving—with new data being added and old data discarded—proper versioning is essential. Tools like Data Version Control (DVC) help track which models were trained on which datasets. Given the large size of data often involved (sometimes in the GBs or TBs), specialized tools like data lakes and data warehouses are used to manage this data efficiently.

Machine learning code, like any other software, can have bugs and requires version control and continuous integration practices. This ensures that the code is maintained and updated efficiently. Additionally, ML models need to be hosted, often within backend infrastructures using frameworks like Flask or Django. If a business plans to offer ML models as a service, they'll need to ensure high uptime and adhere to service level agreements.

Furthermore, ML models require ongoing performance tuning and retraining to remain effective as data distributions shift over time. This continuous training process is essential for keeping models up-to-date. Finally, storing model results, hyperparameters, and other repetitive information in a database is important for managing and comparing different versions over time.

This post is brought to you by Mandar Bhurchandi, who is a fellow writer from our writer’s program.