- Fisher's Linear Discriminant (Machine Learning Algorithm)
A deep dive. Description: We can view linear classification models in terms of dimensionality reduction. Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification. At its core, it is a classification method.

Algorithm: To begin, consider the case of a two-class classification problem (K=2), with blue and red points in ℝ². In general, we can take any D-dimensional input vector and project it down to D' dimensions, where D is the original input dimension and D' is the dimension of the projected space. Throughout this article, consider D' less than D. In the case of projecting to one dimension (the number line), i.e. D'=1, we can pick a threshold t to separate the classes in the new space. Given an input vector x: if the projected value y = Wᵀx satisfies y >= t, then x belongs to class C1 (class 1); otherwise, it is classified as C2 (class 2).

Take the dataset below as a toy example. We want to reduce the original data dimensions from D=2 to D'=1. In other words, we want a transformation T that maps vectors in 2D to 1D, T: ℝ² → ℝ¹. First, let's compute the mean vectors m1 and m2 for the two classes, where N1 and N2 denote the number of points in classes C1 and C2 respectively:

m1 = (1/N1) Σ x over C1,   m2 = (1/N2) Σ x over C2   (1)

Now, consider using the class means as a measure of separation. In other words, we want to project the data onto the vector W joining the two class means. It is important to note that any kind of projection to a smaller dimension may involve some loss of information. In this scenario, the two classes are clearly separable (by a line) in their original space, yet projecting onto the line joining their means can still leave the projected points overlapping. That is where Fisher's Linear Discriminant comes into play.

The idea proposed by Fisher is to maximize a function that gives a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. In other words, FLD selects a projection that maximizes class separation. To do that, it maximizes the ratio of the between-class variance to the within-class variance. In short, to project the data to a smaller dimension and avoid class overlap, FLD maintains two properties: a large variance among the dataset classes, and a small variance within each of the dataset classes. A large between-class variance means that the projected class averages should be as far apart as possible; a small within-class variance keeps the projected data points closer to one another.

To find the projection with these properties, FLD learns a weight vector W with the criterion

J(W) = (m2' − m1')² / (s1² + s2²)

where mk' = Wᵀmk is the projected mean of class Ck and sk² = Σ (Wᵀx − mk')² over Ck is the within-class variance of the projected data (2). If we substitute the mean vectors m1 and m2 as well as the variances s1² and s2² as given by equations (1) and (2), we arrive at

J(W) = (Wᵀ S_B W) / (Wᵀ S_W W)   (3)

where S_B = (m2 − m1)(m2 − m1)ᵀ is the between-class covariance matrix and S_W is the within-class covariance matrix. If we take the derivative of (3) with respect to W (after some simplifications), we get the learning equation

W ∝ S_W⁻¹ (m2 − m1)   (4)

That is, W (our desired transformation) is directly proportional to the inverse of the within-class covariance matrix times the difference of the class means. As expected, the result allows a perfect class separation with simple thresholding.
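To make equation (4) concrete, here is a minimal NumPy sketch, not from the original post, that computes the Fisher direction for two synthetic classes and projects the data onto it; the arrays X1, X2 and the midpoint threshold are illustrative choices.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Compute the Fisher discriminant direction w ∝ S_W^{-1} (m2 - m1).

    X1, X2: arrays of shape (N1, D) and (N2, D) holding the samples
    of class C1 and class C2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix S_W = S_1 + S_2
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    S_W = S1 + S2
    w = np.linalg.solve(S_W, m2 - m1)      # equation (4), up to scale
    return w / np.linalg.norm(w)

# Toy 2D example: project D=2 data down to D'=1 and threshold it.
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 0.5, size=(50, 2))
X2 = rng.normal([3, 3], 0.5, size=(50, 2))
w = fisher_direction(X1, X2)
y1, y2 = X1 @ w, X2 @ w                    # projected values y = w^T x
t = (y1.mean() + y2.mean()) / 2            # simple midpoint threshold
print("classes separated by threshold:", (y1 < t).all() and (y2 >= t).all())
```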
For multiple classes, read on https://sthalles.github.io/fisher-linear-discriminant/ .
- Data Products 101
What is a data product? (circa 2024)

Business: A data product is data that we want to reuse, so we apply the software development lifecycle to it. When we reuse data products we increase consistency, and data consistency drives data quality. A data product is a trusted, reusable, and consumable data asset that solves business problems, generates insights, and/or improves operational efficiency. It can be a database table, a report, an API, or a machine learning model.

Technical: A data product is made up of metadata and dataset instances, and it is designed to be easily accessible to anyone with the right credentials. Data products are the backbone of powerful data apps and help bridge the gap between data producers and consumers.
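Purely as an illustrative sketch, and not something from the post, the "metadata plus dataset instances" idea might be modeled as a small Python structure; every field name here is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class DataProduct:
    """Hypothetical shape of a data product: metadata wrapped around dataset instances."""
    name: str
    owner: str                 # the accountable producer or team
    description: str           # the business problem it solves or insight it generates
    schema: dict               # column name -> type; the contract consumers rely on
    instances: list = field(default_factory=list)       # e.g. table names, file paths, API endpoints
    quality_checks: list = field(default_factory=list)  # e.g. "no null ids", "refreshed daily"
    refreshed_on: Optional[date] = None

# Example: a table-backed data product that reporting and ML consumers can both reuse.
donations = DataProduct(
    name="donations_daily",
    owner="data-engineering",
    description="Daily donation totals for impact reporting",
    schema={"donor_id": "string", "amount": "decimal", "donated_at": "timestamp"},
    instances=["warehouse.analytics.donations_daily"],
    quality_checks=["no null donor_id", "amount >= 0"],
)
print(donations.name, "->", donations.instances)
```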
- Identity Resolution
Over the last several years, I have spent a lot of time thinking about identity resolution and how to do it in an ever-expanding ecosystem. Identity resolution is one of the most important ingredients in data assets.

What is identity resolution and how does it create value? Identity resolution — the process of matching identifiers across devices and touchpoints to a single profile — helps build a cohesive, omnichannel view of a person/donor/organization/consumer, enabling brands to deliver relevant messaging throughout the customer journey. Value is created through a consolidated, accurate 360-degree profile. Identity resolution happens when many data sources, spanning many channels and devices, are integrated in an accurate, scalable, and privacy-compliant way to create a persistent and addressable individual profile (a toy matching sketch appears after this post).

How is it used? Impact reporting — accuracy. Predictive analytics — accuracy.

Who's doing this in business and talking about it? A great description and example at Civis Analytics. Another good example at Experian. A definitive guide from Segment.com.

Drop me a note with your thoughts. -DCN
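That toy sketch, under the assumption of purely deterministic matching on shared identifiers (the emails and device IDs below are made up); production identity resolution adds probabilistic matching, scale, and privacy controls.

```python
from collections import defaultdict

def resolve_identities(records):
    """Group records that share any identifier (email, phone, device_id)
    into a single profile using union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Link every pair of records that share an identifier value.
    seen = {}
    for idx, rec in enumerate(records):
        for key in ("email", "phone", "device_id"):
            value = rec.get(key)
            if value:
                if (key, value) in seen:
                    union(idx, seen[(key, value)])
                else:
                    seen[(key, value)] = idx

    profiles = defaultdict(list)
    for idx in range(len(records)):
        profiles[find(idx)].append(records[idx])
    return list(profiles.values())

# Two touchpoints from the same person (shared email), plus one unrelated record.
records = [
    {"email": "dana@example.com", "device_id": "web-123"},
    {"email": "dana@example.com", "phone": "555-0100"},
    {"email": "sam@example.com"},
]
print(len(resolve_identities(records)))  # 2 profiles
```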
- 2025 Snapshot
Fisher's Linear Discriminant: This blog continues to be a resource for data scientists studying FLD. 19 countries: Top international visitors came from India, Russia, and Singapore. 19 states: The US is the number one country for visitors, coming primarily from NY, DC, and CA. Look for more posts in 2026 highlighting how to build next-generation products from 0 to 1.
- Getting started in Analytics & Data Science: UC Berkeley Executive Education
I'm thinking about graduate school and have an interesting project at work. To help guide both, I have started a 12-week learning adventure with UC Berkeley Haas Data Science: Bridging Principles and Practice. Week 0 was assessment and refresh, using Statistics for Business: Decision Making and Analysis, 3rd Edition.

VARIABLES
categorical variable: A column of values in a data table that identifies cases with a common attribute. Sometimes called qualitative or nominal variables (no order; see ordinal). Examples: fruit, types, zip codes, identification numbers. These are not continuous variables.
ordinal variable: A categorical variable whose labels have a natural order (e.g., a rating system from 10 best to 1 worst). A Likert scale is a measurement scale that produces ordinal data, typically with five to seven categories. Another example: Tiny, Small, Medium, Large, Jumbo.
numerical variable: A column of values in a data table that records numerical properties of cases (also called a continuous variable). Examples: amounts, dates, times.
measurement unit: The scale that defines the meaning of numerical data, such as weights measured in kilograms or purchases measured in dollars. CAUTION: The data that make up a numerical variable in a data table must share a common unit.
area principle: The area of a plot that shows data should be proportional to the amount of data.
time series: A sequence of data recorded over time.
timeplot: A graph of a time series showing the values in chronological order.
frequency: The time spacing of data recorded in a time series.
distribution: The collection of values of a variable and how often each occurs.
frequency table: A tabular summary that shows the distribution of a variable as counts.
relative frequency: The frequency of a category divided by the number of cases; a proportion or percentage.
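As an aside not taken from the course notes, here is a small pandas sketch showing how nominal, ordinal, and numerical variables can be declared so the ordering assumptions above are explicit; the column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "banana", "apple"],   # categorical (nominal): no order
    "size": ["Small", "Jumbo", "Medium"],    # ordinal: labels with a natural order
    "price_usd": [1.20, 0.55, 1.10],         # numerical, shared unit (dollars)
})

# Nominal categorical: values are labels, order is meaningless.
df["fruit"] = df["fruit"].astype("category")

# Ordinal: declare the order explicitly so comparisons and sorting respect it.
size_order = pd.CategoricalDtype(
    categories=["Tiny", "Small", "Medium", "Large", "Jumbo"], ordered=True
)
df["size"] = df["size"].astype(size_order)

print(df.dtypes)
print(df.sort_values("size"))   # sorts Small < Medium < Jumbo, not alphabetically
print(df["size"] > "Small")     # ordinal comparison works once the order is declared
```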
- Getting started in Analytics & Data Science
Dec 2018 - Feb 2019 I started with Practical Statistics for Data Scientists . This was a great way to think about the math, visualize it and get introduced to Python/R. Start with the definition of a mean and terminology like data frame, feature and outcomes. Move through statistics concepts like boxplots, scatterplots, central limit theorem, binomial distribution and significance testing including P-values. There's an entire chapter dedicated to Machine Learning algorithms like K-Nearest Neighbors, Bagging and Random Forests, Boosting and more. Pro tip: Focus on visualizing.
- Data Products: Standards up front
For those of you looking to establish a new department at work or wanting to play with modeling for the first time, you should think about standards. When is it done? How do you know it's good enough to productize? Standards and protocols.

When should standards be established? Set standards at the beginning. Even if your standards are a default set by an educational institution, that's good enough. It only counts if you write it down, though.

Where do I start? At the very least, there are always best approaches and most appropriate models for a given problem and data source (if it's not time-series data, you won't use ARIMA). At the top of the project, take the time to work on a project plan that includes the data models intended for the data exploration and proof-of-concept phases. Each model should have a standard. Here are a couple of examples (a small acceptance-check sketch follows this post): Logistic regression, AUC > 0.7. Linear regression, R-squared > 0.6. Do not allow random forest models to be used; the nature of these models keeps us from accessing the metrics needed for deep analysis. Reserve these models for boosting.

What are the basic protocols to always use? Establish and codify your standards in the planning phase. Always deliver conclusions with visuals like ROC curves, ggplots, elbow charts, etc. Even as a professional, it's always fun to kick off an experiment.

Drop me a note and tell me how you do it. -DCN
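A hedged illustration of codifying one of those standards, not taken from the post: the sketch below fits a logistic regression on synthetic data and checks it against the AUC > 0.7 threshold; the dataset and split are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

AUC_STANDARD = 0.7  # written down up front, before modeling starts

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"AUC = {auc:.3f}")
if auc > AUC_STANDARD:
    print("Meets the standard: candidate for productizing.")
else:
    print("Below the standard: stays in the exploration phase.")
```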
- Data Products: Too good to be true
If you are productizing a predictive model at work or playing around with MLE in R for the first time, always check the data.

Roles: When working on a predictive model at work, everyone has a role. Product management drives the requirements and determines rightness. Data Engineering provides data samples and ETL code. Data Scientists provide prepared data samples, models, and code.

Case study: A few weeks ago, we finished the second of three models to complete a proof of concept and prepare for roadmap estimates. The PM was testing toward the end of each phase. The team modeler shared first-pass results at a weekly check-in, and his first statement was "I need to check the data before moving forward, but the predictive results on this are amazing! Near 99%". On the team's daily #slack meet the next day, he reported a data error. The same model's AUC was now around 0.33, and the model had the ugliest confusion matrix. We tossed it.

Drop me a note and tell me how you do it. -DCN
- Data Products: Case Studies to build trust
If you are productizing predictive analytics or just trying to get buy-in around a model, try a case study to keep the momentum building and the project moving forward. Partner with revenue teams and client-facing leaders to recommend clients for invitation-only case studies. Select organizations with leadership that believes in innovation and the size to try new things. Manage your resources: hedge your bets and try to have at least two participants in each study, but don't underestimate how much work this can be, and pick a maximum number of participants. Be prepared to curate and guide a journey through lessons learned and success. Drop me a note and tell me how you do it. -DCN
- Data Strategy: Path to Monetization
PREMISE: The world of data has leaped forward in the last few years. Real-time streaming data has become more accessible, and future-forward businesses are taking the helm to turn their data into gold.

STRATEGY: Strategy driven by data maturity principles and product management methodology, organized around core principles, essential methods, and a data maturity assessment.

Core principles: Everything starts with business value and business-driven use cases. Data governance is the foundation and an ongoing practice. Strategy execution is driven by data maturity principles and product management methodologies.

Essential methods: Vision with a north star, annual shared goals, and team autonomy to solve problems. Data governance and data products, because the SDLC drives quality. Action with best practices and evolving KPIs, including picking the right product and engineering organization and creating a transparent and accountable organization.

Annual data maturity assessment: Where are you? Where do you need to go? Where do you want to go? Frameworks and assessments for facets like data governance, data management, quality, privacy, security, policy, architecture, and literacy.

How to get there: Phase 1: Democratize the data. Phase 2: Empower & modernize the workplace. Phase 3: Impact flywheel.
- KNN or K Nearest Neighbors (Machine Learning Algorithm)
Description: KNN maps easily to real life. If you want to learn about a person about whom you have no information, you might find out about their close friends and the circles they move in to learn about them. KNN can be used for both classification and regression problems, though it is more widely used for classification in industry. If K = 1, the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge when performing KNN modeling.

Algorithm: K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their K nearest neighbors, where nearness is measured by a distance function. These distance functions can be Euclidean, Manhattan, Minkowski, or Hamming distance. The first three are used for continuous variables and the fourth (Hamming) for categorical variables.
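A minimal sketch of the majority-vote step, not from the post, using Euclidean distance on a made-up 2D dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote of its k nearest training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two clusters labeled 0 and 1.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.0, 5.1]), k=3))  # -> 1
```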
- Logistic Regression (Machine Learning Algorithm)
Description: By example, let's say your friend gives you a puzzle to solve. There are only two outcome scenarios: either you solve it or you don't. Now imagine you are given a wide range of puzzles and quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this: if you are given a trigonometry-based tenth-grade problem, you are 70% likely to solve it; on the other hand, if it is a fifth-grade history question, the probability of getting the answer is only 30%. This is what logistic regression provides. It is a classification algorithm, used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variables. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function; hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected). There are many different steps that could be tried to improve the model: including interaction terms, removing features, regularization techniques, or using a non-linear model.

Algorithm: Coming to the math, the log odds of the outcome are modeled as a linear combination of the predictor variables:

odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring
logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

Above, p is the probability of presence of the characteristic of interest. Logistic regression chooses parameters that maximize the likelihood of observing the sample values, rather than those that minimize the sum of squared errors (as in ordinary regression).
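A brief sketch, not from the post, of fitting the logit model above by maximum likelihood with scikit-learn and reading back the probability and log-odds; the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary outcome with 3 predictors.
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=7)

# Fit by maximum likelihood: learns b0 (intercept) and b1..b3 (coefficients).
model = LogisticRegression().fit(X, y)
print("intercept b0:", model.intercept_)
print("coefficients b1..b3:", model.coef_)

# Probabilities lie between 0 and 1; the linear part is the log-odds.
x_new = X[:1]
p = model.predict_proba(x_new)[0, 1]
log_odds = model.decision_function(x_new)[0]   # b0 + b1*x1 + b2*x2 + b3*x3
print("p =", round(p, 3), " ln(p/(1-p)) =", round(log_odds, 3),
      " check:", round(np.log(p / (1 - p)), 3))
```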





