
  • Logistic Regression (Machine Learning Algorithm)

    Description : BY EXAMPLE... let’s say your friend gives you a puzzle to solve. There are only 2 outcome scenarios – either you solve it or you don’t. Now imagine that you are given a wide range of puzzles and quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this – if you are given a tenth-grade trigonometry problem, you are 70% likely to solve it. On the other hand, if it is a fifth-grade history question, the probability of getting an answer is only 30%. This is what Logistic Regression provides you. It is a classification algorithm, used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function; hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected). There are many different steps that could be tried in order to improve the model: including interaction terms, removing features, applying regularization techniques, or using a non-linear model. Algorithm : Coming to the math, the log odds of the outcome are modeled as a linear combination of the predictor variables:

      odds = p / (1 - p) = probability of event occurrence / probability of event not occurring
      ln(odds) = ln(p / (1 - p))
      logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk

    Above, p is the probability of presence of the characteristic of interest. The model chooses parameters that maximize the likelihood of observing the sample values, rather than parameters that minimize the sum of squared errors (as in ordinary regression).
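    A minimal sketch of this in Python, assuming scikit-learn is available; the toy features (grade level, problem difficulty) and the data points are hypothetical, invented only to mirror the puzzle example above.

      # Hypothetical toy data: [grade_level, difficulty] -> solved (1) or not (0)
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      X = np.array([[10, 2], [10, 3], [5, 8], [5, 9], [8, 4], [6, 7]])
      y = np.array([1, 1, 0, 0, 1, 0])

      # Fits logit(p) = b0 + b1*X1 + b2*X2 by maximizing the likelihood
      model = LogisticRegression().fit(X, y)

      print(model.intercept_, model.coef_)         # b0 and [b1, b2]
      print(model.predict_proba([[10, 2]])[0, 1])  # probability of solving, between 0 and 1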

  • Decision Tree (Machine Learning Algorithm)

    Description : The best way to understand how a decision tree works is to play Jezzball – a classic game from Microsoft. Essentially, you have a room with moving balls and you need to build walls such that the maximum area gets cleared off without the balls. So, every time you split the room with a wall, you are trying to create 2 different populations within the same room. Decision trees work in a very similar fashion, by dividing a population into groups that are as different as possible. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes/independent variables, to make the groups as distinct as possible. Algorithm : To split the population into different heterogeneous groups, it uses various techniques like Gini, Information Gain, Chi-square, and entropy.
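    As a hedged sketch of those splitting techniques in Python with scikit-learn; the iris dataset and the depth limit are arbitrary choices for illustration only.

      # Decision tree sketch: criterion="gini" (or "entropy") selects the
      # splitting technique named above; each split makes the groups more homogeneous.
      from sklearn.datasets import load_iris
      from sklearn.tree import DecisionTreeClassifier, export_text

      X, y = load_iris(return_X_y=True)
      tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)

      print(export_text(tree))  # the learned "walls": one rule per split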

  • Machine Learning Algorithm - Support Vector Machine (SVM)

    Description : By example, if we only had two features like height and hair length of an individual, we’d first plot these two variables in two-dimensional space, where each point has two co-ordinates. Next, we find some line that splits the data between the two differently classified groups. This will be the line such that the distances from the closest point in each of the two groups are farthest away; those closest points are known as Support Vectors. In this example, the line which splits the data into two differently classified groups is the black line, since the two closest points are the farthest apart from the line. This line is our classifier. Then, depending on which side of the line the testing data lands, that is the class we assign to the new data. It is a classification method. Algorithm : In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
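    A minimal linear-SVM sketch in Python with scikit-learn; the (height, hair length) points below are hypothetical, chosen only to echo the two-feature example.

      # Linear SVM sketch: find the line with the widest margin between the groups.
      import numpy as np
      from sklearn.svm import SVC

      X = np.array([[175, 5], [180, 4], [170, 6],      # hypothetical group 0
                    [165, 30], [160, 35], [158, 28]])  # hypothetical group 1
      y = np.array([0, 0, 0, 1, 1, 1])

      clf = SVC(kernel="linear").fit(X, y)

      print(clf.support_vectors_)      # the closest points that define the margin
      print(clf.predict([[172, 10]]))  # which side of the line a new point falls on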

  • Naïve Bayes (Machine Learning Algorithm)

    Description : It is a classification technique based on Bayes’ theorem, with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple. Let’s understand it using an example. Suppose we have a training data set of weather and a corresponding target variable ‘Play’, and we need to classify whether players will play or not based on the weather condition. Let’s follow the steps below to perform it. Step 1: Convert the data set to a frequency table. Step 2: Create a Likelihood table by finding the probabilities, like Overcast probability = 0.29 and probability of playing = 0.64. Naive Bayes uses this method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes. A Naive Bayesian model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods. Algorithm : Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):

      P(c|x) = P(x|c) * P(c) / P(x)

    Here, P(c|x) is the posterior probability of the class (target) given the predictor (attribute), P(c) is the prior probability of the class, P(x|c) is the likelihood – the probability of the predictor given the class – and P(x) is the prior probability of the predictor.
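    A small sketch of this in Python using scikit-learn’s CategoricalNB; the 14-row weather/Play table below is hypothetical, but constructed so that P(Overcast) = 4/14 ≈ 0.29 and P(Play) = 9/14 ≈ 0.64, matching the likelihood-table figures quoted above.

      # Naive Bayes sketch over a single categorical weather feature.
      from sklearn.naive_bayes import CategoricalNB
      from sklearn.preprocessing import OrdinalEncoder

      weather = [["Sunny"]] * 5 + [["Overcast"]] * 4 + [["Rainy"]] * 5
      play    = [1, 1, 0, 0, 0,   1, 1, 1, 1,          1, 1, 1, 0, 0]

      enc = OrdinalEncoder()
      X = enc.fit_transform(weather)  # encode category labels as integers

      nb = CategoricalNB().fit(X, play)
      # Posterior via Bayes: P(Play | Sunny) = P(Sunny | Play) * P(Play) / P(Sunny)
      print(nb.predict_proba(enc.transform([["Sunny"]])))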

  • Fisher's Linear Discriminant (Machine Learning Algorithm)

    A deep dive... Description : We can view linear classification models in terms of dimensionality reduction. Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification. It is a classification method. Algorithm : To begin, consider a two-class classification problem (K=2), say blue and red points in ℝ². In general, we can take any D-dimensional input vector and project it down to D’ dimensions. Here, D represents the original input dimensions while D’ is the projected space dimensions; throughout this article, consider D’ less than D. In the case of projecting to one dimension (the number line), i.e. D’=1, we can pick a threshold t to separate the classes in the new space. Given an input vector x: if the projected value y = wᵀx >= t, then x belongs to class C1 (class 1); otherwise, it is classified as C2 (class 2). Take a toy dataset where we want to reduce the original data dimensions from D=2 to D’=1. In other words, we want a transformation T that maps vectors in 2D to 1D: T: ℝ² → ℝ¹. First, we compute the mean vectors m1 and m2 for the two classes, where N1 and N2 denote the number of points in classes C1 and C2 respectively. Now, consider using the class means as a measure of separation: that is, project the data onto the vector w joining the 2 class means. It is important to note that any kind of projection to a smaller dimension might involve some loss of information; even two classes that are clearly separable (by a line) in their original space can overlap after such a projection. That is where Fisher’s Linear Discriminant comes into play. The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. In other words, FLD selects a projection that maximizes the class separation. To do that, it maximizes the ratio of the between-class variance to the within-class variance. In short, to project the data to a smaller dimension and to avoid class overlapping, FLD maintains 2 properties: a large variance among the dataset classes, and a small variance within each of the dataset classes. Note that a large between-class variance means that the projected class averages should be as far apart as possible. On the contrary, a small within-class variance has the effect of keeping the projected data points closer to one another. To find the projection with these properties, FLD learns a weight vector w that maximizes the criterion

      J(w) = (m2’ - m1’)² / (s1² + s2²)

    where m1’ and m2’ are the projected class means (equation 1) and s1², s2² are the projected within-class variances (equation 2). Substituting those definitions into J gives equation (3); taking the derivative of (3) w.r.t. w (after some simplifications) yields the learning equation for w (equation 4):

      w ∝ S_W⁻¹ (m2 - m1)

    That is, w (our desired transformation) is directly proportional to the inverse of the within-class covariance matrix S_W times the difference of the class means. As expected, the result allows a perfect class separation with simple thresholding.
For multiple classes, read on https://sthalles.github.io/fisher-linear-discriminant/ .
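    A from-scratch NumPy sketch of the two-class case; the Gaussian blobs are synthetic stand-ins for the blue and red points, and the variable names follow the notation above.

      # Fisher's Linear Discriminant, two classes: w ∝ S_W⁻¹ (m2 - m1).
      import numpy as np

      rng = np.random.default_rng(0)
      X1 = rng.normal([0, 0], 0.5, size=(50, 2))  # class C1
      X2 = rng.normal([3, 2], 0.5, size=(50, 2))  # class C2

      m1, m2 = X1.mean(axis=0), X2.mean(axis=0)   # class means

      # Within-class scatter matrix S_W = S1 + S2
      S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

      w = np.linalg.solve(S_W, m2 - m1)           # the learning equation for w

      # Project to one dimension and pick a threshold t between the projected means
      t = (X1 @ w).mean() / 2 + (X2 @ w).mean() / 2
      print("fraction of C2 above threshold:", ((X2 @ w) >= t).mean())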

  • Data Products 101

    What is a data product? (circa 2024)
    Business : Data that we want to reuse, so we apply the software development lifecycle to it. When we reuse data products we increase consistency, and data consistency provides data quality. A data product is a trusted, reusable, and consumable data asset that solves business problems, generates insights, and/or improves operational efficiency. It can be a database table, a report, an API, or a machine learning model.
    Technical : A data product is made up of metadata and dataset instances, and is designed to be easily accessible to anyone with the right credentials. Data products are the backbone of powerful data apps and help bridge the gap between data producers and consumers.

  • Identity Resolution

    Over the last several years, I have spent a lot of time thinking about identity resolution and how to do it in an ever-expanding ecosystem. Identity resolution is one of the most important ingredients in data assets.
    What is identity resolution and how does it create value? Identity resolution — the process of matching identifiers across devices and touchpoints to a single profile — helps build a cohesive, omnichannel view of a person/donor/organization/consumer, enabling brands to deliver relevant messaging throughout the customer journey. Value is created through a consolidated, accurate 360-degree profile. Identity resolution results when many data sources from many channels and devices are integrated in an accurate, scalable, and privacy-compliant way to create a persistent and addressable individual profile.
    How is it used? Impact reporting — accuracy. Predictive analytics — accuracy.
    Who's doing this in business and talking about it? There is a great description and example at Civis Analytics, another good example at Experian, and a definitive guide from Segment.com.
    Drop me a note and your thoughts. -DCN

  • 2025 Snapshot

    Fisher’s Linear Discriminant: This blog continues to be a resource for data scientists studying FLD.
    19 Countries: Top international visitors come from India, Russia, and Singapore.
    19 States: The US is the number one country for visitors, who come primarily from NY, DC, and CA.
    Look for more posts in 2026 highlighting how to build next-generation products from 0 to 1.

  • Getting started in Analytics & Data Science: UC Berkeley Executive Education

    I'm thinking about graduate school and have an interesting project at work. To help guide both, I have started a 12-week learning adventure with UC Berkeley Haas Data Science: Bridging Principles and Practice. Week 0 was assessment and refresh, using Statistics for Business: Decision Making and Analysis, 3rd Edition.
    VARIABLES
    categorical variable: Column of values in a data table that identifies cases with a common attribute. Sometimes called qualitative or nominal variables (no order; see ordinal). Examples: fruit, types, zip codes, identification numbers. These are not continuous variables.
    ordinal variable: A categorical variable whose labels have a natural order (e.g., a rating system from 10 Best to 1 Worst). A Likert scale is a measurement scale that produces ordinal data, typically with five to seven categories. Another example: Tiny, Small, Med, Large, Jumbo.
    numerical variable: Column of values in a data table that records numerical properties of cases (also called a continuous variable). Examples: amounts, dates, times.
    measurement unit: Scale that defines the meaning of numerical data, such as weights measured in kilograms or purchases measured in dollars. CAUTION: The data that make up a numerical variable in a data table must share a common unit.
    area principle: The area of a plot that shows data should be proportional to the amount of data.
    time series: A sequence of data recorded over time.
    timeplot: A graph of a time series showing the values in chronological order.
    frequency: The time spacing of data recorded in a time series.
    distribution: The collection of values of a variable and how often each occurs.
    frequency table: A tabular summary that shows the distribution of a variable as counts.
    relative frequency: The frequency of a category divided by the number of cases; a proportion or percentage. (A small code sketch of the last two definitions follows the glossary.)
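    To ground the frequency-table and relative-frequency definitions, here is a small pandas sketch; the fruit column is hypothetical.

      # Frequency table and relative frequency for a categorical variable.
      import pandas as pd

      fruit = pd.Series(["apple", "banana", "apple", "orange", "apple", "banana"],
                        dtype="category")  # categorical: labels with no natural order

      freq = fruit.value_counts()               # frequency table (counts)
      rel = fruit.value_counts(normalize=True)  # relative frequency (proportions)
      print(pd.DataFrame({"frequency": freq, "relative frequency": rel}))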

  • Getting started in Analytics & Data Science

    Dec 2018 - Feb 2019 I started with Practical Statistics for Data Scientists . This was a great way to think about the math, visualize it, and get introduced to Python/R. Start with the definition of a mean and terminology like data frame, feature, and outcome. Move through statistics concepts like boxplots, scatterplots, the central limit theorem, the binomial distribution, and significance testing including p-values. There's an entire chapter dedicated to machine learning algorithms like K-Nearest Neighbors, Bagging and Random Forests, Boosting, and more. Pro tip: Focus on visualizing.

  • Data Products: Standards up front

    For those of you looking to establish a new department at work or wanting to play with modeling for the first time, you should think about standards. When is it done? How do you know it's good enough to productize? Standards and protocols.
    When should standards be established? Set standards at the beginning. Even if your standards are a default set by an educational institution, that's good enough. It only counts if you write it down, though.
    Where do I start? At the very least, there are always best approaches and most appropriate models for given problems and data sources (if it's not time-series data, you won't use ARIMA). At the top of the project, take the time to work on a project plan that includes the data models intended for the data exploration and proof-of-concept phases. Each model should have a standard. Here are a couple of examples (a code sketch follows below): Logistic Regression, AUC > .7; Linear Regression, r-squared > .6. Do not allow forest models to be used: the nature of these models obscures the metrics needed for deep analysis. Reserve these models for boosting.
    What are the basic protocols to always use? Establish and codify your standards in the planning phase. Always deliver conclusions with visuals like ROC curves, ggplots, elbow charts, etc.
    Even as a professional, it's always fun to kick off an experiment. Drop me a note and tell me how you do it. -DCN
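    A hedged sketch of codifying such a standard in Python; the threshold mirrors the AUC > .7 example above, and the synthetic dataset and model choice are placeholders, not a prescribed setup.

      # Write the standard down first; let the code enforce it.
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import train_test_split

      STANDARDS = {"logistic_regression_auc": 0.7}  # set in the planning phase, in writing

      X, y = make_classification(n_samples=500, random_state=0)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

      model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
      auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

      # "Done" and "good enough to productize" are defined by the written standard.
      print(f"AUC = {auc:.2f}; meets standard: {auc > STANDARDS['logistic_regression_auc']}")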

  • Data Products: Too good to be true

    If you are productizing a predictive model at work or playing around with MLE in R for the first time, always check the data.
    Roles: When working on a predictive model at work, everyone has a role. Product management drives the requirements and determines rightness. Data Engineering provides data samples and ETL code. Data Scientists provide prepared data samples, models, and code.
    Case Study: A few weeks ago, we finished the second of three models to complete a proof of concept and prepare for roadmap estimates. The PM was testing toward the end of each phase. The team modeler was sharing first-pass results at a weekly check-in, and his first statement was, "I need to check the data before moving forward, but the predictive results on this are amazing! Near 99%." On the team's daily #slack meet the next day, he reported a data error. The same model's AUC was around .33 now, and the model had the ugliest confusion matrix. We tossed it.
    Drop me a note and tell me how you do it. -DCN

