
  • Logistic Regression (Machine Learning Algorithm)

    Description : BY EXAMPLE... let’s say your friend gives you a puzzle to solve. There are only 2 outcome scenarios – either you solve it or you don’t. Now imagine that you are given a wide range of puzzles and quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this – if you are given a tenth-grade trigonometry problem, you are 70% likely to solve it. On the other hand, if it is a fifth-grade history question, the probability of getting an answer is only 30%. This is what Logistic Regression provides you. It is a classification algorithm, used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function; hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected). There are many different steps that could be tried in order to improve the model: including interaction terms, removing features, applying regularization techniques, or using a non-linear model. Algorithm : Coming to the math, the log odds of the outcome are modeled as a linear combination of the predictor variables:

      odds = p / (1 - p) = probability of event occurrence / probability of event not occurring
      ln(odds) = ln(p / (1 - p))
      logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk

    Above, p is the probability of presence of the characteristic of interest. The model chooses parameters that maximize the likelihood of observing the sample values, rather than parameters that minimize the sum of squared errors (as in ordinary regression).
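    A minimal sketch of this in Python, assuming scikit-learn is available; the toy features (grade level, problem difficulty) and the data points are hypothetical, invented only to mirror the puzzle example above.

      # Hypothetical toy data: [grade_level, difficulty] -> solved (1) or not (0)
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      X = np.array([[10, 2], [10, 3], [5, 8], [5, 9], [8, 4], [6, 7]])
      y = np.array([1, 1, 0, 0, 1, 0])

      # Fits logit(p) = b0 + b1*X1 + b2*X2 by maximizing the likelihood
      model = LogisticRegression().fit(X, y)

      print(model.intercept_, model.coef_)         # b0 and [b1, b2]
      print(model.predict_proba([[10, 2]])[0, 1])  # probability of solving, between 0 and 1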

  • Decision Tree (Machine Learning Algorithm)

    Description : The best way to understand how a decision tree works is to play Jezzball – a classic game from Microsoft. Essentially, you have a room with moving balls and you need to build walls such that the maximum area gets cleared off without the balls. So, every time you split the room with a wall, you are trying to create 2 different populations within the same room. Decision trees work in a very similar fashion, by dividing a population into groups that are as different as possible. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets. This is done based on the most significant attributes/independent variables, to make the groups as distinct as possible. Algorithm : To split the population into different heterogeneous groups, it uses various techniques like Gini, Information Gain, Chi-square, and entropy.
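    As a hedged sketch of those splitting techniques in Python with scikit-learn; the iris dataset and the depth limit are arbitrary choices for illustration only.

      # Decision tree sketch: criterion="gini" (or "entropy") selects the
      # splitting technique named above; each split makes the groups more homogeneous.
      from sklearn.datasets import load_iris
      from sklearn.tree import DecisionTreeClassifier, export_text

      X, y = load_iris(return_X_y=True)
      tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)

      print(export_text(tree))  # the learned "walls": one rule per split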

  • Machine Learning Algorithm - Support Vector Machine (SVM)

    Description : By example, if we only had two features like height and hair length of an individual, we’d first plot these two variables in two-dimensional space, where each point has two co-ordinates. Next, we find some line that splits the data between the two differently classified groups. This will be the line such that the distances from the closest point in each of the two groups are farthest away; those closest points are known as Support Vectors. In this example, the line which splits the data into two differently classified groups is the black line, since the two closest points are the farthest apart from the line. This line is our classifier. Then, depending on which side of the line the testing data lands, that is the class we assign to the new data. It is a classification method. Algorithm : In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
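    A minimal linear-SVM sketch in Python with scikit-learn; the (height, hair length) points below are hypothetical, chosen only to echo the two-feature example.

      # Linear SVM sketch: find the line with the widest margin between the groups.
      import numpy as np
      from sklearn.svm import SVC

      X = np.array([[175, 5], [180, 4], [170, 6],      # hypothetical group 0
                    [165, 30], [160, 35], [158, 28]])  # hypothetical group 1
      y = np.array([0, 0, 0, 1, 1, 1])

      clf = SVC(kernel="linear").fit(X, y)

      print(clf.support_vectors_)      # the closest points that define the margin
      print(clf.predict([[172, 10]]))  # which side of the line a new point falls on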

  • Naïve Bayes (Machine Learning Algorithm)

    Description : It is a classification technique based on Bayes’ theorem, with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier would consider all of these properties to independently contribute to the probability that this fruit is an apple. Let’s understand it using an example. Suppose we have a training data set of weather and a corresponding target variable ‘Play’, and we need to classify whether players will play or not based on the weather condition. Let’s follow the steps below to perform it. Step 1: Convert the data set to a frequency table. Step 2: Create a Likelihood table by finding the probabilities, like Overcast probability = 0.29 and probability of playing = 0.64. Naive Bayes uses this method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes. A Naive Bayesian model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods. Algorithm : Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):

      P(c|x) = P(x|c) * P(c) / P(x)

    Here, P(c|x) is the posterior probability of the class (target) given the predictor (attribute), P(c) is the prior probability of the class, P(x|c) is the likelihood – the probability of the predictor given the class – and P(x) is the prior probability of the predictor.
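    A small sketch of this in Python using scikit-learn’s CategoricalNB; the 14-row weather/Play table below is hypothetical, but constructed so that P(Overcast) = 4/14 ≈ 0.29 and P(Play) = 9/14 ≈ 0.64, matching the likelihood-table figures quoted above.

      # Naive Bayes sketch over a single categorical weather feature.
      from sklearn.naive_bayes import CategoricalNB
      from sklearn.preprocessing import OrdinalEncoder

      weather = [["Sunny"]] * 5 + [["Overcast"]] * 4 + [["Rainy"]] * 5
      play    = [1, 1, 0, 0, 0,   1, 1, 1, 1,          1, 1, 1, 0, 0]

      enc = OrdinalEncoder()
      X = enc.fit_transform(weather)  # encode category labels as integers

      nb = CategoricalNB().fit(X, play)
      # Posterior via Bayes: P(Play | Sunny) = P(Sunny | Play) * P(Play) / P(Sunny)
      print(nb.predict_proba(enc.transform([["Sunny"]])))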

  • Fisher's Linear Discriminant (Machine Learning Algorithm)

    A deep dive... Description : We can view linear classification models in terms of dimensionality reduction. Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition, and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier or, more commonly, for dimensionality reduction before later classification. It is a classification method. Algorithm : To begin, consider a two-class classification problem (K=2), say blue and red points in ℝ². In general, we can take any D-dimensional input vector and project it down to D’ dimensions. Here, D represents the original input dimensions while D’ is the projected space dimensions; throughout this article, consider D’ less than D. In the case of projecting to one dimension (the number line), i.e. D’=1, we can pick a threshold t to separate the classes in the new space. Given an input vector x: if the projected value y = wᵀx >= t, then x belongs to class C1 (class 1); otherwise, it is classified as C2 (class 2). Take a toy dataset where we want to reduce the original data dimensions from D=2 to D’=1. In other words, we want a transformation T that maps vectors in 2D to 1D: T: ℝ² → ℝ¹. First, we compute the mean vectors m1 and m2 for the two classes, where N1 and N2 denote the number of points in classes C1 and C2 respectively. Now, consider using the class means as a measure of separation: that is, project the data onto the vector w joining the 2 class means. It is important to note that any kind of projection to a smaller dimension might involve some loss of information; even two classes that are clearly separable (by a line) in their original space can overlap after such a projection. That is where Fisher’s Linear Discriminant comes into play. The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. In other words, FLD selects a projection that maximizes the class separation. To do that, it maximizes the ratio of the between-class variance to the within-class variance. In short, to project the data to a smaller dimension and to avoid class overlapping, FLD maintains 2 properties: a large variance among the dataset classes, and a small variance within each of the dataset classes. Note that a large between-class variance means that the projected class averages should be as far apart as possible. On the contrary, a small within-class variance has the effect of keeping the projected data points closer to one another. To find the projection with these properties, FLD learns a weight vector w that maximizes the criterion

      J(w) = (m2’ - m1’)² / (s1² + s2²)

    where m1’ and m2’ are the projected class means (equation 1) and s1², s2² are the projected within-class variances (equation 2). Substituting those definitions into J gives equation (3); taking the derivative of (3) w.r.t. w (after some simplifications) yields the learning equation for w (equation 4):

      w ∝ S_W⁻¹ (m2 - m1)

    That is, w (our desired transformation) is directly proportional to the inverse of the within-class covariance matrix S_W times the difference of the class means. As expected, the result allows a perfect class separation with simple thresholding.
For multiple classes, read on https://sthalles.github.io/fisher-linear-discriminant/ .
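    A from-scratch NumPy sketch of the two-class case; the Gaussian blobs are synthetic stand-ins for the blue and red points, and the variable names follow the notation above.

      # Fisher's Linear Discriminant, two classes: w ∝ S_W⁻¹ (m2 - m1).
      import numpy as np

      rng = np.random.default_rng(0)
      X1 = rng.normal([0, 0], 0.5, size=(50, 2))  # class C1
      X2 = rng.normal([3, 2], 0.5, size=(50, 2))  # class C2

      m1, m2 = X1.mean(axis=0), X2.mean(axis=0)   # class means

      # Within-class scatter matrix S_W = S1 + S2
      S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

      w = np.linalg.solve(S_W, m2 - m1)           # the learning equation for w

      # Project to one dimension and pick a threshold t between the projected means
      t = (X1 @ w).mean() / 2 + (X2 @ w).mean() / 2
      print("fraction of C2 above threshold:", ((X2 @ w) >= t).mean())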

  • Data Products 101

    What is a data product? (circa 2024)
    Business : Data that we want to reuse, so we apply the software development lifecycle to it. When we reuse data products we increase consistency, and data consistency provides data quality. A data product is a trusted, reusable, and consumable data asset that solves business problems, generates insights, and/or improves operational efficiency. It can be a database table, a report, an API, or a machine learning model.
    Technical : A data product is made up of metadata and dataset instances, and is designed to be easily accessible to anyone with the right credentials. Data products are the backbone of powerful data apps and help bridge the gap between data producers and consumers.

  • Identity Resolution

    Over the last several years, I have spent a lot of time thinking about identity resolution and how to do it in an ever-expanding ecosystem. Identity resolution is one of the most important ingredients in data assets.
    What is identity resolution and how does it create value? Identity resolution — the process of matching identifiers across devices and touchpoints to a single profile — helps build a cohesive, omnichannel view of a person/donor/organization/consumer, enabling brands to deliver relevant messaging throughout the customer journey. Value is created through a consolidated, accurate 360-degree profile. Identity resolution results when many data sources from many channels and devices are integrated in an accurate, scalable, and privacy-compliant way to create a persistent and addressable individual profile.
    How is it used? Impact reporting — accuracy. Predictive analytics — accuracy.
    Who's doing this in business and talking about it? There is a great description and example at Civis Analytics, another good example at Experian, and a definitive guide from Segment.com.
    Drop me a note and your thoughts. -DCN

  • 2025 Snapshot

    Fisher’s Linear Discriminant: This blog continues to be a resource for data scientists studying FLD.
    19 Countries: Top international visitors come from India, Russia, and Singapore.
    19 States: The US is the number one country for visitors, who come primarily from NY, DC, and CA.
    Look for more posts in 2026 highlighting how to build next-generation products from 0 to 1.

  • Getting started in Analytics & Data Science: UC Berkeley Executive Education

    I'm thinking about graduate school and have an interesting project at work. To help guide both, I have started a 12-week learning adventure with UC Berkeley Haas Data Science: Bridging Principles and Practice. Week 0 was assessment and refresh, using Statistics for Business: Decision Making and Analysis, 3rd Edition.
    VARIABLES
    categorical variable: Column of values in a data table that identifies cases with a common attribute. Sometimes called qualitative or nominal variables (no order; see ordinal). Examples: fruit, types, zip codes, identification numbers. These are not continuous variables.
    ordinal variable: A categorical variable whose labels have a natural order (e.g., a rating system from 10 Best to 1 Worst). A Likert scale is a measurement scale that produces ordinal data, typically with five to seven categories. Another example: Tiny, Small, Med, Large, Jumbo.
    numerical variable: Column of values in a data table that records numerical properties of cases (also called a continuous variable). Examples: amounts, dates, times.
    measurement unit: Scale that defines the meaning of numerical data, such as weights measured in kilograms or purchases measured in dollars. CAUTION: The data that make up a numerical variable in a data table must share a common unit.
    area principle: The area of a plot that shows data should be proportional to the amount of data.
    time series: A sequence of data recorded over time.
    timeplot: A graph of a time series showing the values in chronological order.
    frequency: The time spacing of data recorded in a time series.
    distribution: The collection of values of a variable and how often each occurs.
    frequency table: A tabular summary that shows the distribution of a variable as counts.
    relative frequency: The frequency of a category divided by the number of cases; a proportion or percentage. (A small code sketch of the last two definitions follows the glossary.)
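    To ground the frequency-table and relative-frequency definitions, here is a small pandas sketch; the fruit column is hypothetical.

      # Frequency table and relative frequency for a categorical variable.
      import pandas as pd

      fruit = pd.Series(["apple", "banana", "apple", "orange", "apple", "banana"],
                        dtype="category")  # categorical: labels with no natural order

      freq = fruit.value_counts()               # frequency table (counts)
      rel = fruit.value_counts(normalize=True)  # relative frequency (proportions)
      print(pd.DataFrame({"frequency": freq, "relative frequency": rel}))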

  • Getting started in Analytics & Data Science

    Dec 2018 - Feb 2019 I started with Practical Statistics for Data Scientists . This was a great way to think about the math, visualize it, and get introduced to Python/R. Start with the definition of a mean and terminology like data frame, feature, and outcome. Move through statistics concepts like boxplots, scatterplots, the central limit theorem, the binomial distribution, and significance testing including p-values. There's an entire chapter dedicated to machine learning algorithms like K-Nearest Neighbors, Bagging and Random Forests, Boosting, and more. Pro tip: Focus on visualizing.

  • Data Products: Standards up front

    For those of you looking to establish a new department at work or wanting to play with modeling for the first time, you should think about standards. When is it done? How do you know it's good enough to productize? Standards and protocols.
    When should standards be established? Set standards at the beginning. Even if your standards are a default set by an educational institution, that's good enough. It only counts if you write it down, though.
    Where do I start? At the very least, there are always best approaches and most appropriate models for given problems and data sources (if it's not time-series data, you won't use ARIMA). At the top of the project, take the time to work on a project plan that includes the data models intended for the data exploration and proof-of-concept phases. Each model should have a standard. Here are a couple of examples (a code sketch follows below): Logistic Regression, AUC > .7; Linear Regression, r-squared > .6. Do not allow forest models to be used: the nature of these models obscures the metrics needed for deep analysis. Reserve these models for boosting.
    What are the basic protocols to always use? Establish and codify your standards in the planning phase. Always deliver conclusions with visuals like ROC curves, ggplots, elbow charts, etc.
    Even as a professional, it's always fun to kick off an experiment. Drop me a note and tell me how you do it. -DCN
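    A hedged sketch of codifying such a standard in Python; the threshold mirrors the AUC > .7 example above, and the synthetic dataset and model choice are placeholders, not a prescribed setup.

      # Write the standard down first; let the code enforce it.
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import roc_auc_score
      from sklearn.model_selection import train_test_split

      STANDARDS = {"logistic_regression_auc": 0.7}  # set in the planning phase, in writing

      X, y = make_classification(n_samples=500, random_state=0)
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

      model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
      auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

      # "Done" and "good enough to productize" are defined by the written standard.
      print(f"AUC = {auc:.2f}; meets standard: {auc > STANDARDS['logistic_regression_auc']}")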

  • Data Products: Too good to be true

    If you are productizing a predictive model at work or playing around with MLE in R for the first time, always check the data.
    Roles: When working on a predictive model at work, everyone has a role. Product management drives the requirements and determines rightness. Data Engineering provides data samples and ETL code. Data Scientists provide prepared data samples, models, and code.
    Case Study: A few weeks ago, we finished the second of three models to complete a proof of concept and prepare for roadmap estimates. The PM was testing toward the end of each phase. The team modeler was sharing first-pass results at a weekly check-in, and his first statement was, "I need to check the data before moving forward, but the predictive results on this are amazing! Near 99%." On the team's daily #slack meet the next day, he reported a data error. The same model's AUC was around .33 now, and the model had the ugliest confusion matrix. We tossed it.
    Drop me a note and tell me how you do it. -DCN

