{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Case 21.6 Segmenting Consumers of Bath Soap\n", "\n", "> (c) 2019 Galit Shmueli, Peter C. Bruce, Peter Gedeck \n", ">\n", "> Case study included in\n", ">\n", "> _Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python_ (First Edition) \n", "> Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel. 2019." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "no display found. Using non-interactive Agg backend\n" ] } ], "source": [ "from pathlib import Path\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.cluster import KMeans\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.ensemble import BaggingClassifier\n", "\n", "import matplotlib.pylab as plt\n", "\n", "import dmba\n", "from dmba import classificationSummary, gainsChart\n", "\n", "from IPython.display import display_html\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(600, 46)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Member idSECFEHMTSEXAGEEDUHSCHILDCS...PropCat 6PropCat 7PropCat 8PropCat 9PropCat 10PropCat 11PropCat 12PropCat 13PropCat 14PropCat 15
010100104310144241...0.0000000.0000000.0000000.0000000.00.0000000.0280370.00.1308410.339564
110100203210224421...0.3470480.0268340.0161000.0143110.00.0590340.0000000.00.0805010.000000
210140202310245641...0.1212120.0335500.0108230.0086580.00.0000000.0162340.00.5616880.003247
31014030400040050...0.0000000.0000000.0000000.0000000.00.0000000.0000000.00.6000000.000000
410141904110234431...0.0000000.0000000.0481930.0000000.00.0000000.0000000.00.1445780.000000
\n", "

5 rows × 46 columns

\n", "
" ], "text/plain": [ " Member id SEC FEH MT SEX AGE EDU HS CHILD CS ... PropCat 6 \\\n", "0 1010010 4 3 10 1 4 4 2 4 1 ... 0.000000 \n", "1 1010020 3 2 10 2 2 4 4 2 1 ... 0.347048 \n", "2 1014020 2 3 10 2 4 5 6 4 1 ... 0.121212 \n", "3 1014030 4 0 0 0 4 0 0 5 0 ... 0.000000 \n", "4 1014190 4 1 10 2 3 4 4 3 1 ... 0.000000 \n", "\n", " PropCat 7 PropCat 8 PropCat 9 PropCat 10 PropCat 11 PropCat 12 \\\n", "0 0.000000 0.000000 0.000000 0.0 0.000000 0.028037 \n", "1 0.026834 0.016100 0.014311 0.0 0.059034 0.000000 \n", "2 0.033550 0.010823 0.008658 0.0 0.000000 0.016234 \n", "3 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 \n", "4 0.000000 0.048193 0.000000 0.0 0.000000 0.000000 \n", "\n", " PropCat 13 PropCat 14 PropCat 15 \n", "0 0.0 0.130841 0.339564 \n", "1 0.0 0.080501 0.000000 \n", "2 0.0 0.561688 0.003247 \n", "3 0.0 0.600000 0.000000 \n", "4 0.0 0.144578 0.000000 \n", "\n", "[5 rows x 46 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bathSoap_df = dmba.load_data('BathSoapHousehold.csv')\n", "print(bathSoap_df.shape)\n", "bathSoap_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 1:\n", "Use $k$-means clustering to identify clusters of households based on:\n", "\n", "1. The variables that describe purchase behavior (including brand loyalty)\n", "2. The variables that describe the basis for purchase\n", "3. The variables that describe both purchase behavior and basis of purchase\n", "\n", "Note 1: How should $k$ be chosen? Think about how the clusters would be used. It is likely that the marketing efforts would support two to five different promotional approaches.\n", "\n", "Note 2: How should the percentages of total purchases comprised by various brands be treated? Isn't a customer\n", "who buys all brand A just as loyal as a customer who buys all brand B? What will be the effect on any distance measure of using the brand share variables as is? 
Consider using a single derived variable.\n", "\n", "We look first at clusters based on purchase behavior, then clusters based on the basis for purchase, then clusters based on both. The complexity of marketing to 5 segments would probably not be supported by clustering just based on purchase behavior, or clustering just based on basis for purchase, so we will look at 2-3 clusters for those variables, and more when we cluster using both sets of variables.\n", "\n", "In choosing $k$, we would seek a $k$ that produces clusters that are distinct and separate from one another, in ways (variables) that are translatable into marketing actions. The variables we have been asked to consider are those that relate to purchase behavior (volume and frequency of purchase, brand loyalty), and a separate set that relate to the basis for purchase (response to promotions, pricing, and selling proposition)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Demographic Index(['SEC', 'FEH', 'MT', 'SEX', 'AGE', 'EDU', 'HS', 'CHILD', 'CS',\n", " 'Affluence Index'],\n", " dtype='object')\n", "Purchase Index(['No. of Brands', 'Brand Runs', 'Total Volume', 'No. of Trans', 'Value',\n", " 'Trans / Brand Runs', 'Vol/Tran', 'Avg. Price '],\n", " dtype='object')\n", "Promotion Index(['Pur Vol No Promo - %', 'Pur Vol Promo 6 %', 'Pur Vol Other Promo %'], dtype='object')\n", "Brand Index(['Br. Cd. 57, 144', 'Br. Cd. 55', 'Br. Cd. 272', 'Br. Cd. 286',\n", " 'Br. Cd. 24', 'Br. Cd. 481', 'Br. Cd. 352', 'Br. Cd. 
5'],\n", " dtype='object')\n", "Other brand Index(['Others 999'], dtype='object')\n", "Price category Index(['Pr Cat 1', 'Pr Cat 2', 'Pr Cat 3', 'Pr Cat 4'], dtype='object')\n", "Selling property Index(['PropCat 5', 'PropCat 6', 'PropCat 7', 'PropCat 8', 'PropCat 9',\n", " 'PropCat 10', 'PropCat 11', 'PropCat 12', 'PropCat 13', 'PropCat 14',\n", " 'PropCat 15'],\n", " dtype='object')\n" ] } ], "source": [ "# Group columns into sets for further analysis\n", "demographicIndicators = bathSoap_df.columns[1:11]\n", "purchaseIndicator = bathSoap_df.columns[11:19]\n", "withinPromotionIndicator = bathSoap_df.columns[19:22]\n", "brandIndicator = bathSoap_df.columns[22:30]\n", "otherBrandIndicator = bathSoap_df.columns[30:31]\n", "priceCategoryIndicator = bathSoap_df.columns[31:35]\n", "sellingPropertyIndicator = bathSoap_df.columns[35:46]\n", "\n", "print('Demographic', demographicIndicators)\n", "print('Purchase', purchaseIndicator)\n", "print('Promotion', withinPromotionIndicator)\n", "print('Brand', brandIndicator)\n", "print('Other brand', otherBrandIndicator)\n", "print('Price category', priceCategoryIndicator)\n", "print('Selling property', sellingPropertyIndicator)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clusters based on \"purchase behavior\"\n", "Note: Some thought is needed about brand loyalty. For brand loyalty indicators, we have data on \n", " \n", "1. percent of purchases devoted to major brands (i.e. is a customer a total devotee of brand A?), `brandIndicator` \n", "2. a catch-all variable for percent of purchases devoted to other smaller brands (to reduce complexity of analysis), `otherBrandIndicator`, and\n", "3. a derived variable that indicates the maximum share devoted to any one brand. 
\n", " \n", "Since CRISA is compiling this data for general marketing use, and not on behalf of one particular brand, we can\n", "say a customer who is fully devoted to brand A is similar to a customer fully devoted to brand B - both\n", "are fully loyal customers in their behavior. But if we include all the brand shares in the clustering, the\n", "analysis will treat those two customers as very different. \n", "\n", "1. Number of different brands: `No. of Brands`\n", "2. Switching between brands: `Brand Runs`\n", "3. Proportion of purchases that go to different brands: We use the information in the `brandIndicator` to determine the maximum proportion a customer spends on one brand (new variable `maxBrandIndicator`)\n", "\n", "We derive the value of `maxBrandIndicator` by taking the maximum of all specific brand indicators." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "bathSoap_df['maxBrandIndicator'] = bathSoap_df[brandIndicator].max(axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this analysis, we use all `purchaseIndicator` variables together with `maxBrandIndicator` and `otherBrandIndicator` as a description of the customer's purchase behavior." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['No. of Brands', 'Brand Runs', 'Total Volume', 'No. of Trans', 'Value', 'Trans / Brand Runs', 'Vol/Tran', 'Avg. 
Price ', 'Others 999', 'maxBrandIndicator']\n" ] } ], "source": [ "behaviorIndicator = list(purchaseIndicator) + list(otherBrandIndicator) + ['maxBrandIndicator']\n", "print(behaviorIndicator)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Normalizing the data and definition of helper functions" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Normalize the data\n", "bathSoap_df_norm = (bathSoap_df - bathSoap_df.mean())/bathSoap_df.std()\n", "\n", "def clusterSizes(kmeans):\n", " return pd.Series(kmeans.labels_).value_counts().sort_index()\n", "\n", "\n", "def clusterCenters(kmeans, indicator):\n", " return bathSoap_df_norm[indicator].groupby(kmeans.labels_).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Two clusters" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 317\n", "1 283\n", "dtype: int64\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
No. of BrandsBrand RunsTotal VolumeNo. of TransValueTrans / Brand RunsVol/TranAvg. PriceOthers 999maxBrandIndicator
00.4836110.6328650.1582220.5221150.306960-0.261283-0.2853830.2796890.488928-0.588026
1-0.541712-0.708898-0.177232-0.584843-0.3438380.2926740.319669-0.313291-0.5476690.658673
\n", "
" ], "text/plain": [ " No. of Brands Brand Runs Total Volume No. of Trans Value \\\n", "0 0.483611 0.632865 0.158222 0.522115 0.306960 \n", "1 -0.541712 -0.708898 -0.177232 -0.584843 -0.343838 \n", "\n", " Trans / Brand Runs Vol/Tran Avg. Price Others 999 maxBrandIndicator \n", "0 -0.261283 -0.285383 0.279689 0.488928 -0.588026 \n", "1 0.292674 0.319669 -0.313291 -0.547669 0.658673 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters = KMeans(n_clusters=2, random_state=1).fit(bathSoap_df_norm[behaviorIndicator])\n", "print(clusterSizes(clusters))\n", "clusterCenters(clusters, behaviorIndicator)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comment: The two clusters are well separated on every variable except `Total Volume`, where the gap between the cluster centers is smallest. \n", "- Cluster 0 (n=317) is high activity & value, with low loyalty. \n", "- Cluster 1 (n=283) is the reverse. \n", "\n", "(\"Value\" here refers to the variable itself, the total dollar value of purchases, not some broader meaning.)\n", "\n", "Note: Due to the randomization element in the $k$-means process, different runs can produce different\n", "cluster results." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 169\n", "1 255\n", "2 176\n", "dtype: int64\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
No. of BrandsBrand RunsTotal VolumeNo. of TransValueTrans / Brand RunsVol/TranAvg. PriceOthers 999maxBrandIndicator
00.9566711.0893360.6140671.0705820.744401-0.255378-0.2025620.1263560.253722-0.476069
1-0.301247-0.224690-0.531892-0.425770-0.453789-0.246577-0.2662040.2328780.611302-0.536321
2-0.482156-0.7204650.180994-0.411120-0.0573150.6024770.580199-0.458739-1.1293231.234190
\n", "
" ], "text/plain": [ " No. of Brands Brand Runs Total Volume No. of Trans Value \\\n", "0 0.956671 1.089336 0.614067 1.070582 0.744401 \n", "1 -0.301247 -0.224690 -0.531892 -0.425770 -0.453789 \n", "2 -0.482156 -0.720465 0.180994 -0.411120 -0.057315 \n", "\n", " Trans / Brand Runs Vol/Tran Avg. Price Others 999 maxBrandIndicator \n", "0 -0.255378 -0.202562 0.126356 0.253722 -0.476069 \n", "1 -0.246577 -0.266204 0.232878 0.611302 -0.536321 \n", "2 0.602477 0.580199 -0.458739 -1.129323 1.234190 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters = KMeans(n_clusters=3, random_state=1).fit(bathSoap_df_norm[behaviorIndicator])\n", "print(clusterSizes(clusters))\n", "clusterCenters(clusters, behaviorIndicator)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comment: \n", "- Cluster 0 (n=169) is not at all loyal, favoring many brands, and of high value.\n", "- Cluster 1 (n=255) is also not very loyal, but may be of the least interest since its customers have the lowest value.\n", "- Cluster 2 (n=176) is highly loyal, favoring main brands and bigger individual purchases, with middling overall value. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clusters based on \"basis for purchase\"\n", "The variables used are: `Pur Vol No Promo - %`, `Pur Vol Promo 6 %`, `Pur Vol Other Promo %`, all price categories, and selling propositions 5 and 14 (most people seemed to be responding to one or the other of these promotions/propositions)." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Pur Vol No Promo - %', 'Pur Vol Promo 6 %', 'Pur Vol Other Promo %', 'Pr Cat 1', 'Pr Cat 2', 'Pr Cat 3', 'Pr Cat 4', 'PropCat 5', 'PropCat 14']\n" ] } ], "source": [ "purchaseBasisIndicator = list(withinPromotionIndicator) + list(priceCategoryIndicator) \n", "purchaseBasisIndicator.extend(['PropCat 5', 'PropCat 14'])\n", "print(purchaseBasisIndicator)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0 523\n", "1 77\n", "dtype: int64\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Pur Vol No Promo - %Pur Vol Promo 6 %Pur Vol Other Promo %Pr Cat 1Pr Cat 2Pr Cat 3Pr Cat 4PropCat 5PropCat 14
0-0.0302420.059272-0.0263650.1165710.166971-0.3514980.0492710.162794-0.352144
10.205410-0.4025860.179079-0.791774-1.1341012.387450-0.334659-1.1057302.391835
\n", "
" ], "text/plain": [ " Pur Vol No Promo - % Pur Vol Promo 6 % Pur Vol Other Promo % Pr Cat 1 \\\n", "0 -0.030242 0.059272 -0.026365 0.116571 \n", "1 0.205410 -0.402586 0.179079 -0.791774 \n", "\n", " Pr Cat 2 Pr Cat 3 Pr Cat 4 PropCat 5 PropCat 14 \n", "0 0.166971 -0.351498 0.049271 0.162794 -0.352144 \n", "1 -1.134101 2.387450 -0.334659 -1.105730 2.391835 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters = KMeans(n_clusters=2, random_state=1).fit(bathSoap_df_norm[purchaseBasisIndicator])\n", "print(clusterSizes(clusters))\n", "clusterCenters(clusters, purchaseBasisIndicator)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comment: The two clusters are well separated across most variables. \n", "- Cluster 0 (n=523) shows a less clear profile.\n", "- Cluster 1 (n=77) is notable for its responsiveness to price category 3 and selling proposition 14, coupled with aversion to price categories 1 and 2 and selling proposition 5." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 97\n", "1 429\n", "2 74\n", "dtype: int64\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Pur Vol No Promo - %Pur Vol Promo 6 %Pur Vol Other Promo %Pr Cat 1Pr Cat 2Pr Cat 3Pr Cat 4PropCat 5PropCat 14
0-1.7422341.6050760.8185940.334189-0.265602-0.3692250.4583200.002103-0.373063
10.349656-0.287933-0.2084600.0629320.258829-0.336154-0.0429010.195635-0.335935
20.256679-0.4347170.135482-0.802892-1.1523562.432767-0.352060-1.1369082.436529
\n", "
" ], "text/plain": [ " Pur Vol No Promo - % Pur Vol Promo 6 % Pur Vol Other Promo % Pr Cat 1 \\\n", "0 -1.742234 1.605076 0.818594 0.334189 \n", "1 0.349656 -0.287933 -0.208460 0.062932 \n", "2 0.256679 -0.434717 0.135482 -0.802892 \n", "\n", " Pr Cat 2 Pr Cat 3 Pr Cat 4 PropCat 5 PropCat 14 \n", "0 -0.265602 -0.369225 0.458320 0.002103 -0.373063 \n", "1 0.258829 -0.336154 -0.042901 0.195635 -0.335935 \n", "2 -1.152356 2.432767 -0.352060 -1.136908 2.436529 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clusters = KMeans(n_clusters=3, random_state=1).fit(bathSoap_df_norm[purchaseBasisIndicator])\n", "print(clusterSizes(clusters))\n", "clusterCenters(clusters, purchaseBasisIndicator)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comment: \n", "- Cluster 0 (n=97) needs promotions, likes price categories 1 and 4, and is not responsive to the two selling propositions.\n", "- Cluster 1 (n=429) corresponds mainly to the large cluster (n=523) in the two-cluster case.\n", "- Cluster 2 (n=74) has the same profile as the small cluster (n=77) in the two-cluster case. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clusters based on all of the above variables" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0 527\n", "1 73\n", "dtype: int64\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
No. of Brands0.071443-0.515759
Brand Runs0.106566-0.769322
Total Volume-0.0212060.153087
No. of Trans0.054136-0.390817
Value0.068413-0.493883
Trans / Brand Runs-0.1441001.040286
Vol/Tran-0.0749720.541233
Avg. Price0.180076-1.299999
Others 9990.173881-1.255280
maxBrandIndicator-0.1949491.407373
Pur Vol No Promo - %-0.0275580.198948
Pur Vol Promo 6 %0.057591-0.415757
Pur Vol Other Promo %-0.0286480.206818
Pr Cat 10.110774-0.799701
Pr Cat 20.161535-1.166152
Pr Cat 3-0.3365242.429429
Pr Cat 40.045662-0.329644
PropCat 50.156704-1.131277
PropCat 14-0.3369952.432829
\n", "
" ], "text/plain": [ " 0 1\n", "No. of Brands 0.071443 -0.515759\n", "Brand Runs 0.106566 -0.769322\n", "Total Volume -0.021206 0.153087\n", "No. of Trans 0.054136 -0.390817\n", "Value 0.068413 -0.493883\n", "Trans / Brand Runs -0.144100 1.040286\n", "Vol/Tran -0.074972 0.541233\n", "Avg. Price 0.180076 -1.299999\n", "Others 999 0.173881 -1.255280\n", "maxBrandIndicator -0.194949 1.407373\n", "Pur Vol No Promo - % -0.027558 0.198948\n", "Pur Vol Promo 6 % 0.057591 -0.415757\n", "Pur Vol Other Promo % -0.028648 0.206818\n", "Pr Cat 1 0.110774 -0.799701\n", "Pr Cat 2 0.161535 -1.166152\n", "Pr Cat 3 -0.336524 2.429429\n", "Pr Cat 4 0.045662 -0.329644\n", "PropCat 5 0.156704 -1.131277\n", "PropCat 14 -0.336995 2.432829" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combinedIndicator = behaviorIndicator + purchaseBasisIndicator \n", "clusters = KMeans(n_clusters=2, random_state=1).fit(bathSoap_df_norm[combinedIndicator])\n", "print(clusterSizes(clusters))\n", "clusterCenters(clusters, combinedIndicator).transpose()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comment: The two clusters are separated on almost all variables, Value being an important exception. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01
SEC2.3719173.424658
FEH2.0474382.054795
MT8.2466797.684932
SEX1.7628081.561644
AGE3.2447822.986301
EDU4.2751422.369863
HS4.1916514.191781
CHILD3.2011393.465753
CS0.9373810.890411
Affluence Index18.1688808.726027
\n", "
" ], "text/plain": [ " 0 1\n", "SEC 2.371917 3.424658\n", "FEH 2.047438 2.054795\n", "MT 8.246679 7.684932\n", "SEX 1.762808 1.561644\n", "AGE 3.244782 2.986301\n", "EDU 4.275142 2.369863\n", "HS 4.191651 4.191781\n", "CHILD 3.201139 3.465753\n", "CS 0.937381 0.890411\n", "Affluence Index 18.168880 8.726027" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def clusterDemographics(kmeans):\n", " return bathSoap_df[demographicIndicators].groupby(kmeans.labels_).mean()\n", "\n", "clusterDemographics(clusters).transpose()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cluster 1 (n=73) is the more loyal cluster, with lower socioeconomic status, educational level, and affluence." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0 70\n", "1 252\n", "2 278\n", "dtype: int64\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
No. of Brands-0.5838930.177248-0.013647
Brand Runs-0.8005690.416647-0.176098
Total Volume0.083386-0.3076940.257921
No. of Trans-0.4334210.230518-0.099824
Value-0.555616-0.0026530.142308
Trans / Brand Runs1.037226-0.251914-0.032818
Vol/Tran0.513840-0.4961850.320395
Avg. Price-1.3182530.712855-0.314251
Others 999-1.2641360.507151-0.141412
maxBrandIndicator1.421065-0.5371560.129097
Pur Vol No Promo - %0.210216-0.4322230.338867
Pur Vol Promo 6 %-0.4360050.437706-0.286984
Pur Vol Other Promo %0.2142710.152045-0.191778
Pr Cat 1-0.7972260.853929-0.573325
Pr Cat 2-1.220714-0.3517430.626220
Pr Cat 32.495499-0.425507-0.242652
Pr Cat 4-0.336964-0.0845610.161500
PropCat 5-1.142977-0.2170190.484522
PropCat 142.498321-0.425938-0.242972
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
SEC3.4000002.0476192.683453
FEH2.0428571.9166672.169065
MT7.6714297.8174608.633094
SEX1.5428571.7222221.802158
AGE3.0142863.2500003.230216
EDU2.3571434.5158734.039568
HS3.9142863.7460324.665468
CHILD3.5142863.3134923.089928
CS0.8714290.9047620.971223
Affluence Index8.54285720.59920615.910072
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def display_side_by_side(*args):\n", " html_str = ''.join(df.to_html() for df in args)\n", " display_html(html_str.replace('table','table style=\"display:inline\"'),raw=True)\n", "\n", "\n", "combinedIndicator = behaviorIndicator + purchaseBasisIndicator \n", "clusters = KMeans(n_clusters=3, random_state=1).fit(bathSoap_df_norm[combinedIndicator])\n", "print(clusterSizes(clusters))\n", "display_side_by_side(clusterCenters(clusters, combinedIndicator).transpose(),\n", " clusterDemographics(clusters).transpose())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comment:\n", "- Cluster 0 (n=70): Highly loyal, low value, highly responsive to price category 3 and selling proposition 14.\n", "- Cluster 1 (n=252): Low brand loyalty, responsive to price category 1.\n", "- Cluster 2 (n=278): Responsive to price category 2 and selling proposition 5, otherwise somewhat middling." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0 108\n", "1 70\n", "2 202\n", "3 220\n", "dtype: int64\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123
No. of Brands-0.414750-0.5838930.831690-0.374254
Brand Runs-0.323453-0.8005690.968608-0.475846
Total Volume-0.6520970.0833860.2960960.021719
No. of Trans-0.428001-0.4334210.822379-0.407077
Value-0.217635-0.5556160.388601-0.073181
Trans / Brand Runs-0.1372901.037226-0.2979630.010955
Vol/Tran-0.4577230.513840-0.3557630.387861
Avg. Price1.362917-1.3182530.057415-0.302341
Others 9990.523047-1.2641360.371432-0.195586
maxBrandIndicator-0.4286611.421065-0.5547020.267594
Pur Vol No Promo - %0.1932740.210216-0.4777230.276870
Pur Vol Promo 6 %-0.186015-0.4360050.585153-0.307232
Pur Vol Other Promo %-0.0805320.2142710.037105-0.062712
Pr Cat 11.650767-0.7972260.042110-0.595378
Pr Cat 2-0.826802-1.2207140.1330180.672159
Pr Cat 3-0.4786542.495499-0.267225-0.313686
Pr Cat 4-0.405680-0.3369640.0957340.218467
PropCat 5-0.372115-1.142977-0.0717080.612190
PropCat 14-0.4721322.498321-0.270314-0.314949
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| SEC | 1.787037 | 3.400000 | 2.376238 | 2.677273 |
| FEH | 1.657407 | 2.042857 | 2.237624 | 2.068182 |
| MT | 6.703704 | 7.671429 | 9.029703 | 8.281818 |
| SEX | 1.472222 | 1.542857 | 1.920792 | 1.763636 |
| AGE | 3.157407 | 3.014286 | 3.326733 | 3.200000 |
| EDU | 4.157407 | 2.357143 | 4.623762 | 3.990909 |
| HS | 3.037037 | 3.914286 | 4.702970 | 4.377273 |
| CHILD | 3.546296 | 3.514286 | 3.039604 | 3.168182 |
| CS | 0.740741 | 0.871429 | 1.044554 | 0.940909 |
| Affluence Index | 17.944444 | 8.542857 | 20.871287 | 15.727273 |
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "combinedIndicator = behaviorIndicator + purchaseBasisIndicator \n", "clusters = KMeans(n_clusters=4, random_state=1).fit(bathSoap_df_norm[combinedIndicator])\n", "print(clusterSizes(clusters))\n", "display_side_by_side(clusterCenters(clusters, combinedIndicator).transpose(),\n", " clusterDemographics(clusters).transpose())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comment:\n", "- Cluster 0 (n=108) is characterized by low volume, low loyalty, and sensitivity to promotions and price (responsive to price category 1, unresponsive to categories 2 and 3), and is unmoved by selling proposition. Demographically, it is affluent, of high socioeconomic status, and has relatively small family size.\n", "- Cluster 1 (n=70) stands out in both groups of variables: it has high loyalty, low value and price per purchase, a very differential response to price (unresponsive to categories 1, 2 and 4, highly responsive to category 3), and a similarly differential response to selling proposition (unresponsive to #5, highly responsive to #14). Demographically it has low affluence and education.\n", "- Cluster 2 (n=202) is distinguished mostly by the purchase behavior variables: it has low brand loyalty together with high value, volume and frequency. The brand switching seems to be intrinsic; this group is not particularly responsive to promotions, pricing or selling propositions. Demographically it is relatively affluent and educated.\n", "- Cluster 3 (n=220) is a \"gray\" cluster: it has no very extreme or distinctive values on any variable, but it is responsive to price category 2 and selling proposition 5 (similar to cluster 1 in the 3-cluster analysis). Demographically it is relatively affluent and educated. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 2:\n", "Select what you think is the best segmentation and comment on the characteristics (demographic, brand loyalty, and basis for purchase) of these clusters. (This information would be used to guide the development of advertising and promotional campaigns.)\n", "\n", "There is no single \"right\" approach to clustering; different approaches are feasible depending on\n", "different marketing purposes. CRISA is a marketing agency and owns the data, which it collected at\n", "considerable expense, so it will want to be able to use both the data and the segmentation analysis in\n", "different ways for different clients. Here are just a few possible marketing approaches:\n", "\n", "1. Establishing named customer \"personas,\" corresponding to the cluster segments, for use by a client's sales and marketing teams.\n", "2. Establishing named customer \"personas,\" corresponding to the cluster segments, for use by CRISA in providing marketing services to clients.\n", "3. \"Capture affluent market share\" campaign for a client who wants to target more affluent consumers who are not wedded to their current brand, and secure more brand share. \n", "4. \"Down market\" campaign for a data-poor client to build a \"value\" brand for less affluent consumers, much as Dollar General has done in the U.S. \n", "\n", "note: The difference between #1 and #2 is that #1, being confined to a single client, can use that client's\n", "customer data to refine and do more analysis. #2 would have to rely on the data collected by CRISA." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 3:\n", "Develop a model that classifies the data into these segments. 
Since this information would most likely be used in targeting direct-mail promotions, it would be useful to select a market segment that would be defined as a _success_ in the classification model.\n", "\n", "The \"down market\" scenario is the one we will explore further: we develop a predictive model that classifies people as either \"value conscious\" or not. \"Data poor\" means that the client has, or can get, demographic data on their customers, but not detailed purchase data (particularly involving other brands). So a predictive model is to be built using just demographic data. We will look at the results of clustering into two segments based on CRISA's own detailed purchase data, then classify people into those two segments.\n", "\n", "Recall our characterization of the two segments:\n", "\n", "Comment: The two clusters are separated on almost all variables, Value being an important exception. Cluster 1 (n=73) is the more loyal, with lower socioeconomic status and affluence.\n", "\n", "So our \"success\" category is cluster 1, the less affluent, lower socioeconomic group, which also turns out to be highly loyal and, as it happens, spends roughly as much as the more affluent group. This is a promising group around which to build a down-market brand strategy.\n", "\n", "To build the model, we can use only demographic information. CRISA will not have the detailed purchase information for its client's customers." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.10/site-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. 
Set the value of `n_init` explicitly to suppress the warning\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "0 527\n", "1 73\n", "dtype: int64\n" ] } ], "source": [ "clusters = KMeans(n_clusters=2, random_state=1).fit(bathSoap_df_norm[combinedIndicator])\n", "print(clusterSizes(clusters))\n", "\n", "modelData_df = bathSoap_df[demographicIndicators].copy()\n", "modelData_df = pd.get_dummies(modelData_df, columns=['MT', 'FEH'])\n", "modelData_df['y'] = [1 if label == 1 else 0 for label in clusters.labels_]\n", "\n", "train_df, valid_df = train_test_split(modelData_df, test_size=0.4, random_state=1)\n", "\n", "train_X, train_y = train_df.drop(columns=['y']), train_df['y']\n", "valid_X, valid_y = valid_df.drop(columns=['y']), valid_df['y']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Logistic regression" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion Matrix (Accuracy 0.8792)\n", "\n", " Prediction\n", "Actual 0 1\n", " 0 206 2\n", " 1 27 5\n", "Number of customers of interest in validation set 32\n", "Number of customers of interest in 20% top-ranked 19\n", "Ratio 0.59\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model = LogisticRegression(penalty=\"l2\", C=1e42, solver='liblinear')\n", "model.fit(train_X, train_y)\n", "\n", "model_pred = model.predict_proba(valid_X)\n", "result = pd.DataFrame({\n", " 'actual': valid_y,\n", " 'p(0)': [p[0] for p in model_pred],\n", " 'p(1)': [p[1] for p in model_pred],\n", " 'predicted': model.predict(valid_X),\n", "})\n", "# rank validation records by predicted probability of class 1\n", "result = result.sort_values(by=['p(1)'], ascending=False)\n", "\n", "# confusion matrix\n", "classificationSummary(result.actual, result.predicted)\n", "\n", "# gains chart with the top-ranked 20% of records marked\n", "ax = gainsChart(result.actual, figsize=[5, 5])\n", "nx = round(valid_df.shape[0] * 0.2)\n", "ny = sum(result.actual[0:nx])\n", "ax.plot([nx, nx, 0], [0, ny, ny])\n", "\n", "print('Number of customers of interest in validation set', valid_y.sum())\n", "print('Number of customers of interest in 20% top-ranked', ny)\n", "print(f'Ratio {ny / valid_y.sum():.2f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decision tree" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion Matrix (Accuracy 0.8792)\n", "\n", " Prediction\n", "Actual 0 1\n", " 0 201 7\n", " 1 22 10\n", "Number of customers of interest in validation set 32\n", "Number of customers of interest in 20% top-ranked 18\n", "Ratio 0.56\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model = DecisionTreeClassifier(max_depth=3, random_state=1)\n", "model.fit(train_X, train_y)\n", "\n", "model_pred = model.predict_proba(valid_X)\n", "result = pd.DataFrame({\n", " 'actual': valid_y,\n", " 'p(0)': [p[0] for p in model_pred],\n", " 'p(1)': [p[1] for p in model_pred],\n", " 'predicted': model.predict(valid_X),\n", "})\n", "# rank validation records by predicted probability of class 1\n", "result = result.sort_values(by=['p(1)'], ascending=False)\n", "\n", "# confusion matrix\n", "classificationSummary(result.actual, result.predicted)\n", "\n", "# gains chart with the top-ranked 20% of records marked\n", "ax = gainsChart(result.actual, figsize=[5, 5])\n", "nx = round(valid_df.shape[0] * 0.2)\n", "ny = sum(result.actual[0:nx])\n", "ax.plot([nx, nx, 0], [0, ny, ny])\n", "\n", "print('Number of customers of interest in validation set', valid_y.sum())\n", "print('Number of customers of interest in 20% top-ranked', ny)\n", "print(f'Ratio {ny / valid_y.sum():.2f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random forest classifier" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion Matrix (Accuracy 0.8542)\n", "\n", " Prediction\n", "Actual 0 1\n", " 0 200 8\n", " 1 27 5\n", "Number of customers of interest in validation set 32\n", "Number of customers of interest in 20% top-ranked 19\n", "Ratio 0.59\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model = RandomForestClassifier(random_state=1, n_estimators=100)\n", "model.fit(train_X, train_y)\n", "\n", "model_pred = model.predict_proba(valid_X)\n", "result = pd.DataFrame({\n", " 'actual': valid_y,\n", " 'p(0)': [p[0] for p in model_pred],\n", " 'p(1)': [p[1] for p in model_pred],\n", " 'predicted': model.predict(valid_X),\n", "})\n", "# rank validation records by predicted probability of class 1\n", "result = result.sort_values(by=['p(1)'], ascending=False)\n", "\n", "# confusion matrix\n", "classificationSummary(result.actual, result.predicted)\n", "\n", "# gains chart with the top-ranked 20% of records marked\n", "ax = gainsChart(result.actual, figsize=[5, 5])\n", "nx = round(valid_df.shape[0] * 0.2)\n", "ny = sum(result.actual[0:nx])\n", "ax.plot([nx, nx, 0], [0, ny, ny])\n", "\n", "print('Number of customers of interest in validation set', valid_y.sum())\n", "print('Number of customers of interest in 20% top-ranked', ny)\n", "print(f'Ratio {ny / valid_y.sum():.2f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bagging classifier" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Confusion Matrix (Accuracy 0.8542)\n", "\n", " Prediction\n", "Actual 0 1\n", " 0 197 11\n", " 1 24 8\n", "Number of customers of interest in validation set 32\n", "Number of customers of interest in 20% top-ranked 20\n", "Ratio 0.62\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "model = BaggingClassifier(random_state=1, n_estimators=100)\n", "model.fit(train_X, train_y)\n", "\n", "model_pred = model.predict_proba(valid_X)\n", "result = pd.DataFrame({\n", " 'actual': valid_y,\n", " 'p(0)': [p[0] for p in model_pred],\n", " 'p(1)': [p[1] for p in model_pred],\n", " 'predicted': model.predict(valid_X),\n", "})\n", "# rank validation records by predicted probability of class 1\n", "result = result.sort_values(by=['p(1)'], ascending=False)\n", "\n", "# confusion matrix\n", "classificationSummary(result.actual, result.predicted)\n", "\n", "# gains chart with the top-ranked 20% of records marked\n", "ax = gainsChart(result.actual, figsize=[5, 5])\n", "nx = round(valid_df.shape[0] * 0.2)\n", "ny = sum(result.actual[0:nx])\n", "ax.plot([nx, nx, 0], [0, ny, ny])\n", "\n", "print('Number of customers of interest in validation set', valid_y.sum())\n", "print('Number of customers of interest in 20% top-ranked', ny)\n", "print(f'Ratio {ny / valid_y.sum():.2f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary\n", "The comparison of the various models favors the bagging model when the goal is to rank customers: it captures the most customers of interest in the top-ranked 20% of the validation set (20 of 32, versus 18 or 19 for the other models). It does not have the highest accuracy, however: logistic regression and the decision tree both reach 0.8792, compared with 0.8542 for bagging and the random forest." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's next?\n", "Many data mining algorithms are iterative in a mathematical sense - iteration is used to find a good, if\n", "not best, solution. The modeling process itself is also iterative. In initial exploration, we do not seek the\n", "perfect model, merely something to get started. Results are assessed, and we typically continue with a\n", "modified approach.\n", "\n", "Several steps can be explored next to improve predictive performance:\n", "\n", "1. Some of the demographic categorical variables may not have much value when treated as is, that is, as ordered categorical variables. They could be reviewed and turned into binary dummies.\n", "2. 
Instead of using a two-cluster model, a multi-cluster model could be used in hopes of deriving more distinguishable clusters. The non-success clusters could then be consolidated. For example, the n=70 cluster in the 4-cluster model is similar to our cluster 1 (\"success\") in the 2-cluster model, only more sharply defined.\n", "3. Demographic predictors could be added to the original clustering process.\n", "4. The clustering process, which includes a randomization component that yields variability in the resulting clusters, can be repeated to ensure that the cluster labels reflect some degree of stability. Repetition should show some clustering results that are consistent across runs. Choosing for your labels a clustering result that is very inconsistent with the others could mean that you are labeling your market segments according to a chance fluke.\n", "5. In the real world, going beyond the parameters of this case study, CRISA would probably work with the client to add the client's own purchase data to the model to improve it over time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" } }, "nbformat": 4, "nbformat_minor": 2 }
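Item 4 in the list of next steps (repeating the clustering to check stability) needs a way to compare two runs that does not depend on the arbitrary cluster numbers k-means assigns. A minimal sketch, under stated assumptions: `rand_index` is a hypothetical helper written here in plain Python, and the three label lists are made-up stand-ins for `clusters.labels_` from runs with different `random_state` values, not results from the case data.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    '''Share of record pairs on which two cluster labelings agree.

    A pair counts as agreement when both labelings put the two records
    in the same cluster, or both put them in different clusters.
    '''
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical label vectors for the same eight records from three runs.
run_1 = [0, 0, 0, 1, 1, 1, 1, 0]
run_2 = [1, 1, 1, 0, 0, 0, 0, 1]  # same grouping as run_1, label names swapped
run_3 = [0, 1, 0, 1, 0, 1, 0, 1]  # a very different grouping

print(rand_index(run_1, run_2))  # prints 1.0: identical segments despite swapped labels
print(rand_index(run_1, run_3))
```

Because the measure works on pairs of records, a run whose clusters are merely renumbered scores as fully consistent, while a run with a low score against the others is a candidate chance fluke. scikit-learn's `sklearn.metrics.adjusted_rand_score` offers a chance-corrected variant of the same idea.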