### Channel Sales - R Analysis

 UA-60924200-1

Q.1 - Which line of distribution yields the highest source of revenue?  Is that true across all channels and regions?

Highest source of revenue (before imputing outliers)  – Channel – 1 , Region – 3 , Fresh - 29,28,269 MU . This is Not True across all channels and regions as

seen in Excel table and R Bar Charts below .

Also for Channel -1 [ All 3 Regions ] - Highest - FRESH - 40,15,717 MU

For   Channel -2 [ All 3 Regions ] - Highest - GROCERY - 23,17,845 MU

Highest source of revenue ( after imputing outliers)  – Channel – 1 , Region – 3 , Fresh - 21,74,260 MU . This is Not True across all channels and regions as

seen in Excel table and R Bar Charts below .

Also for Channel -1 [ All 3 Regions ] - Highest - FRESH - 31,04,927 MU

For  Channel -2 [ All 3 Regions ] - Highest - GROCERY -  14,41,451 MU  Q.2 - Is there reason to believe that the behaviour of different

lines of distribution is different in different regions? Across

different channels?

ANS - Yes as seen from the summary data presented above , the

Bar Plots , Box Plots and Histograms its clear that – there are

differences in different channels and regions.

Q.3 - Does there exist any particular region where

all distribution items are earning

higher revenue? Earning lower revenue?

ANS-3 - As seen from the summary stats above – there is no Channel and Region combination where “All distribution

items are earning a higher revenue”. Even though Channel -1 Region -2 is having the lowest Overall Sales – 7,19,150 MU ,

still the sales of FROZEN products within this

distribution ,  is – 1,60,861 MU

which is higher than Channel 2 – Regions 1 and 2 which have FROZEN sales = 46,514 and 29,271 respectively .

We impute outliers by capping values using “ifelse” at the Upper Whisker point of the Original data set . The reason we choose the Upper Whisker value

for Imputation :- The Upper Whisker is the “Greatest value within the sample excluding outliers” , this is why we choose the Upper Whisker , in certain text

the  Upper Whisker and MAX value are the same – but we choose upper whisker from package lessR Boxplot command and keep this as standard for

all measures within this project. The code and Boxplots for all other Capped variables can be seen here.

Q.4 - Can you identify any particular region channel combination where the Wholesale Supplier is doing better than anywhere else? Worse than anywhere else?

ANS -    Before Imputing the outliers the data showas that - Max Revenue -  Channel 1 , Region 3  – 57,42,077 MU and Min Revenue - Channel 1 , Region 2  –     7,19,150  MU

After imputing the outliers the data shows that - Max Revenue is still -  Channel 1 , Region 3  – 43,37,098 MU but the Min Revenue has changed to - Channel 2 , Region 1 –  5,14,603  MU

Q.5 - Is it true that there is any dependency between any items of distribution?

ANS - There are different Correlation values for the WCD Variables  as seen in the corrplot matrix here. Detergents_Paper and Grocery have the Highest correlation -  0.924 or 93%. After imputing outliers we still have a high correlation amongst - Detergents_Paper and Grocery – 0.685 or 68% . Thus if the Wholesaler wants to increase the sales of Groceries in a region - they should also focus on - Detergents_Paper and vice versa as these are showing High Correlations. We cannot asses dependency easily without conducting  logistic regression which is out of scope for this project.

Q6- Look at histograms and boxplots overall as well as across Regions, Channels and possibly Region*Channel and compare.?

ANS - We look at Boxplots for all Variables within the Group Channel-1 : Region -1 , Channel-1 : Region -2, C-1 : R-3, C-2:R-1,C3:R2 and C3:R3. We use filtering to conduct a pivot  analysis within the various grouping of variables .

Q7 - Are there any outliers? If outliers exist, take a decision regarding whether to eliminate some points from analysis, or impute outlier values, if so with what logic.

Be prepared to explain the decision. Provide basic summary statistics and reconcile those to the histograms and boxplots. If any outlier is rejected or imputed,

summary statistics in both scenarios are to be compared to understand whether the differences are acceptable or not.

ANS – Outliers have been imputed – summary stats and graphs are provided as comparison for the same . LINK

# - ________________R Code as seen below _______________

R version 3.1.3 (2015-03-09) -- "Smooth Sidewalk"

> library(plyr)

> library(plyr); library(dplyr)

# - Summary stats - using function aggregate(x,by,FUN) – where x is our dataset – WCD - Wholesale+customers+data.csv , by is the list by which we want to Group the aggregated data -, FUN is Sum as we want to add all variable values .

 > aggCH_RG <-aggregate(WCD,by=list(Group.Channel=Channel,Group.Region=Region), FUN=sum, na.rm=TRUE) > aggCH_RG   Group.Channel Group.Region Channel Region   Fresh    Milk 1             1            1      59     59  761233  228342 2             2            1      36     18   93600  194112 3             1            2      28     56  326215   64519 4             2            2      38     38  138506  174625 5             1            3     211    633 2928269  735753 6             2            3     210    315 1032308 1153006   Grocery Frozen Detergents_Paper Delicassen 1  237542 184512            56081      70632 2  332495  46514           148055      33695 3  123074 160861            13516      30965 4  310200  29271           159795      23541 5  820101 771606           165990     320358 6 1675150 158886           724420     191752

BarPlot - Channel -1 - Region -1 , 2 and 3                                           BarPlot - Channel -2 - Region -1 , 2 and 3  ## - We also create BarPlots with ggplot - for which we filter the Imputed data I_W and create a Long data.frame - Long2_VAR_C1_R1

## - Code as seen below - the ggplot Barplot seen below is for - Channel -1 Region -1 , plotted with Imputed Outliers data.frame - Long2_VAR_C1_R1 # - Summary tables similar in structure to the Pivot Table from Excel are created in R -  code as shown below – we first create grouped_df [ a grouped data frame] using the filter

command . This grouped_df provides us sums for variables within a particular Channel and Region combination - like a Pivot Table in Excel would do .

# - We then take these values and assign them to a matrix - which is further assigned to a table object and the Barplot is created from the table object .

> library(plyr); library(dplyr)

# --------- From Summary - Summ_Ch_Rng_ create a Table to plot BARPLOTS – these BARPLOTS display Sum Values of Variables - grouped basis Chanel and Region .

# --------- Kindly note that the values on the scale of the Y AXIS have been divided by 100 so that the Barplot could be shown within manageable proportions.

> Summ_CH1_Rng1= filter(WCD,Channel %in% c("1"),Region %in% c("1"))

> Summ_CH1_Rng1=group_by(Summ_CH1_Rng1,Channel,Region)

> Summ_CH1_Rng1=summarise(Summ_CH1_Rng1,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper),

sum_Delicacy=sum(Delicassen))

>

> Summ_CH1_Rng1

Source: local data frame [1 x 8]

Groups: Channel

Channel Region sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy

1       1      1  761233  228342   237542   184512      56081        70632

C1R1<-matrix(c(761233/100,228342/100,237542/100,184512/100,56081/100,70632/100),ncol=6)

> colnames <- c("sum_fsh","SUM_2","Sum_3","Sum_4","Sum_5","Sum_6")

> rownames <- c("Channel-1,Range-1")

> C1_R1<- as.table(C1R1)

>

> C1_R1

A       B       C       D       E       F

A 7612.33 2283.42 2375.42 1845.12  560.81  706.32

> colnames(C1_R1, do.NULL = FALSE)

 "A" "B" "C" "D" "E" "F"

> colnames(C1_R1) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")

> rownames(C1_R1) <- c("Channel-1,Region-1:-")

>

> C1_R1

Sum_Fresh Sum_Milk Sum_Grocery Sum_Frozen

Channel-1,Region-1:-   7612.33  2283.42     2375.42    1845.12

Sum_Detergents_Paper Sum_Delicassen

Channel-1,Region-1:-               560.81         706.32

#-------------

 Summ_CH1_Rng2= filter(WCD,Channel %in% c("1"),Region %in% c("2"))Summ_CH1_Rng2=group_by(Summ_CH1_Rng2,Channel,Region)Summ_CH1_Rng2=summarise(Summ_CH1_Rng2,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper),sum_Delicacy=sum(Delicassen))View(Summ_CH1_Rng2)> head(Summ_CH1_Rng2)  sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy1  326215   64519   123074   160861      13516        30965> C1R2<-matrix(c(326215/100,64519/100,123074/100,160861/100,13516/100,30965/100),ncol=6)> colnames <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")> rownames <- c("Channel-1,Region-2")> C1_R2<- as.table(C1R2)> colnames(C1_R2, do.NULL = FALSE) "Sum_Fresh"            "Sum_Milk"             "Sum_Grocery"          "Sum_Frozen"          "Sum_Detergents_Paper" "Sum_Delicassen"     > colnames(C1_R2) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")> rownames(C1_R2) <- c("Channel-1,Region-2:-")>> C1_R2                     Sum_Fresh Sum_Milk Sum_Grocery Sum_FrozenChannel-1,Region-2:-   3262.15   645.19     1230.74    1608.61                     Sum_Detergents_Paper Sum_DelicassenChannel-1,Region-2:-               135.16         309.65

#-------------

> Summ_CH1_Rng3= filter(WCD,Channel %in% c("1"),Region %in% c("3"))

> Summ_CH1_Rng3=group_by(Summ_CH1_Rng3,Channel,Region)

> Summ_CH1_Rng3=summarise(Summ_CH1_Rng3,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper),sum_Delicacy=sum(Delicassen))

sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy

1 2928269  735753   820101   771606     165990       320358

> C1R3<-matrix(c(2928269/100, 735753/100, 820101/100, 771606/100, 165990/100, 320358/100),ncol=6)

> colnames <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")

> rownames <- c("Channel-1,Region-3")

> C1_R3<- as.table(C1R3)

 > colnames(C1_R3, do.NULL = FALSE)  "Sum_Fresh"            "Sum_Milk"             "Sum_Grocery"          "Sum_Frozen"           "Sum_Detergents_Paper" "Sum_Delicassen"      > colnames(C1_R3) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery", "Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen") > rownames(C1_R3) <- c("Channel-1,Region-3:-") > > C1_R3                      Sum_Fresh Sum_Milk Sum_Grocery Sum_Frozen Sum_Detergents_Paper Channel-1,Region-3:-  29282.69  7357.53     8201.01    7716.06              1659.90                      Sum_Delicassen Channel-1,Region-3:-        3203.58

# ---------# ---------# ---------# ---------# ---------# ---------

Summ_CH2_Rng1= filter(WCD,Channel %in% c("2"),Region %in% c("1"))

> Summ_CH2_Rng1=group_by(Summ_CH2_Rng1,Channel,Region)

> Summ_CH2_Rng1=summarise(Summ_CH2_Rng1,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper),sum_Delicacy=sum(Delicassen))

>

sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy

1   93600  194112   332495    46514     148055        33695

>

C2R1<-matrix(c(93600/100, 194112/100, 332495/100, 46514/100, 148055/100, 33695/100),ncol=6)

colnames <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")

rownames <- c("Channel-2,Region-1")

C2_R1<- as.table(C2R1)

colnames(C2_R1, do.NULL = FALSE)

colnames(C2_R1) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")

rownames(C2_R1) <- ("Channel-2,Region-1:-")

> C2_R1

Sum_Fresh Sum_Milk Sum_Grocery Sum_Frozen

Channel-2,Region-1:-    936.00  1941.12     3324.95     465.14

Sum_Detergents_Paper Sum_Delicassen

Channel-2,Region-1:-              1480.55         336.95

# ----------

Summ_CH2_Rng2= filter(WCD,Channel %in% c("2"),Region %in% c("2"))

> Summ_CH2_Rng2=group_by(Summ_CH2_Rng2,Channel,Region)

> Summ_CH2_Rng2=summarise(Summ_CH2_Rng2,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper),sum_Delicacy=sum(Delicassen))

Source: local data frame [1 x 8]

Groups: Channel

Channel Region sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy

1       2      2  138506  174625   310200    29271     159795        23541

> C2R2<-matrix(c(138506/100, 174625/100, 310200/100, 29271/100, 159795/100, 23541/100),ncol=6)

> colnames <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")

> rownames <- c("Channel-2,Region-2")

> C2_R2<- as.table(C2R2)

> colnames(C2_R2, do.NULL = FALSE)

 "A" "B" "C" "D" "E" "F"

> colnames(C2_R2) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")

> rownames(C2_R2) <- ("Channel-2,Region-2:-")

> C2_R2

Sum_Fresh Sum_Milk Sum_Grocery Sum_Frozen

Channel-2,Region-2:-   1385.06  1746.25     3102.00     292.71

Sum_Detergents_Paper Sum_Delicassen

Channel-2,Region-2:-              1597.95         235.41

# ----------

Summ_CH2_Rng3= filter(WCD,Channel %in% c("2"),Region %in% c("3"))

Summ_CH2_Rng3=group_by(Summ_CH2_Rng3,Channel,Region)

Summ_CH2_Rng3=summarise(Summ_CH2_Rng3,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper),sum_Delicacy=sum(Delicassen))

> Summ_CH2_Rng3

Source: local data frame [1 x 8]

Groups: Channel

Channel Region sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy

1       2      3 1032308 1153006  1675150   158886     724420       191752

> C2R3<-matrix(c(1032308/100, 1153006/100, 1675150/100, 158886/100, 724420/100, 191752/100),ncol=6)

> colnames <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")

> rownames <- c("Channel-2,Region-3")

> C2_R3<- as.table(C2R3)

> colnames(C2_R3, do.NULL = FALSE)

 "A" "B" "C" "D" "E" "F"

> colnames(C2_R3) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")

> rownames(C2_R3) <- ("Channel-2,Region-3:-")

> colnames(C2_R3) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")

> rownames(C2_R3) <- ("Channel-2,Region-3:-")

> C2_R3

Sum_Fresh Sum_Milk Sum_Grocery Sum_Frozen

Channel-2,Region-3:-  10323.08 11530.06    16751.50    1588.86

Sum_Detergents_Paper Sum_Delicassen

Channel-2,Region-3:-              7244.20        1917.52

# ----------

Tried creating PivotBar Charts using package PivotBarchart from GitHub - https://github.com/smartinsightsfromdata/rpivotTable

But the same does not render like the Excel Bar Charts created from Pivot Tables - thus creating BarPlots in R from Summary data created using filter as seen above ...

# ---------- certain pics to show what vcan be done by rpivotTable()

Channel -1 , Region -1 - Fresh and Milk - using rendererName = "BarchartC3"   As seen here in the excel screen capture the Channel -1 Region -1 - Max value for Fresh is - 56083 , the same is also plotted by the rpivotTable() as seen below -

Channel -1 Region -1 - Max value for Fresh is - 56083 as plotted below by - rpivotTable() 