Q.1 - Which line of distribution yields
the highest source of revenue? Is that
true across all channels and regions? Highest source of revenue (before imputing outliers) – Channel – 1 , Region – 3 , Fresh - 29,28,269
MU . This is Not True across all channels and regions as
seen in Excel table and R Bar Charts below . Also for Channel -1 [ All 3 Regions ] - Highest - FRESH - 40,15,717 MU For Channel -2 [ All 3 Regions ] - Highest - GROCERY - 23,17,845 MU
Highest source of revenue ( after imputing outliers) – Channel – 1 , Region – 3 , Fresh - 21,74,260 MU . This is Not True across all channels and regions as seen in Excel table and R Bar Charts below . Also for Channel -1 [ All 3 Regions ] - Highest - FRESH - 31,04,927 MU For Channel -2 [ All 3 Regions ] - Highest - GROCERY - 14,41,451 MU

Q.2 - Is there reason to believe that the
behaviour of different lines of distribution is different in different regions?
Across different channels? ANS - Yes as seen from the summary data
presented above , the Bar
Plots , Box
Plots and Histograms
its clear that – there are differences in different channels and regions.
Q.3 - Does there exist any particular
region where all distribution items are earning higher revenue? Earning lower
revenue? ANS-3 - As seen from the summary stats above
– there is no Channel and Region
combination where “All distribution items are earning a higher revenue”. Even though Channel -1 Region -2 is
having the lowest Overall Sales – 7,19,150 MU , still the sales of FROZEN
products within this distribution , is –
1,60,861 MU which is higher than
Channel 2 – Regions 1 and 2 which have FROZEN sales = 46,514 and 29,271 respectively
. We
impute outliers by capping
values using “ifelse”
at the Upper
Whisker point of the Original data set . The reason we choose the Upper
Whisker value for Imputation :- The Upper Whisker is the “Greatest value within
the sample excluding outliers” , this is why we choose the Upper Whisker , in
certain text the Upper Whisker and
MAX value are the same – but we choose upper whisker from package
lessR Boxplot command and keep this as standard for all measures within
this project. The code and Boxplots for all other Capped variables can be seen
here.
Q.4 - Can you identify any particular
region channel combination where the Wholesale Supplier is doing better than
anywhere else? Worse than anywhere else? ANS - Before Imputing the outliers the data showas that - Max Revenue - Channel 1 , Region
3 – 57,42,077 MU and Min Revenue - Channel 1 , Region 2 –
7,19,150 MU After imputing the outliers the data shows that - Max Revenue is still - Channel 1 , Region 3 – 43,37,098 MU but the Min Revenue has changed to - Channel 2 , Region 1 – 5,14,603 MU
Q.5 - Is it true that there is any
dependency between any items of distribution? ANS - There are different Correlation
values for the WCD Variables as seen in
the corrplot matrix here. Detergents_Paper
and Grocery have the Highest correlation -
0.924 or 93%. After imputing outliers we still have a high correlation
amongst - Detergents_Paper
and Grocery – 0.685 or 68% . Thus if the Wholesaler wants to increase the
sales of Groceries in a region - they should also focus on - Detergents_Paper
and vice versa as these are showing High Correlations. We cannot asses dependency easily without
conducting logistic regression which is
out of scope for this project.
Q6- Look at histograms and boxplots overall as well as across
Regions, Channels and possibly Region*Channel and compare.? ANS - We look at Boxplots for all Variables within the Group
Channel-1 : Region -1 , Channel-1 : Region -2, C-1 : R-3, C-2:R-1,C3:R2 and
C3:R3. We use filtering to conduct a pivot analysis within the
various grouping of variables .
Q7 - Are there
any outliers? If outliers exist, take a decision regarding whether to eliminate
some points from analysis, or impute outlier values, if so with what logic. Be
prepared to explain the decision. Provide
basic summary statistics and reconcile those to the histograms and boxplots. If
any outlier is rejected or imputed, summary statistics in both scenarios are to
be compared to understand whether the differences are acceptable or not. ANS – Outliers have been imputed – summary stats and graphs
are provided as comparison for the same . LINK
# - ________________R Code as seen below _______________ R version 3.1.3 (2015-03-09) -- "Smooth Sidewalk" > library(plyr) > library(plyr); library(dplyr) # - Summary
stats - using function aggregate(x,by,FUN) – where x
is our dataset – WCD - Wholesale+customers+data.csv , by is the list by which we want to Group the aggregated data -, FUN is Sum as we want to add all variable values .
> aggCH_RG
<-aggregate(WCD,by=list(Group.Channel=Channel,Group.Region=Region),
FUN=sum, na.rm=TRUE)
> aggCH_RG
Group.Channel Group.Region Channel
Region Fresh Milk
1 1 1 59
59 761233 228342
2 2 1
36 18 93600
194112
3 1 2 28
56 326215 64519
4 2 2 38
38 138506 174625
5 1 3 211
633 2928269 735753
6 2 3 210
315 1032308 1153006
Grocery Frozen Detergents_Paper
Delicassen
1 237542 184512 56081 70632
2 332495
46514 148055 33695
3 123074 160861 13516 30965
4 310200
29271 159795 23541
5 820101
771606 165990 320358
6 1675150 158886 724420 191752
|
|
|
BarPlot - Channel -1 - Region -1 , 2 and 3 BarPlot - Channel -2 - Region -1 , 2 and 3 
## - We also create BarPlots with ggplot - for which we filter the Imputed data I_W and create a Long data.frame - Long2_VAR_C1_R1 ## - Code as seen below - the ggplot Barplot seen below is for - Channel -1 Region -1 , plotted with Imputed Outliers data.frame - Long2_VAR_C1_R1
# - Summary tables similar in structure to the Pivot Table from Excel are created in R - code as shown below – we first create grouped_df [ a grouped data frame] using the filter command . This grouped_df provides us sums for variables within a particular Channel and Region combination - like a Pivot Table in Excel would do . # - We then take these values and assign them to a matrix - which is further assigned to a table object and the Barplot is created from the table object .
> library(plyr); library(dplyr)
# ---------- From Summary - Summ_Ch_Rng_ create a Table
to plot BARPLOTS – these BARPLOTS display Sum Values of Variables - grouped basis Chanel and
Region .
# ---------- Kindly note that the values on the scale of the Y AXIS have been divided by 100 so that the Barplot could be shown within manageable proportions.
> Summ_CH1_Rng1= filter(WCD,Channel %in% c("1"),Region %in% c("1")) > Summ_CH1_Rng1=group_by(Summ_CH1_Rng1,Channel,Region) > Summ_CH1_Rng1=summarise(Summ_CH1_Rng1,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper), sum_Delicacy=sum(Delicassen)) > > Summ_CH1_Rng1 Source: local data frame [1 x 8] Groups: Channel Channel Region sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy 1 1 1 761233 228342 237542 184512 56081 70632
C1R1<-matrix(c(761233/100,228342/100,237542/100,184512/100,56081/100,70632/100),ncol=6) > colnames <-
c("sum_fsh","SUM_2","Sum_3","Sum_4","Sum_5","Sum_6") > rownames <- c("Channel-1,Range-1") > C1_R1<- as.table(C1R1) > > C1_R1 A B
C D E
F A 7612.33 2283.42 2375.42 1845.12 560.81
706.32 > colnames(C1_R1, do.NULL = FALSE) [1] "A" "B" "C" "D"
"E" "F" > colnames(C1_R1) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen") > rownames(C1_R1) <- c("Channel-1,Region-1:-") > > C1_R1
Sum_Fresh Sum_Milk Sum_Grocery Sum_Frozen Channel-1,Region-1:-
7612.33 2283.42 2375.42
1845.12
Sum_Detergents_Paper Sum_Delicassen
Channel-1,Region-1:-
560.81 706.32
#-------------
Summ_CH1_Rng2= filter(WCD,Channel %in% c("1"),Region %in% c("2")) Summ_CH1_Rng2=group_by(Summ_CH1_Rng2,Channel,Region) Summ_CH1_Rng2=summarise(Summ_CH1_Rng2,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper),sum_Delicacy=sum(Delicassen))
View(Summ_CH1_Rng2) > head(Summ_CH1_Rng2) sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy 1 326215 64519 123074 160861 13516 30965
> C1R2<-matrix(c(326215/100,64519/100,123074/100,160861/100,13516/100,30965/100),ncol=6) > colnames <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen") > rownames <- c("Channel-1,Region-2") > C1_R2<- as.table(C1R2) > colnames(C1_R2, do.NULL = FALSE) [1] "Sum_Fresh" "Sum_Milk" [3] "Sum_Grocery" "Sum_Frozen" [5] "Sum_Detergents_Paper" "Sum_Delicassen" > colnames(C1_R2) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen") > rownames(C1_R2) <- c("Channel-1,Region-2:-") > > C1_R2 Sum_Fresh Sum_Milk Sum_Grocery Sum_Frozen Channel-1,Region-2:- 3262.15 645.19 1230.74 1608.61 Sum_Detergents_Paper Sum_Delicassen Channel-1,Region-2:- 135.16 309.65
|
|
#------------- > Summ_CH1_Rng3= filter(WCD,Channel %in% c("1"),Region %in% c("3")) > Summ_CH1_Rng3=group_by(Summ_CH1_Rng3,Channel,Region) > Summ_CH1_Rng3=summarise(Summ_CH1_Rng3,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper),sum_Delicacy=sum(Delicassen))
> head(Summ_CH1_Rng3) sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy 1 2928269 735753 820101 771606 165990 320358
> C1R3<-matrix(c(2928269/100, 735753/100, 820101/100, 771606/100, 165990/100, 320358/100),ncol=6) > colnames <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen") > rownames <- c("Channel-1,Region-3") > C1_R3<- as.table(C1R3)
> colnames(C1_R3, do.NULL = FALSE)
[1] "Sum_Fresh" "Sum_Milk" "Sum_Grocery" "Sum_Frozen"
[5] "Sum_Detergents_Paper"
"Sum_Delicassen"
> colnames(C1_R3) <-
c("Sum_Fresh","Sum_Milk","Sum_Grocery",
"Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")
> rownames(C1_R3) <-
c("Channel-1,Region-3:-")
>
> C1_R3
Sum_Fresh Sum_Milk Sum_Grocery Sum_Frozen Sum_Detergents_Paper
Channel-1,Region-3:-
29282.69 7357.53 8201.01 7716.06 1659.90
Sum_Delicassen
Channel-1,Region-3:-
3203.58
|
|
| # ---------- # ---------- # ---------- # ---------- # ---------- # ----------
Summ_CH2_Rng1= filter(WCD,Channel %in% c("2"),Region %in% c("1")) > Summ_CH2_Rng1=group_by(Summ_CH2_Rng1,Channel,Region) > Summ_CH2_Rng1=summarise(Summ_CH2_Rng1,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper),sum_Delicacy=sum(Delicassen)) > > head(Summ_CH2_Rng1) sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy 1 93600 194112 332495 46514 148055 33695 > C2R1<-matrix(c(93600/100, 194112/100, 332495/100, 46514/100, 148055/100, 33695/100),ncol=6) colnames <-
c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen") rownames <- c("Channel-2,Region-1") C2_R1<- as.table(C2R1) colnames(C2_R1, do.NULL = FALSE) colnames(C2_R1) <-
c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen")
rownames(C2_R1) <- ("Channel-2,Region-1:-") > > C2_R1 Sum_Fresh Sum_Milk Sum_Grocery Sum_Frozen Channel-2,Region-1:- 936.00 1941.12 3324.95 465.14 Sum_Detergents_Paper Sum_Delicassen Channel-2,Region-1:- 1480.55 336.95 > # ---------- Summ_CH2_Rng2= filter(WCD,Channel %in% c("2"),Region %in% c("2")) > Summ_CH2_Rng2=group_by(Summ_CH2_Rng2,Channel,Region) > Summ_CH2_Rng2=summarise(Summ_CH2_Rng2,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper),sum_Delicacy=sum(Delicassen)) > head(Summ_CH2_Rng2) Source: local data frame [1 x 8] Groups: Channel
Channel Region sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy 1 2 2 138506 174625 310200 29271 159795 23541 > > > C2R2<-matrix(c(138506/100, 174625/100, 310200/100, 29271/100, 159795/100, 23541/100),ncol=6) > colnames <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen") > rownames <- c("Channel-2,Region-2") > C2_R2<- as.table(C2R2) > > colnames(C2_R2, do.NULL = FALSE) [1] "A" "B" "C" "D" "E" "F" > colnames(C2_R2) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen") > rownames(C2_R2) <- ("Channel-2,Region-2:-") > > C2_R2 Sum_Fresh Sum_Milk Sum_Grocery Sum_Frozen Channel-2,Region-2:- 1385.06 1746.25 3102.00 292.71 Sum_Detergents_Paper Sum_Delicassen Channel-2,Region-2:- 1597.95 235.41 >
# ----------
Summ_CH2_Rng3= filter(WCD,Channel %in% c("2"),Region %in% c("3")) Summ_CH2_Rng3=group_by(Summ_CH2_Rng3,Channel,Region) Summ_CH2_Rng3=summarise(Summ_CH2_Rng3,sum_fsh=sum(Fresh),sum_mlk=sum(Milk),sum_groc=sum(Grocery),sum_froz=sum(Frozen),sum_detppr=sum(Detergents_Paper),sum_Delicacy=sum(Delicassen))
> Summ_CH2_Rng3 Source: local data frame [1 x 8] Groups: Channel
Channel Region sum_fsh sum_mlk sum_groc sum_froz sum_detppr sum_Delicacy 1 2 3 1032308 1153006 1675150 158886 724420 191752
> C2R3<-matrix(c(1032308/100, 1153006/100, 1675150/100, 158886/100, 724420/100, 191752/100),ncol=6) > colnames <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen") > rownames <- c("Channel-2,Region-3") > C2_R3<- as.table(C2R3) > > colnames(C2_R3, do.NULL = FALSE) [1] "A" "B" "C" "D" "E" "F" > colnames(C2_R3) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen") > rownames(C2_R3) <- ("Channel-2,Region-3:-") > > > colnames(C2_R3) <- c("Sum_Fresh","Sum_Milk","Sum_Grocery","Sum_Frozen","Sum_Detergents_Paper","Sum_Delicassen") > rownames(C2_R3) <- ("Channel-2,Region-3:-") > > C2_R3 Sum_Fresh Sum_Milk Sum_Grocery Sum_Frozen Channel-2,Region-3:- 10323.08 11530.06 16751.50 1588.86 Sum_Detergents_Paper Sum_Delicassen Channel-2,Region-3:- 7244.20 1917.52
> # ---------- Tried creating PivotBar Charts using package PivotBarchart from GitHub - https://github.com/smartinsightsfromdata/rpivotTable But the same does not render like the Excel Bar Charts created from Pivot Tables - thus creating BarPlots in R from Summary data created using filter as seen above ... # ---------- certain pics to show what vcan be done by rpivotTable()
Channel -1 , Region -1 - Fresh and Milk - using rendererName = "BarchartC3"
As seen here in the excel screen capture the Channel -1 Region -1 - Max value for Fresh is - 56083 , the same is also plotted by the rpivotTable() as seen below -
Channel -1 Region -1 - Max value for Fresh is - 56083 as plotted below by - rpivotTable()
|
|
|