
Computational Statistics&Data Analysis38(2002)533–541
www.elsevier.com/locate/csda Data mining of association structures to model consumerbehaviour
Paolo Giudici∗,Gianluca Passerone
Dipartimento di Economia Politica e Metodi Quantitativi,Universita di Pavio,Via San Felice5,
I-27100,Pavia,Italy
Abstract
We describe how statistical association models and,speciÿcally,log linear and graphical models, can be usefully employed to study consumer behaviours.We describe some methodological problems related to the implementation of discrete graphical models for market basket analysis data.In particular, we shall discuss model selection procedures.c 2002Elsevier Science B.V.All rights reserved.
Keywords:Bayesian model selection;Graphical log linear models;Market basket analysis;Markov chain monte carlo methods;Odds ratios
1.Introduction
The aim of this paperis to illustr ate the concepts and techniques involved in an actual data mining analysis,which concerns the study of association structures to model consumer behaviours.More speciÿcally,we consider a model-based approach to the marketing technique known as market basket analysis.In this paper weÿrst illustrate the application considered.Then,in Sections3and4,we analyse the avail-able data from a frequentist viewpoint.In Section3,we illustrate an exploratory approach,based on the calculation of the pairwise marginal odds ratios.In Section4, we describe a more structured approach,based on graphical model selection meth-ods.We show that both approaches,especially the second,are not suitable for the analysis of the very sparse contingency table considered in the application.In Section5,we propose an alternative,Bayesian,approach to select association
∗Corresponding author.Tel.:+39-382-386224;fax:+39-382-304226.
E-mail address:giudici@unipv.it(P.Giudici).
0167-9473/02/$-see front matter c 2002Elsevier Science B.V.All rights reserved.
PII:S0167-9473(01)00077-9534P.Giudici,G.Passerone/Computational Statistics&Data Analysis38(2002)533–541 structures in market basket analysis,apply it to the data,and compare the results with the frequentist ones.Finally,Section6,contains some concluding remarks.
2.Market basket analysis
Market basket analysis(MBA)is an applied statistical methodology aimed at iden-tifying associations between consumers buying choices of di erent products,usually inside a speciÿc unit,such as a supermarket.
Within such a unit,the data analysed in an MBA usually consists of all buying transactions carried out by the clients in a certain unit of time.The aim of the analysis is to understand the association structure between the sales of the di erent products available.Once the associations are found,it is possible to plan marketing policies in a betterway.Forinstance,if two pr oducts tur n out to be heavily associated,they should be allocated next to each other in the layout.Furthermore,it will be su cient to put only one of them on promotion,to increase sales of both.
Our available data were kindly provided by AC Nielsen,Italy.It consists of the sales of26categories of products(grouping di erent brands for the same products)in 52periods of1week each,for the year1997,in one large Italian supermarket(about 12,000squared meters),localised in the region of Piemonte(Italy).All categories are of food type and have been chosen among those most sensible to promotional e ects.Data are summarised in26binary variables indicating whether,in each week, a certain category of products was sold above or below the median of the year.For clarity,hereafter we call any one of the26variables as“product”,even though they really describe categories of products.
In other words,the data we analyse can be summarised in a52×26data matrix, with the52rows indicating the aggregate weekly transactions and the26columns describing whether each product was sold,in each particular week,more or less than the median.
MBA aims at identifying recurrent rules within transactions.Two main types of rules are typically searched:association rules,which are the main topic of the present paper,and sequence rules.A correct identiÿcation of such rules depends on the availability of a considerable amount of detailed information that allows to identify customers and follow their behaviours in time.
We remark that our data are much simpler and less expensive to collect than what is analysed in standard MBA.In fact,the typical data consist of all transaction data,in a given period.Our available data allow to obtain results comparable to the standard one and,at the same time,is much less expensive to obtain.It thus allows a more easily implementable procedure than standard MBA.On the other hand,we remark that the absence of individual data forbids to derive customer-speciÿc con-clusions,such as the classiÿcation of consumers in homogeneous groups,according to theirbuying behaviour.
Besides the data on the buys of the26considered products,we have data on the di erent types of promotions applied in di erent weeks in the considered supermar-ket.The absence or presence of a promotion is indicated by a binary variable forP.Giudici,G.Passerone/Computational Statistics&Data Analysis38(2002)533–541535 each type of promotion.There are eight types of promotions available;however,only three show a sensible e ect in changing sales:display(indicating that a product is put on a special location),gift(indicating that a present is associated to the product)and price reduction(indicating that the product is sold at a price lower than the usual).
3.Exploratory model search
In order to study the association structure between the product transactions,we need to analyse a contingency table with a total of226possible levels.Clearly,there will be a tremendous amount of empty cells and,consequently,a full analysis of the table is quite di cult.In order to establish solid conclusions,we have decided to proceed by exploratorily considering all the3252×2tables obtained by considering each possible pairof pr oducts.Foreach table we considerthe mar ginal odds r atio. Foreach odds r atio we have calculated,besides the maximum likelihood estimate, an approximate95%conÿdence interval(for the derivation see e.g.Agresti,1990). We have then decided to declare two variables as being signiÿcatively associated, if a value of unity of the odds ratio,corresponding to marginal independence,was outside the conÿdence interval.
In order to pictorially represent all found marginal associations,we have then proceeded in building a graph whose nodes represent the product binary variables and inserting an edge between any pair of such variables if the two were signiÿcatively associated.We remark that such a graph is not a conditional independence graph, such as those employed in graphical modelling,but a marginal independence graph. Such a graph is depicted in Fig.1.
Note that the obtained graph contains36edges,a relatively low complexity,given the nature of the data,but still di cult to interpret.Therefore,in order to obtain more cautious conclusions,and to better interpret the association structure,we have simpliÿed the previous graph by including edges between variables only if the odds ratio was signiÿcantly larger than5,in absolute value.This choice,which obvi-ously reduces model complexity,leads to obtainÿve distinct clusters of variables, completely disconnected from each other.
This allows to obtain more valid inferential conclusions within each cluster,as they are marginally independent.
For instance,we can,for each cluster,attach the promotion variables to the dataset and check whether promotions signiÿcantly a ect sales.For lack of space,we report here such results for one cluster only,depicted in Fig.2.
From Fig.2note,for instance,that the promotion variable CPZO,indicating a reduction in price,is associated directly to both rice and cookies,but only indi-rectly to milk.This suggests that,to optimise promotions,only one between rice and cookies should be put on promotion.This will determine,as all three products are positively associated,an increase of sales of all of them,in a measure that can be estimated by the found odds ratios.On the other hand,a display promotion should be done on rice,which is the only product directly associated,and this should determine an increase in sale of all three products.536P.Giudici,G.Passerone/Computational Statistics&Data Analysis38(2002)533–541
Fig.1.The selected explorative graph.
Fig.2.One clusterof the selected gr aph.
4.Graphical model search
In this section,we analyse the previous data from a more reÿned statistical model-based viewpoint.We base our analysis on graphical modelling theory(see e.g.Lauritzen,1996,or Whittaker,1990).
A graph,g=(V;E)is a pairof sets:aÿnite set of ver tices,V,and aÿnite set of edges that join pairs of the vertices.Here,given the nature of the data,we only consider undirected edges between vertices and,consequently,undirected graphical models.
A graphical model is a probability distribution,P,which is Markov with respect to a graph g.Concerning the random variables considered,here we shall concentrate on graphs representing only discrete random variables.Our object of inference willP.Giudici,G.Passerone/Computational Statistics&Data Analysis38(2002)533–541537 be a vector of discrete random variables,whose counts are arranged in a contingency table.LetÂ(g)indicate the vectorof the unknown cell pr obabilities. Considera complete sample fr om P.Our objective of interest is structural learning. Statistically this means to choose,on the basis of the observed sample,the most likely graph g(or,equivalently,the most likely graphical model).In data mining,there is typically little a priori knowledge,so one may want to compare all possible graphs for a given set of random variables.
Let L(g)be the likelihood function of a graph g,having observed the data.To carry out model selection,we need to attach a score to each considered model.The classical frequentist score is obtained by maximising the likelihood inÂ(g),forany given graph g.Selection is then typically carried out by stepwise comparisons of models,comparing the corresponding maximised likelihoods.
Given the infeasibility of a direct analyis of the226contingency table available,we have started our analysis from theÿve groups identiÿed previously.For each of them we have followed a stepwise model selection procedure to identify the bestÿtting model for the group.In order to attain this objective,we have considered graphical log linear models,namely log linear models whose generators are the cliques of the graph,and have based the stepwise selection on the likelihood-ratio test.The results obtained do not di ersubstantially with those obtained in the pr evious section.The di erence is that we now consider,in each group,a graphical model based on pair-wise conditional associations and not on pairwise marginal associations.Therefore, bigger contingency tables are analysed,and this will lead to more conservative tests, in favourof the hypotheses of conditional independence.
In fact,the analysis of this very sparse dataset,using discrete graphical models over the6previously found clusters,has led to a more complex model,containing 44signiÿcant edges,9of which describe negative associations.
However,such an analysis is not satisfactory,as it is based on the starting as-sumption of independence between clusters of variables.For space purposes,we have therefore decided not to report more details of the analysis,which can however be found in Passerone and Giudici(2000).On the other hand,in order to jointly analyse all variables,in the next Section,we consider a Bayesian analysis.
5.MCMC Bayesian model search
We now considera B ayesian MB A which,as we shall show,tur ns out to be quite advantageous.Our discussion here is based on the results recently obtained by Giudici and Castelo(2001),to which we r eferforfur therdetails and discussion. The problem with employing classical scores for model selection in data mining is that,as we have already observed,when a large number of variables are considered, stepwise procedures are often very unstable.Furthermore,in data mining there is typically little subject-matter knowledge on which models are substantially important, it is advisable to report conclusions from more than one model.Hence,a model averaging procedure is needed,and classical methods do not provide an easy solution to this.538P.Giudici,G.Passerone/Computational Statistics&Data Analysis38(2002)533–541
The Bayesian approach provides a solution to this problem,as it is based on the comparison of probabilities.The Bayesian model score is a probability,obtained by applying Bayes’theorem to the marginal likelihood of a model.The latter is obtained by integrating the likelihood over all admissible values ofÂ(g).
We remark that another advantage of the Bayesian approach is that it can be equally applied to very small datasets,such as the sparse contingency table considered here.
On the negative side,the Bayesian approach needs to specify prior distributions. As a prior distribution on the unknown cell probabilities we propose to assign a Dirichlet distribution for each node,conditionally on the conÿgurations of its parents, as described in Heckerman et al.(1995),and typically done in the probabilistic expert systems literature.On the other hand,as a prior condition on the model space,we have taken a uniform distribution over the number of considered graphs. Therefore,the Bayesian model score can be simply obtained by normalising the marginal likelihood.
A further problem with the Bayesian approach is the need to solve the highly dimensional integrals involved in the derivation of the model scores.Furthermore, forhigh-dimensional gr aphical modes,such as those we considerher e,the set of all possible models is large,and a full comparison of all the posterior probabilities associated to the competing models becomes infeasible.
The above problems can be solved,at least approximately,by means of Markov Chain Monte Carlo methods(MCMC,for a review see e.g.Brooks,1998).In partic-ular,for model search purposes,a successful MCMC algorithm is the Markov Chain Monte Carlo composition algorithm(MC3),proposed by Madigan and York(1995), that will be employed here.
Considerthe application of MCMC model sear ch to MB A.Ouraim is toÿnd the most likely association structure between the transactions concerning the26products considered.The number of all possible models is very large,that is,about2325. Therefore,given the very large number of considered models,in all of our experi-ments on this dataset we have run a long Markov chain,of a length of n=1,000,000 iterations,in order to achieve practical convergence.
To assess convergence,we haveÿrst entertained several graphical diagnostics, based on monitoring the number of edges present in the graph.The results ob-tained indicate a good degree of convergence of the output.The number of edges present seem to converge around the value of32,slightly less than in the frequentist case.
We remark that the MCMC bayesian model search is done directly on the226 contingency table.In fact,for each considered edge of the graphical model,we can aproximately evaluate its posterior probability of being present in the graph, which gives an important measure of conditional association.Our results show that there are a limited number of edges,out of the325possible,which have a high probability of being present.For instance,if we set a threshold at0.7,only13edges have a higher probability.Note that,according to the Bayesian approach,we can interpret the results in a much clearer way;for instance,by changing the value of the threshold we can obtain graphs of di erent complexities.This bears similarities
P.Giudici,G.Passerone/Computational Statistics&Data Analysis38(2002)533–541539
Fig.3.The Bayesian graph selected.
with what was done in the exploratory approach;however,using a statistically more sound procedure.
In order to pictorially represent such results,and compare them to the frequentist ones,we have built a representative graph,that includes only edges with a probability of being present greater than70%.Fig.3reports such a graph. Furthermore,in the graph we have reported,for each edge estimated to be present, the corresponding posterior probability of their presence.Finally,we have not re-ported nodes which are completely disconnected from other variables,as they are marginally independent from all others.
From Fig.3we remark that,having chosen a high threshold probability,the re-sulting model is rather simple,containing only13edges.From Fig.3,it appears that the graph breaks up into5disconnected components,corresponding to di erent consumer behaviours.This is similar to what occurs in the exploratory frequentist analysis,choosing a threshold odds ratio of5.
We now describe,from a substantial viewpoint,Fig.3,and compare it with the results obtained from the frequentist approaches.Note that,in the following,all associations are meant to be positive(e.g.with an odds ratio greater than unity) unless when otherwise indicated.
Theÿrst cluster to the left identiÿes an association between milk,cookies and rice,with cookies acting as separators.This same cluster was found also with the frequentist approach.If the association between milk and cookies is clear,less in-terpretable is that with rice products.These are products which are quite stable in sales across the year.
The second cluster to the bottom left identiÿes fruit juices and beer,and was also found in the frequentist approach.Both these products have highly variable sales along the year.
The third cluster to the bottom right associates liquors,mayonnaise,soft drinks and frozenÿsh.These connected components contain products which are quite oc-casionally bought.We remark that the frequentist analysis found a smaller cluster than this,without frozenÿsh being present.540P.Giudici,G.Passerone/Computational Statistics&Data Analysis38(2002)533–541
The fourth cluster to the top right identiÿes,as the previous one,a category of fast-food customers,associating soft cheese and crackers.The frequentist analysis found a biggerclusterthan this,which associated,besides soft cheese and cr acker s, frozenÿsh.
Theÿfth clustercontains6nodes,descr ibing seasonal sales,with multipack ice-creams acting as a separator.The other nodes are takeaway ice creams,tin meat, tomato soup,brandy,olive oil.Tomato soup is negatively associated with multipack, takeaway icecreams and olive oil,re ecting di erent seasons of peak buy.We re-mark that the frequentist analysis discovered a smaller cluster here,containing only the association between the two ice-cream products.
Comparing the Bayesian results with the classical ones,note that there are strong similarities.Both identify the same number of disconnected clusters,which contain similar products.Furthermore,note that most edges contained in Fig.3are also contained in Fig.1.The exceptions are the edges between multipack icecreams and, respectively,olive oil,brandy and tomato soup and,on the other hand,the two edges connecting frozenÿsh to soda and mayonnaise.Of course,Fig.1is obtained, as we have often remarked,taking an approach much more favourable to detect edge presence and,therefore,it contains many more edges than Fig.3.
6.Concluding remarks
In this paper we have concentrated on the problem of model choice for discrete undirected graphical models,with application to modelling association structures in market basket analysis.We haveÿrst proceeded exploratorily,looking at the esti-mated odds ratios.This has allowed us to obtainÿve homogeneous groups which could be analysed separately.The results have been conÿrmed in a following infer-ential analysis,performed using graphical markov models.
Our results show that results on associations present between products can be used to organise the layout of the supermarket and,by means of promotions,to increase sales not only directly,but indirectly,through the promotion of associated products.This also suggests that products heavily associated should never be put simultaneously on promotion.
In the second part of the paper,we have shown that bayesian Markov Chain Monte Carlo techniques can be a useful and valid tool to identify association rules in market basket analysis.Selection of the most likely association structure analysis has been performed directly on the joint226contingency table of the products whereas,in the classical case,this was infeasible,and we had to resort to an approximate solution based on the study of marginal associations.A further advantage of the bayesian analysis is the clarity in the interpretation of the results.The disadvantage is the need to specify a prior distribution and the di culty in implementing a generic software code.
Comparison of the methodologies to our available data show many similarities, with product categories breaking up intoÿve independent clusters,associated within themselves.P.Giudici,G.Passerone/Computational Statistics&Data Analysis38(2002)533–541541 From a subject-matter viewpoint,most of the associations we found were not known a priori,and indeed may not seem very obvious.However,the problem of interest of the providers of this dataset is exactly toÿnd such unknown associations, and this is what makes it a di cult data mining problem.
With this respect,we remark that,if the aim of the inference is heavily focused on quantitative learning aspects we suggest considering,as a more powerful MCMC alternative,the reversible jump approach suggested in Giudici and Green(1999) which does MCMC model determination over both the model and the parameter space and,therefore,allows to extract more inferences on parameters of interest, such as odds ratios.On the other hand,such an algorithm is certainly more di cult to implement and test.
Finally,we remark that the methodology could then be extended to similar cus-tomer behaviour problems,such as forecasting TV and web shares and web mining. Further methodological research is indeed needed,particularly toÿnd ways to speed up practical convergence of the MCMC algorithm.On the other hand,an important direction of applied research is the development of appropriate software code to implement graphical models and MCMC model search,so that they become rou-tinely available.
Acknowledgements
We thank Stefano Paggi of AC Nielsen Italy forhaving pr ovided the data,and Roberto Castelo(University of Amsterdam)for having provided the software for the MCMC bayesian analysis.Finally,the frequentist analysis of the data has been done using SAS System6.12and Enterprise Miner v3:0,forwhich we acknowledge support from SAS Institute Italy.
References
Brooks,S.P.,1998.Markov Chain Monte Carlo Method and its Application.The Statistician47, 69–100.
Giudici,P.,Castelo,R.,2001.Improving Markov Chain Monte Carlo for data mining.Mach.learning, to appear.
Giudici,P.,Green,P.J.,1999.Decomposable graphical gaussian model determination.Biometrika86, 785–801.
Heckerman,D.,Geiger,D.,Chickering,D.,1995.Learning bayesian networks:the combination of knowledge and statistical data.Mach.Learning20,197–243.
Lauritzen,S.,1996.Graphical Models.Oxford University Press,Oxford.
Madigan,D.,York,J.,1995.Bayesian graphical models for discrete data.Int.Stat.Rev.63,215–232. Passerone,G.,Giudici,P.,2000.Modelli associativi per la Market Basket Analysis ed il web mining. Atti del convegno SUGI Italia2000.
Whittaker,J.,1990.Graphical Models in Applied Multivariate Statistics.Wiley,Chichester.
