Customer Analysis for Software XploRe
From Data Mining to Marketing Strategy
Diploma thesis (Diplomarbeit)
submitted for the academic degree of
Master of Science
at the School of Business and Economics (Wirtschaftswissenschaftliche Fakultät)
of Humboldt-Universität zu Berlin
Submitted by
Jianqiu Wang
on 27 May 2003
Matriculation no.: 161426
Examiner: Prof. Dr. Wolfgang Härdle
-
Contents
Abstract
Introduction
1. Customer analysis
1.1 Customer Behaviour
1.1.1 Customer's Black Box
1.1.2 Consumer buying process
1.1.3 Customer behaviour model
1.1.4 Factors influencing customer buying behaviour
1.2 Market Segmentation and Profiling
1.2.1 Market segmentation
1.2.2 Customer profiling
1.3 Market targeting and Positioning
1.3.1 Market Targeting
1.3.2 Positioning
2. Data Mining
2.1 The process of Data mining
2.1.1 Data Collection and Selection
2.1.2 Data Preparation
2.1.3 Mining
2.1.4 Result Interpretation
2.2 The Aspects of Data Mining
2.2.1 Applications
2.2.2 Operations
2.2.3 Data Mining Techniques
3. XploRe user and customer analysis
3.1 About XploRe
3.2 XploRe user (2002) and customer descriptive analysis
3.2.1 Data collection
3.2.2 Data cleaning and preparation
3.2.3 Data descriptive analysis and result
3.2.4 Comparing the user and customer of XploRe
3.2.5 Measures of Improvement
3.3 Cluster analysis for XploRe user data 2002
3.3.1 Cluster analysis of categorical data
3.3.2 Clustering with IBM Intelligent Miner
3.3.3 Cluster analysis with XploRe
3.3.4 Comparison of Cluster Analysis Results: IBM Intelligent Miner versus XploRe
3.4 Analysis of the latest User data (2003)
3.4.1 Results of analysis of 2003 data
3.4.2 Comparison of historical user data
3.5 Complementary analysis
3.5.1 Analysis of regrouped data
3.5.2 Analysis of high profitable sector
4. Suggested marketing strategy for XploRe
4.1 Marketing Strategy and Marketing mix
4.1.1 Marketing strategy
4.1.2 Marketing Mix
4.2 Develop the marketing strategy for XploRe
4.2.1 Niche market strategy
4.2.2 Target Market
4.2.3 Product position of XploRe
4.2.4 General XploRe marketing strategy pyramids
4.2.5 General Marketing Mix
4.2.6 Special marketing mix for clusters
4.2.7 Marketing research - suggestions for further analysis
References
Appendix
Appendix 1: User 220702 Frequency Analysis
Appendix 2: Customer Frequency Analysis (Nov. 05)
Appendix 3: Customer Registration form
Appendix 4: Characteristics of User220702 Clusters by XploRe
Appendix 5: User 130303 Frequency Analysis
Appendix 6: User 13032003 Intelligent Miner Cluster Analysis
Appendix 7: Comparison of User and Regrouped User Data
Appendix 8: User 130303 (Regrouped) Frequency Analysis
Appendix 9: Regrouped User Intelligent Miner Cluster Analysis
Appendix 10: Institute Users Frequency Analysis
Declaration of authorship (Erklärung zur Urheberschaft)
-
List of Figures
1.1 The customer's black box
1.2 A sequential model of the buying process
1.3 Consumer Behaviour model
1.4 Factors influencing consumer behaviour
1.5 The process of marketing segmentation
1.6 Alternative consumer demand categories
1.7 SAGACITY
1.8 Targeting strategies
3.1 Sample of online survey questionnaire
3.2 Clustering of Users 2002
3.3 Clustering of user 2003
3.4 Software used in 2000 and 2003
3.5 Information resource in 2000 and 2003
3.6 Clustering of regrouped user data
4.1 4P of marketing mix
-
List of Tables
1.1 Broad-based ACORN classifications
1.2 National readership survey socio-economic groups
2.1 The aspects of data mining
3.1 Summary and description of the variables of User 22/07/02 data
3.2 Summary and description of the variables for customer data
3.3 Comparison of XploRe's Users and Customers
3.4 Characteristics of User IBM Intelligent Miner Clusters (2002)
3.5 Comparison of Clustering results with IBM Intelligent Miner and XploRe
3.6 Summary and description of the variables for User data 2003
3.7 Comparison of User 220702 and User 130303
3.8 Comparison of software used in 2000 and 2003
3.9 Comparison of information resources in 2000 and 2003
3.10 Comparison of country in 2000 and 2003
3.11 Comparison of continent in 2000 and 2003
3.12 Comparison of User clusters of 2000 and 2003
3.13 Summary and description of the variables of regrouped User data 2003
3.14 Comparison of Institute user and General user
-
Abstract
This thesis presents a case study of customer analysis with the purpose of
developing a marketing strategy for the statistical software XploRe. The
customers analysed include the users, who downloaded the free trial version of
XploRe through the web site, and the actual customers, who bought XploRe.
Descriptive analysis was conducted for both data sets, which led to the
conclusion that research institutes are the high-profit sector for XploRe. For
the user data, the data mining method of clustering was applied to identify the
customer segments. Two different clustering methods were tested on the same
user data set with different software, IBM Intelligent Miner and XploRe. As a
result, the users of XploRe were divided into four clusters by both methods:
Internet surfer, Academia, Linux user and Home worker. Through the comparison
of the user data of 2003 with the historical data of 2000, more facts and trends
of the XploRe market and its customers were discovered regarding the software
used, information resources, new markets and the ongoing changes in customer
segments. Based on the results of the customer analysis, suggestions for
marketing strategy, marketing mix and further analysis were outlined.
Key words: customer analysis, market segmentation, data mining, clustering,
marketing strategy, marketing mix
-
Introduction
Customer analysis is a crucial step in the development of a marketing strategy.
Only when a company has a clear view of its customers can the proper strategy
and actions be undertaken to gain competitive advantage in the market.
With the development of digital data management systems, the capability of
gathering, storing and accessing information has improved dramatically. This
trend creates a difficulty for companies when they confront the huge amount of
data. Data mining is an important technology for companies that need to conduct
customer analysis on large data sets. It discovers valuable information which
is useful for marketing.
The research presented in this paper tries to segment the customers and to find
the trends and facts of the XploRe market, so that suggestions for a marketing
strategy can be derived from the results. XploRe is a statistical software
package which aims at sophisticated users who are looking for a flexible,
programmable statistics package with an emphasis on more advanced procedures.1
It is important for the XploRe marketer to understand its customers and market.
The customer data studied here include the data of XploRe users (the potential
customers) and actual customers (the buyers). The user data was collected
through an online questionnaire preceding the download of the XploRe trial
version, while the customer data was gathered through the returned registration
forms. For the purpose of comparison, two sets of user data were analysed and
two clustering methods were tested with two software packages, IBM Intelligent
Miner and XploRe. The user data 2002 covers October 11, 2001 to July 22, 2002
and contains 1734 profiles. The raw user data 2003 contains 2593 profiles and
was collected from October 11, 2002 to March 13, 2003. The customer data
includes 32 profiles from July 1, 2000 to August 30, 2002.
Only descriptive analysis was carried out for the customer data because of its
small number of records. For the user data, the data mining process of
clustering was conducted to segment the market. The mining run for the user data
consists of several steps: cleaning the raw data with MS Excel, transferring the
data to IBM Intelligent Miner or XploRe, and performing the cluster analysis.
The clustering identified four groups of XploRe customers, namely Internet
surfer, Academia, Linux user and Home worker. Each cluster has its own
distinguishing features.
1 Härdle, Klinke and Müller, 1999, P17.
The comparison of the customer data and the user data 2002 led to the discovery
of a highly profitable sector, research institutes. XploRe and IBM Intelligent
Miner (IM) delivered similar clustering results for the user data, but IM
performed better in visualisation and computational efficiency. Comparing the
user data 2003 with the historical user data 2000, some trends were identified.
More professional users switched to command-driven software. XploRe made
progress in its communication channels. Asia, especially Japan, emerged as a new
market. In terms of segments, Internet surfer is a brand-new group in 2003,
which indicates the arrival of the Internet age. The appearance of Home worker
in 2003 in place of Researcher in 2002 hints at a problem in the survey
questionnaire. More members of Academia use non-personal channels to get
information, which again confirms the improvement made by XploRe in its
communication channels. Linux users were very stable during the period.
Based on the findings of the analysis, some suggestions for marketing strategy
and further analysis were made for the XploRe marketer.
This paper consists of four main parts. The first two sections following the
introduction lay the theoretical foundation for customer analysis and data
mining. Section three presents the analysis and its results. Marketing strategy
and suggestions are developed in the fourth section. At the end, the summary
gives a brief overview of the whole paper.
-
1. Customer analysis
In the current market space, competition is intense. The market abounds with
all kinds of products. To win customers' decisions in favour of their products,
companies should gain a deep insight into what the customers really need and
how to influence their purchasing decisions. Therefore, companies should adopt
a customer focus, conducting business with the emphasis on understanding the
customers and the market.
Customer analysis is the study of customers and their behaviour, which is
central to achieving a customer focus.2 The purpose of conducting customer
analysis is to achieve marketing goals, such as the following:3
Customer acquisition: finding new customers
Customer cross-sell: further sales of different products to the same customer
Customer up-sell: the customer makes greater use of the same product or service
Customer retention: keeping the customer loyal
1.1 Customer Behaviour
To understand customer buying behaviour, we should first understand customer
behaviour.
1.1.1 Customer's Black Box
Customer behaviour here means the behaviour of individuals who purchase for
private or household consumption. These customers buy goods which are not part
of a value chain, and the purpose of purchasing is not to generate profit.
Buying behaviour depends on the individual's reaction to internal and external
stimuli; therefore, it is difficult to predict. Black box is the term that
describes the customer's purchasing decision process, which is difficult to
access but crucial for the outcome of the purchase.
2 WWW14
3 Heygate, Richard, 1998.
In order to develop appropriate products that are attractive to customers,
firms need insight into what happens in the black box. Figure 1.1 presents the
customer's black box. Inside the black box, the customer gathers information,
evaluates and compares, and then comes to a decision. This is called the
consumer buying process.
Fig. 1.1: The customer's black box. (Figure not reproduced. The marketer's 7Ps
(product, price, place, promotion, people, process, physical environment) and
external stimuli (social pressure, legal requirements, physical factors,
economic cycle) act on the consumer's black box: identification of needs,
evaluation of offers that satisfy the need, comparison of substitute products
and brands, purchase, post-purchase evaluation. The black box is shaped by
aspirations, motivation, education, personality and beliefs.)
1.1.2 Consumer buying process
Buying decision process
The buying process starts with the customer's desire for a product. This want
might be the result of internal stimuli, like hunger and thirst, or of external
stimuli, such as advertising.
The next step is the search for information. Consumers may collect information
consciously or unconsciously from various sources. There are four kinds of
information sources:
1. Personal sources such as family, friends, colleagues and neighbours;
2. Public sources such as the mass media and consumers' organisations;
3. Commercial sources such as advertising, sales staff and brochures;
4. Experiential sources such as handling or trying the product.
Fig. 1.2: A sequential model of the buying process. (Figure not reproduced:
recognition of the problem, the search for information, evaluation of the
alternatives, the purchase decision, post-purchase behaviour.)
3 Bannes, E., McClelland, B., etc., 1997, P139.
Through information gathering, customers become aware of the various products
and brands in the market. They then evaluate the alternatives and finally make
the purchase decision.
After purchasing major items, many people experience cognitive dissonance, also
called post-purchase anxiety. They wonder whether they have made the correct
purchasing decision. To reduce this anxiety, they look for confirmation. For
example, they might ask friends to confirm that their purchase was the right
choice.
Figure 1.2 summarises the stages of the consumer buying process: recognition of
the problem, the search for information, evaluation of the alternatives, the
purchase decision and post-purchase behaviour.
Companies should be present in each stage of the buying process and try to
stand out among all the products and brands of competitors. For a brand or
product to become the customer's final choice, companies need a clear
understanding of the evaluative criteria that consumers use in comparing
products.
3 Wilson, R. W. S. and Gilligan, C., P170.
Five buying roles
The purchase process normally involves several persons, each with a distinct
role. The roles need not be held by different persons; one person can play
several roles in a purchasing process.
The five roles in a purchasing process are:
The initiator: the person who suggests buying the product or service.
The influencer: the person whose comments can affect the purchasing decision.
The decider: the person who decides whether to buy and which product to buy.
The buyer: the person who executes the purchase.
The user: the final consumer of the product or service.
For example, a mother buys ice cream for her child. The child is the user; the
mother is the decider and buyer. The company should understand the function
that each role plays in the buying process in order to influence the customer's
buying decision effectively through proper action.
1.1.3 Customer behaviour model
The customer behaviour model describes the procedure and the basic elements of
what happens inside the customer's black box, that is, in the consumer buying
process.
The most basic, simplest and best-known model of buyer behaviour is AIDA, which
stands for Awareness, Interest, Desire and Action.4
The model introduced here is composed of six interrelated components.5
1. Information or facts (F) refers to the percept caused by a stimulus.
2. Product recognition (R) defines the extent to which the buyer knows enough
about the product to distinguish it from other products.
3. Attitude towards the product (A) refers to what the customer expects from
the product to satisfy their particular needs.
4. Confidence in judging the product (C) is the customer's degree of certainty
that his or her evaluative judgement of a product is correct.
5. Intention to buy (I) is the mental state that reflects the customer's plan
to buy some specific number of products from a particular brand in some
specified time period.
6. Purchase (P) is caused by the intention to buy. It is defined as the point
at which the customer has paid for a product or has made some financial
commitment to buy some specified amount during some specified time period.
Fig. 1.3: Consumer Behaviour model. (Figure not reproduced. Legend: F =
Information, R = Product recognition, C = Confidence, A = Attitude, I =
Intention, P = Purchase.)
4 Baker, M. and Hart, S., 1999, P63.
5 Howard, J. A., 1994, P31-56.
When consumers evaluate a product, they also employ certain evaluative criteria,
which have several aspects:
1. The product's attributes, such as its price, performance, quality and styling.
2. Their relative importance to the consumer.
3. The consumer's perception of each brand's image.
4. The consumer's utility function for each of the attributes.
These evaluative criteria overlap with the elements of the consumer behaviour
model. For instance, product recognition, attitude towards the product and
confidence in judgement are the three parts of the buyer's image of a product.
They all have a vital impact on the consumer's buying decision.
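One common way to make the four aspects of the evaluative criteria concrete is a weighted-sum (multi-attribute) score: each brand's perceived rating on an attribute is passed through a utility function, weighted by the attribute's relative importance, and summed. The sketch below is an illustration only and is not a model taken from the cited sources. The function name `brand_score`, the attribute weights, the brand ratings and the default identity utility function are all hypothetical assumptions.

```python
from typing import Callable, Dict


def brand_score(
    importance: Dict[str, float],
    ratings: Dict[str, float],
    utility: Callable[[float], float] = lambda rating: rating,
) -> float:
    """Return the importance-weighted sum of attribute utilities for one brand.

    Assumption: a linear weighted-sum model with an identity utility by default.
    """
    total_weight = sum(importance.values())
    if total_weight <= 0:
        raise ValueError("importance weights must sum to a positive value")
    return sum(
        (weight / total_weight) * utility(ratings[attribute])
        for attribute, weight in importance.items()
    )


# Hypothetical relative importance of each attribute to one consumer.
importance = {"price": 4, "performance": 3, "quality": 2, "styling": 1}

# Hypothetical perceived ratings (1 = poor, 5 = excellent) of two brands.
perceptions = {
    "Brand A": {"price": 2, "performance": 5, "quality": 4, "styling": 3},
    "Brand B": {"price": 5, "performance": 3, "quality": 3, "styling": 2},
}

scores = {brand: brand_score(importance, r) for brand, r in perceptions.items()}
preferred = max(scores, key=scores.get)
print(scores, preferred)
```

With these made-up numbers Brand B scores higher than Brand A, because price carries the largest weight and Brand B is rated best on price. A different consumer with other importance weights could rank the same two brands the other way round.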
Fig. 1.4: Factors influencing consumer behaviour. (Figure not reproduced. The
buyer is influenced by cultural factors (culture, sub-culture, social class),
environmental factors (economic cycle, social pressure, legal requirements, new
technology), social factors (reference groups, family, roles and status),
psychological factors (motivation, learning, perception, beliefs and attitudes)
and personal factors (age and life cycle stage, occupation, economic
circumstances, lifestyle and personality).)
1.1.4 Factors influencing customer buying behaviour
Various factors influence customer buying behaviour. Generally we can put them
into five categories: psychological factors, cultural factors, social factors,
personal factors and environmental factors.6 7 8
1. Psychological factors
Human needs include basic needs, like shelter, food and drink, and higher-level
needs, such as friendship and achievement. People purchase goods to satisfy
their needs. Purchasing behaviour can be considered the result of internal and
external stimuli.
Maslow (1943) suggested that behaviour can be explained by a hierarchy of needs.
He grouped people's needs into five levels and argued that when a person is
satisfied at one level of needs, he will strive for the next level. Maslow's
five levels of needs are physiological needs, safety needs, social needs,
esteem needs and self-actualisation needs.9
Physiological needs are the basic needs for human survival, such as food and
drink. Only after these needs are satisfied will the other levels of needs be
desired.
6 WWW11
7 Bannes, E., etc., 1997, P139-149.
8 Environmental factors are external factors, while the other four factor
categories are internal factors that influence consumer buying behaviour.
9 Bannes, E., McClelland, B., etc., 1997, P139-184.
Safety needs refer to people's needs for security, stability and
predictability. Services such as insurance and guarantees are products that
satisfy safety needs.
Social needs explain the human desire for love and a sense of belonging. At
this level, people seek to join associations and clubs.
Esteem needs form the next level. They show themselves in the search for
status, esteem, achievement and recognition. To satisfy this level of needs,
people turn to luxury products, like perfumes, high-tech products and cars.
Self-actualisation is the highest level of needs. Only after people have
satisfied all the other levels do they turn to the realisation of their
potential, which is expressed in concern for external issues, such as volunteer
work.
2. Personal factors
Personal factors are the set of the buyer's personal characteristics, including
age, occupation, lifestyle, personality and economic circumstances.
3. Cultural factors
Cultural factors include culture, sub-culture and social class.
Culture is a set of shared values which define people's behaviour. Language is
the best example of cultural difference: using a language incorrectly causes
misunderstanding. There are also differences in attitude between eastern and
western cultures towards family and the individual.
A large society or culture is normally divided into subculture groups, which
define more subtle behavioural norms. Subculture groups include ethnic groups,
religious groups, racial groups and geographical groups. They differ in
cultural preferences, ethnic taste, attitudes, lifestyle and taboos.
Social class is also called socio-economic group. It is determined by income
level, education and occupation. The often-used social class model divides
society into upper class, upper middle class, lower class, upper working class,
working class and others.
4. Social factors
Social factors include reference groups, family, social role and status.
Reference groups are defined as all groups that have a direct (face-to-face) or
indirect influence on the person's attitude or behaviour.10 Reference groups can
be divided into four types.
1. Primary membership groups are generally informal, and their members interact
with one another, such as family, neighbours, colleagues and friends.
2. Secondary membership groups are more formal than primary membership groups,
and there is less interaction between members. These include religious groups,
professional groups and trade unions.
3. Aspirational groups are groups that one would like to belong to.
4. Dissociative groups are groups whose values and behaviour are rejected by
the individual.
5. Environmental factors
Environmental factors consist of economic, social, political and technological
aspects. The economic cycle, social pressure, legal requirements and new
technology all influence the consumer's decision on which product to buy and
how to buy it.
1.2 Market Segmentation and Profiling
When firms try to sell their products in customer markets, they should not only
try to identify the factors that influence the customer's black box, but also
estimate whether there are enough customers who need their offer. It is
important for companies to compare their capabilities with the objectives of
customers, so that they can decide whether they are able to serve the market
with appropriate products profitably. Therefore, firms must identify the market
need, segment the total customer base into potential customer groups which are
likely and able to purchase the offer, and position the product or service as
an attractive alternative to the other offers available to the target groups.
10 Wilson, Gilligan and Person, 1994, P160.
1.2.1 Market segmentation
Market segmentation is the subdivision of a market into distinct subsets of
customers, where any subset may conceivably be selected as a target market to
be reached with a distinct marketing mix.11
Market segmentation is inspired by Kotler's target marketing. As Kotler said,
in target marketing the seller distinguishes the major market segments, targets
one or more of these segments, and develops products and services tailored to
each selected segment.12
Because individuals differ in preferences, characteristics, tastes and
interests, their buying behaviour patterns are varied and heterogeneous. It is
almost impossible, or at least unprofitable, for a company or a single product
to serve all of these needs. Furthermore, communicating the marketing mix to a
non-homogeneous group is inefficient. Therefore, companies search for the
groups with attractive attributes and then concentrate on them, developing
specific products and services and using specific marketing resources to gain
the maximal market return.
Segmentation identifies the subsets of buyers who share similar needs and
demonstrate similar buying behaviour. It subdivides a heterogeneous total
customer market into smaller, manageable and homogeneous clusters according to
chosen criteria. Within each cluster there are similar patterns of buyers'
needs and buying behaviour, which are identifiable and relevant to the buying
decision.
Customer segmentation brings major benefits to companies:13
Efficiency: Because the customers are subdivided, companies can focus only on
the markets of interest. Therefore, they can allocate and use their resources
more efficiently.
Effectiveness: Through segmentation, the needs of each customer segment can be
better identified and examined. Thus, the understanding and awareness of
customer needs is enhanced. Companies can tailor their products and marketing
measures to meet customer needs more effectively. Owing to the improved
marketing effectiveness, the customer response rate increases, and so do the
return and profit from marketing investment.
New market: Segmentation can help companies to identify new market
opportunities. The needs and characteristics of the total customer market are
so diverse that the unique features of a small group are not distinguishable.
After segmentation, a company can discover those markets with unique features.
They may offer valuable opportunities for the company to enter new markets.
11 Kotler, 1995, P286.
12 Kotler, 1991, P262.
13 WWW29.
Fig. 1.5: The process of marketing segmentation. (Figure not reproduced:
defining the market, selecting the base for segmentation, dividing the market
and profiling.)
The process of market segmentation14
The process of market segmentation is composed of three steps.
1. Defining the market
The total market for a product or service comprises all of the consumers who
desire or potentially desire it, and who are willing and able to buy it. It is
necessary to analyse the market in terms of its size and pattern of demand.
There are three patterns of demand:15
1. Homogeneous demand
All consumers in a market have similar needs and wants.
2. Diffused demand
Consumers' needs are diverse and no clear segments can be identified. This
suggests the need for customisation.
3. Clustered demand
Consumers' needs and desires can be grouped into several identifiable segments.
Each has its own set of purchase criteria.
Fig. 1.6: Alternative consumer demand categories. (Figure not reproduced:
homogeneous demand, diffused demand and clustered demand, as described above.)
14 Bannes, E., McClelland, B., and Meyer, R., 1997, P181-185.
15 Bannes, E., McClelland, B., etc., P181-183.
2. Selecting the approach and bases for segmentation
Market segments can be identified on the basis of detailed market research, or
of a basic analysis of the customer data held within a company. Many companies
keep customer records detailing information such as age and gender.
There are generally two types of methods of market segmentation.16 17
1. A priori methods:
In an a priori approach, the basis for segmentation is set in advance. Primary
market research is not necessary. Instead, the analysis of secondary data
sources, the customer information at hand, managers' intuition and other
methods are employed to set the segmentation basis for the buyers according to
their usage patterns (heavy, medium, light and non-user), demographic
characteristics (age, sex, income) or psychographic profiles (personality).
After the basis has been set, research is conducted to identify the size,
location and potential of each segment. The marketing decision then concerns
which segment the marketing efforts should concentrate on. For example,
classification is an a priori approach.
2. Post hoc methods:
A post hoc approach segments the market according to the research findings,
rather than deciding the segmentation basis in advance. Primary market research
is conducted to collect the classification and descriptor variables. Segments
are defined only after all the relevant information has been collected and
analysed. The research might highlight the particular attributes, attitudes or
benefits with which particular groups of customers are concerned. The result
then becomes the basis for dividing the market.
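The contrast between the two approaches can be sketched in a few lines of code. In the a priori function the usage classes and their cut-offs are fixed before any data is examined, while in the post hoc function the groups emerge from the data through a simple k-means clustering. This is a minimal illustration under stated assumptions, not the procedure used in this thesis: the function names, the thresholds, the sample records and the choice of k-means with a fixed start are all hypothetical, and the clustering of categorical survey data with IBM Intelligent Miner or XploRe works differently.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def a_priori_segment(purchases_per_year: int) -> str:
    """A priori: usage classes and cut-offs are fixed before any data is seen.

    The thresholds are illustrative assumptions.
    """
    if purchases_per_year <= 0:
        return "non-user"
    if purchases_per_year <= 3:
        return "light"
    if purchases_per_year <= 11:
        return "medium"
    return "heavy"


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two two-dimensional points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def post_hoc_segments(points: Sequence[Point], k: int, iterations: int = 20) -> List[int]:
    """Post hoc: plain k-means, so the segments emerge from the collected data."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = list(points[:k])  # deterministic start: the first k observations
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every customer to the nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels


# Hypothetical survey records: (age, purchases per year).
customers = [(22, 1), (58, 14), (25, 2), (61, 12), (24, 0), (55, 15)]

usage_counts = Counter(a_priori_segment(purchases) for _, purchases in customers)
clusters = post_hoc_segments(customers, k=2)
print(usage_counts)
print(clusters)
```

The a priori result depends only on the predefined thresholds, so it can be computed for a single customer in isolation. The post hoc labels depend on the whole data set: adding or removing records can move the centroids and change which group a customer falls into.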
3. Dividing the market and profiling the segments
Based on the data gathered, the market is divided into identifiable market
segments. The information obtained gives details regarding the nature of the
customer segments. This is called segment profiling. Profiling associates each
segment with certain characteristics, aggregates the customers with similar
characteristics into groups and separates them from those with different
characteristics.
Criteria of customer segmentation
A market can be segmented in various ways. Segmentation has its problems, such
as the relevance and quality of the data, the reliance on intuition, the need for
a continuous process, and over-segmentation. A good segmentation should be
relevant to buying behaviour and satisfy the following requirements:18 19
16 WWW31
17 Han, J. and Kamber, M., 2001, P281-319.
Size: The market should be big enough to warrant segmentation. It is dangerous
to over-segment an already very small market.
Difference: Differences between the members of the segments should exist and
should be measurable through data collection.
Measurability: The company is able to collect information that measures the
nature of buying behaviour for the segmentation.
Substantiality: The selected segment should be profitable with regard to the
marketing mix resources designed especially for it.
Accessibility: The extent to which the marketing effort can reach the segment.
Stability over time: The segmentation should last for a certain period without
dramatic change in its major features.
Responsiveness to communication means: The segment should be sensitive to the
marketing mix and communication means.
Variables for customer segmentation
Almost all factors that affect the customers' buying process and decisions can
be used as variables for customer segmentation. Generally, the variables for
customer segmentation can be put into five categories: Demographic,
Socio-economic Grade, Psychographics and Lifestyle, Behavioural, and Geographic
and Geo-demographics.20 21
1. Demographic variables
Demographic variables categorise the market according to population
characteristics and population profiles. Customers are subdivided into groups
based on one or more demographic variables such as age, sex, religion, race,
nationality, family size and stage of the family life cycle. For example, a seller
may group customers by age: customers aged 20-30 form a group that is more
likely to purchase trendy items.
18 WWW20
19 Wilson, R. and Gilligan, C., 1997, P275.
20 Kalakota, R. and Whinston A. B.
21 McDonald M. and Dunbar I., P85-91.
ACORN group                                  1981 population      %
A  Agricultural areas                              1,811,485    4.3
B  Modern family housing, higher incomes           8,667,137   16.2
C  Older housing of intermediate status            9,420,477   17.6
D  Older terraced housing                          2,320,846    4.3
E  Better-off council estates                      6,976,570   13.0
F  Less well-off council estates                   5,032,657    9.4
G  Poorest council estates                         4,048,658    7.6
H  Multi-racial areas                              2,086,026    3.9
I  High-status non-family areas                    2,248,207    4.2
J  Affluent suburban housing                       8,514,878   15.9
K  Better-off retirement areas                     2,041,338    3.8
U  Unclassified                                      388,632    0.7
Tab. 1.1: Broad-based ACORN classifications 23
2. Geographic and Geo-demographics
Geographic segmentation divides the market into different geographic units such
as countries, regions, counties, cities and postcodes. Geographic systems are
based on the proposition that the neighbourhood in which you live will be
reflected in your professional status, income, life stage and behaviour. The
neighbourhood types are initially identified using national census data.
ACORN (A Classification of Residential Neighbourhoods) is an example of a
geographic system. ACORN classifies consumers into 43 demographically and
behaviourally distinct clusters. The clusters are based on the type of
neighbourhood, socio-economic status, and buying behaviour and preferences.22
A broad-based ACORN classification was conducted in Great Britain in 1981. It
segments the residents of Great Britain into 12 categories.
3. Socio-economic Grade
Buying behaviour is often influenced by the social class of a person. The
relevant factors include income, status, education, etc. The National Readership
Survey scale is one of the popular classifications and is based on the occupation
of the main wage earner of the household.
22 Kurs, M., Ryan, B., Lamb, G. etc., 2001.
23 Bannes, E., McClelland, etc., 1997, P201.
Grade  Social classification   Occupation
A      Upper middle class      Higher managerial, professional or administrative jobs
B      Middle class            Middle managerial, professional or administrative jobs
C1     Lower middle class      Supervisory or clerical jobs, junior management
C2     Skilled working class   Skilled manual workers
D      Working class           Unskilled and semi-skilled manual workers
E      Subsistence level       Pensioners, unemployed, casual or low-grade workers
Tab. 1.2: National Readership Survey socio-economic groups 24
A further development of the life-stage and socio-economic grade model is
SAGACITY, developed by Research Services Ltd. This model combines life stages
with income and social class.
4. Psychographic variables
Psychographics attempts to classify individuals by their attitudes, personality
and life styles.
(1)Personality
Personality is used as a variable to segment the market. The earliest such
segmentation was conducted by Riesman et al. (1950). It identified three distinct
types of social characterisation and behaviour: 25
1. Tradition-directed behaviour, which changes little over time and which, as
a result, is easy to predict and to use as a basis for segmentation.
2. Other-directedness, in which the individual attempts to fit in and adapt to
the behaviour of the peer group.
3. Inner-directedness, where the individual is seemingly indifferent to the
behaviour of others.
(2) Attitude
Attitude includes the customers' attitudes towards risk, degree of loyalty, the
likelihood of trying new products, etc. Many of the personality variables can
also be used as descriptors of attitude.
24 Kurs, M., Ryan, B., Lamb, G. etc., 2001.
24 Blois Keith, 2000, P389.
25 Wilson, Gilligan and Pearson, 1994, P291.
Fig. 1.7: SAGACITY (life cycle: Dependent, Pre-family, Family, Late; income:
Better off, Worse off; occupation: White-collar, Blue-collar).
(3) Lifestyle
The consumer's behaviour is also determined by the way we live our lives. It
arises from a complex relationship between our aspirations, current situation,
perception of self, income and attitudes. Lifestyle market segmentation offers a
detailed view of buyers because it is composed of numerous characteristics
related to their activities, interests and opinions. Lifestyle consists mainly of
three dimensions: 26
1. Activities: Work, hobbies, social events, vacations, entertainment, club
membership, community, shopping, sports.
2. Interests: Family, home, job, community, recreation, fashion, food, media,
and achievements.
3. Opinions: Selves, social issues, politics, business, economics, education,
products, future, culture.
5. Behavioural variables
(1) Benefit sought variables
This group of variables for segmenting customers considers the motive for a
purchase. It groups consumers according to the specific benefits that they seek
in a product. Even if two customers buy exactly the same product, the benefits
they expect may vary. Benefit segmentation is therefore based on behavioural
processes, involving thought and action, as opposed to age and socio-economic
class, which are defined according to individual characteristics. It closely
identifies the customers' needs and represents a powerful method of understanding
and influencing behaviour.
In applying this approach, a company should begin by attempting to measure
consumers' value systems and their perceptions of various brands within a given
product class. The information gathered is then used as the basis of market
segmentation. Benefit segmentation begins by determining the principal benefits
that the customers are seeking in the product, the kinds of people who look for
each benefit and the benefits delivered by each brand. For example, in the
toothpaste market, four segments are identified according to benefit: economy,
decay prevention, cosmetic and taste benefits.
26 McDonald, M. and Dunbar, I., 2000, P89.
(2) User status
The market can be divided into five segments according to user status:
non-users, ex-users, potential users, first-time users and regular users.
First-time users and potential users can be further subdivided on the basis of
usage rate.
(3) Loyalty Status and Brand Enthusiasm
Loyalty status categorises the customers on the basis of the extent and depth
of their loyalty to particular brands or products. Most typically there are four
categories: hard-core loyals, soft-core loyals, shifting loyals and switchers.27
1. Hard-core loyals are customers who consistently buy the same brand or
product.
2. Soft-core loyals are those who are willing to choose from a limited brand
set. Their loyalty is divided among the limited brands or products.
3. Shifting loyals are consumers who shift their loyalty from one brand to
another. After they shift brand, they no longer buy the previous brand.
4. Switchers are those who show no loyalty to any single brand. Their buying
pattern is typically determined either by the special offers available or by
their search for variety.
(4) Critical events
Major or critical events generate needs that can be satisfied by the provision
of a special collection of products and/or services. Typical examples are
marriage, the death of someone in the family, unemployment, illness, retirement
and moving house.
1.2.2 Customer profiling
Customer segmentation and customer profiling are two elements of Customer
Relationship Management (CRM). Customer profiling is performed after customer
segmentation. Its purpose is to locate clusters within the customer file that
outperform the average.28 It creates a customer segment profile, which labels
the customers with their attributes.
27 Wilson, Gilligan and Pearson, 1994, P291.
28 WWW18
Identifying the characteristics of the customers helps the company to decide
which segments will respond best to its marketing effort. Once companies have a
clearer overview of the attributes and demands of the customer segments, they can
decide what action should be taken and what resources should be allocated to the
selected customer segments. Furthermore, with pre-built models, customer profiling
can also be used to find potential customers and to delete inactive or bad
customers.
The profiling attributes are similar to the segmentation attributes. For example,
the profiling attributes include: geographic, cultural and ethnic, economic
conditions (income and/or purchasing power), age, values, attitudes and beliefs,
knowledge and awareness, lifestyle, media, and recruitment method. For acquired
customers, customer behaviour variables can also be employed as profiling
variables, such as shopping frequency, complaint frequency, degree of
satisfaction and preferences.
1.3 Market targeting and Positioning
1.3.1 Market Targeting
The next task after customer segmentation and profiling is market targeting.
Companies choose one segment or several segments as the target market. The
target market is the market that the company decides to serve. A specific
marketing mix and resources are developed to serve the target market.
Companies normally adopt one of three targeting strategies:29
Undifferentiated strategy: The company ignores the differences between customer
segments and regards the whole market as a single market. A single marketing mix
is adopted for the whole market. This is so-called mass marketing.
Differentiated strategy: The whole market is divided into several segments. The
company develops a different marketing mix for each segment.
Concentrated strategy: The company chooses one or several market segments but
adopts only a single marketing mix. Under this strategy, the company tries to
gain a high market share in one or several niche markets, instead of struggling
for a small share of the whole market. For firms with limited resources, this
strategy is very appealing.
28 Keith Blois, 2000, P398.
29 Amstrong, G. and Kotler, P., 2002, P255-258.
Fig. 1.8: Targeting strategies (undifferentiated: one marketing mix for the
entire market; differentiated: marketing mixes 1 to 3 for segments 1 to 3;
concentrated: one marketing mix for a selected segment).
1.3.2 Positioning
The purpose of target marketing is to focus on the selected target market and
fine-tune the marketing mix to provide a group of potential customers with
superior value, and thereby to build up a unique position for the product in the
customers' view. A product's position is the complex set of perceptions,
impressions and feelings that it induces in consumers, compared with competing
products.30 Positioning refers to how customers think about proposed and/or
present brands in a market.31 The fundamental idea of positioning is competitive
advantage.32 Through a differentiated marketing mix, the special needs and
demands of customers can be satisfied. Thus, the customers will view the product
or brand as superior to the others and give it a distinct position. To position
a product, the marketer must appeal strongly to the target customers with the
product's strengths and differences, using a proper marketing mix.
30 Bannes, McClelland, Meyer and Wiesehofer, 1997, P230.
31 WWW33
32 WWW30
-
2. Data Mining
Data mining, which is also known as Knowledge Discovery in Databases (KDD),33
is a powerful new technology that helps companies to identify the important
information in a sea of data. Data mining technology is commonly used for
customer analysis.
Fayyad defined data mining as a non-trivial process aimed at identifying valid,
novel, potentially useful and ultimately understandable patterns in data.34
Grameier and Rudolph consider data mining in terms of all methods and techniques
that make it possible to analyse very large data sets in order to extract and
discover previously unknown structures and relations out of such huge heaps of
details. This information is filtered, prepared and classified so that it becomes
a valuable aid for decisions and strategies.35
Data mining extracts implicit, previously unknown and potentially useful
information from the data in order to automate the process of discovering
significant patterns and trends.
2.1 The process of Data mining
The process of data mining can be summarised in four stages: data collection
and selection, data preparation, data mining, and result interpretation.36 37
2.1.1 Data Collection and Selection
The ways of data collection include:
In-house customer database: Companies normally keep records of customers.
Customer information can be gathered from mailing lists, receipts, memberships,
warranty registrations, etc.
External resources: There are external resources from which one can obtain
information such as demographic information.
Research survey: A frequently used way to collect particular information is to
conduct a survey. The survey can be conducted through face-to-face interviews,
telephone interviews, postal questionnaires or via the Internet.
33 Kotala, P., Perera, A., Kai Zhou, J., etc.
34 Fayyad, U., Piatetsky-Shapiro, G. et al., P6.
35 Grameier, J. and Rudolph, A.
36 IBM's Data Mining Technology, 1996.
37 Bounsaythip, C. and Rinta-Runsala, E., 2001.
During the collection of data, two types of variables should be collected:38
Classification variables classify the data set into groups. Most demographic,
geographic, psychographic or behavioural variables can be used to classify
customers into segments.
Demographic variables: Age, gender, income, ethnicity, marital status,
education, occupation, household size, length of residence, type of residence,
etc.
Geographic variables: City, state, zip code, census tract, county, region,
metropolitan or rural location, population density, climate, etc.
Psychographic variables: Attitudes, lifestyle, hobbies, risk aversion,
personality traits, leadership traits, magazines read, television programmes
watched, etc.
Behavioural variables: Brand loyalty, usage level, benefits sought, distribution
channels used, reaction to marketing factors, etc.
Descriptor variables are variables used to describe each subgroup in a data set
and to distinguish the subgroups from each other. In other words, the descriptor
variables stand for the characteristics of the data set they represent.
Descriptor variables must be easily obtainable variables that already exist in,
or can be appended to, the customer files. Many classification variables can
also be used as descriptor variables.
The data is normally stored in a data warehouse. Because the data warehouse
contains many diverse types of data, the first step in conducting data mining is
to select the data that will be used in the analysis.
38 WWW7
2.1.2 Data Preparation
Before the data can be analysed, the originally collected data must first be
prepared to make it suitable for the analysis. Data preparation consists of the
following stages:
1. Data cleaning:
Check for abnormal, out-of-bounds or ambiguous items.
Strip out unwanted fields or items. Some attributes are useless for analysis
purposes, such as version numbers, email addresses, etc.
Resolve inconsistent data formats, data encodings, geographical spellings,
abbreviations and punctuation.
2. Data description:
Supply metadata such as row counts, value counts or variables.
3. Data transformation:
Convert string variables into numeric or numeric categorical variables, or
interpret or replace codes with text.
Check for missing values. Delete them or replace them by default values.
Add computed fields as input or target.
Combine data from multiple sources under a common code.
Identify fields that are used multiple times.
Convert continuous variables into categorical variables for some methods.
Convert nominal data into metric data.
4. Data sampling:39
Required for training or model building.
5. Data pruning:
Identify dependent, independent and correlated columns or variables.
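The cleaning and transformation stages above can be illustrated with a minimal
Python sketch. The customer records, the field names, the age bounds (0 to 120),
the age cut-off of 40 and the mean imputation are all assumptions made for this
illustration; they are not taken from the XploRe customer data.

```python
# Hypothetical raw customer records; all field names and values are invented.
RAW = [
    {"id": "1", "age": "34", "gender": "F", "city": "berlin ", "email": "a@example.com"},
    {"id": "2", "age": "", "gender": "m", "city": "Berlin", "email": "b@example.com"},
    {"id": "3", "age": "210", "gender": "M", "city": "BERLIN", "email": "c@example.com"},
    {"id": "4", "age": "58", "gender": "f", "city": "Potsdam", "email": "d@example.com"},
]


def clean(record):
    """Data cleaning: strip unwanted fields, normalise formats, drop out-of-bounds values."""
    rec = {k: v for k, v in record.items() if k != "email"}  # field not needed for analysis
    rec["city"] = rec["city"].strip().title()                # resolve inconsistent spellings
    rec["gender"] = rec["gender"].strip().upper()            # resolve inconsistent encoding
    age = int(rec["age"]) if rec["age"].strip() else None
    # Treat an out-of-bounds age as a missing value (assumed bounds: 0 to 120).
    rec["age"] = age if age is not None and 0 < age < 120 else None
    return rec


def transform(records):
    """Data transformation: replace missing ages, bin age, encode gender numerically."""
    known = [r["age"] for r in records if r["age"] is not None]
    fill = sum(known) / len(known)  # assumed default value: mean of the known ages
    prepared = []
    for r in records:
        age = r["age"] if r["age"] is not None else fill
        prepared.append({
            "id": r["id"],
            "city": r["city"],
            "age": age,
            "age_group": "under 40" if age < 40 else "40 and over",  # continuous -> category
            "gender_code": {"F": 0, "M": 1}[r["gender"]],            # string -> numeric code
        })
    return prepared


prepared = transform([clean(r) for r in RAW])
for row in prepared:
    print(row)
```

Replacing missing values by the mean is only one possible default; deleting the
affected records, as the list above also allows, would be an equally valid choice.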
2.1.3 Mining
At the mining stage, various techniques can be used to extract valuable
information from the prepared data. For example, suppose the aim is to create an
accurate, symbolic classification model to predict whether a reader will continue
to subscribe to a newspaper. First, a clustering technique is used to segment
the subscriber database; then rule induction is used to create a classification
model automatically for each desired cluster, through which one can predict the
behaviour of a customer.
2.1.4 Result Interpretation
Result interpretation means not only visualising (graphically or logically) the
output of data mining, but also filtering the information and identifying the
most valuable and appropriate results, which will help in decision making. If
the interpreted result is not satisfactory, the data mining stage or even the
whole data mining procedure should be repeated. The final extracted information
must be comprehensible.
2.2 The Aspects of Data Mining
Data mining can be considered from the aspects of applications, operations,
techniques and algorithms.40 41
39 Ferguson, Mike
40 WWW4
41 IBM's Data Mining Technology, 1996.
Applications   Database marketing
               Customer segmentation
               Customer retention
               Fraud detection
               Credit checking
               Web site analysis
Operations     Prediction and classification modelling
               Link analysis
               Database segmentation
               Deviation detection
Techniques     Supervised induction
               Clustering
               Association discovery
               Sequence discovery
Tab. 2.1: The aspects of data mining
2.2.1 Applications
Data mining is widely used in customer analysis and marketing. The following
areas cover the main applications of data mining.42
Customer segmentation: Data mining tools automate the process of finding
predictive information in large databases. Companies, especially retailers and
banks, are interested in knowing whether there are sub-groups of customers who
exhibit certain characteristics. They can use data mining to cluster the
customers and discover interesting groups. For example, companies use data
mining to analyse historical mailing lists in order to find the groups with a
high return on investment, so that they can determine the new mailing target
groups. Banks and credit companies classify credit scores to identify the
customer segments that have lower risks.
Relationship management: Data mining discovers and identifies previously
unknown relationships hidden in the data. The buying patterns of customers are
of interest to retailers and advertisers. Combined with customer segmentation,
data mining can help them to find the relationships between the purchase of
product items and customer types, or to improve the conduct of an advertising
campaign in particular media for a specific group of customers.
42Carbone, Patricia L.
2.2.2 Operations
Predictive and classification modelling: A predictive model uses the contents of
a database, which reflect historical data, to automatically generate a model
that can predict future behaviour. Classification sub-divides a data set
according to a number of special outcomes. The goal of the modelling operation
is to create a generalised description of the characteristics of the data. For
instance, a marketing executive may be interested in predicting whether a
particular consumer will switch to a new product.
Link analysis: The goal of link analysis is to establish the relationships
between the records in a database. Retailers want to know which items will be
purchased together by a customer in order to make decisions on item layout and
goods purchasing. For instance, if it is found that customers buy a CD after
purchasing a CD player, then the store manager should decide to put the CD
counter close to the CD player counter.
Database segmentation: A database often contains various types of data, so that
it is often necessary to segment the data into small groups of related records.
The purpose can be either to obtain a general description of each collection or
to prepare for a further analysis, such as model creation or link analysis.
Suppose the store manager wants to know the combination of goods purchased by
customers in a particular period. The database can first be segmented according
to a time period attribute, such as the Christmas sale. Link analysis can then
be conducted to find the relationships between the combined goods.
Deviation detection: The aim of deviation detection is to identify the outliers
in a particular data set, whether their presence is due to noise, impurities or
causal reasons. This operation is the opposite of database segmentation and is
often carried out together with it. Because outliers express a deviation from
some known expectation or norm, deviation detection is often the source of true
discovery.
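As a small illustration of deviation detection, the following Python sketch
flags values that lie far from the mean of a data set. The z-score rule, the
threshold of two standard deviations and the weekly spending figures are
assumptions chosen for this example; the thesis does not prescribe a particular
statistical technique here.

```python
from statistics import mean, pstdev


def deviations(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:          # all values identical: nothing deviates
        return []
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical weekly spending of one customer; the last value is unusual.
weekly_spend = [52, 48, 50, 51, 49, 50, 120]
outliers = deviations(weekly_spend)
print(outliers)
```

Whether a flagged record is noise or a genuine discovery still has to be decided
by the analyst.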
2.2.3 Data Mining Techniques
Numerous techniques support the operations of data mining in finding the desired
groups or relationships.
Classification and predictive modelling is supported by supervised induction
techniques. Clustering supports database segmentation. Association discovery and
sequence discovery are used for link analysis. Deviation detection is supported
by statistical techniques.
The desired relationships to be discovered by data mining are:43
Classes: the data items are assigned to predetermined groups.
Clusters: the data items are grouped by logical relationships.
Associations: the data is mined to identify associations.
Sequential patterns: the data is mined to anticipate behaviour patterns and
trends.
Supervised Induction
Supervised induction is the process of automatically creating a classification
model from a set of records (examples),44 which is called the training set. The
records in the training set must belong to a set of pre-defined classes. Each
class has a distinguishable pattern, which is generated from the existing
records. Once the model has been induced, a new record can automatically be put
into a class according to its pattern.
Supervised induction contains the steps of classification and prediction, which
put elements into predetermined groups according to some criterion. The number
of subgroups and the features of each subgroup are defined at the beginning.
Then the features of an observation are compared with the criterion and the
observation is put into the corresponding group.45 This is usually done in two
steps:
Step 1: Build a model to describe the predetermined groups or classes of the
data set. The model contains a set of classification rules (labels).
Step 2: If the accuracy of the model or classifier is acceptable, the model can
be used to classify new unlabelled data groups or elements.
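The two steps can be sketched in Python as follows. The sketch induces a very
simple rule set that uses a single attribute (a "one rule" classifier), which is
only one of many possible supervised induction techniques and is chosen here for
brevity. The subscriber records, the attribute names and the accuracy threshold
of 0.8 are hypothetical.

```python
from collections import Counter, defaultdict


def induce_model(training_set, target):
    """Step 1: induce classification rules (attribute value -> class) from labelled records."""
    best = None
    for attr in (a for a in training_set[0] if a != target):
        counts = defaultdict(Counter)
        for rec in training_set:
            counts[rec[attr]][rec[target]] += 1
        # For every attribute value, predict the most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rules[rec[attr]] != rec[target] for rec in training_set)
        if best is None or errors < best[0]:
            best = (errors, attr, rules)  # keep the attribute with the fewest errors
    default = Counter(rec[target] for rec in training_set).most_common(1)[0][0]
    return {"attribute": best[1], "rules": best[2], "default": default}


def classify(model, record):
    """Put a record into a class according to the induced rules."""
    return model["rules"].get(record[model["attribute"]], model["default"])


def accuracy(model, labelled, target):
    """Share of labelled records that the model classifies correctly."""
    return sum(classify(model, rec) == rec[target] for rec in labelled) / len(labelled)


# Hypothetical training and test sets of newspaper subscribers.
TRAIN = [
    {"usage": "heavy", "region": "north", "renews": "yes"},
    {"usage": "heavy", "region": "south", "renews": "yes"},
    {"usage": "light", "region": "north", "renews": "no"},
    {"usage": "light", "region": "south", "renews": "no"},
    {"usage": "heavy", "region": "north", "renews": "yes"},
    {"usage": "light", "region": "north", "renews": "yes"},
]
TEST = [
    {"usage": "heavy", "region": "south", "renews": "yes"},
    {"usage": "light", "region": "south", "renews": "no"},
]

model = induce_model(TRAIN, "renews")
prediction = None
# Step 2: use the model on unlabelled records only if its accuracy is acceptable.
if accuracy(model, TEST, "renews") >= 0.8:
    prediction = classify(model, {"usage": "light", "region": "north"})
print(model["attribute"], prediction)
```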
Clustering
Clustering is a method of grouping data elements into homogeneous groups. It
divides a heterogeneous data set into disjoint sub-groups, so that the elements
in any one cluster are highly similar, while the elements in different clusters
are highly dissimilar according to some criterion. Clustering is an unsupervised
technique and is employed when you want to find groups of similar records
without any preconditions. The difference between clustering and classification
is that in clustering, the number of subgroups and the features (labels) of each
subgroup are unknown in advance, while in classification, the number of
subgroups and the features of each subgroup are defined at the beginning.
43 Chung, H. M., Gray, P. and Manino, M., 1998.
44 IBM's Data Mining Technology, 1996.
45 Han, J. and Kamber, M., 2001, P279-325.
Cluster analysis has two steps:46
Choose a proximity measure: A proximity measure determines the similarity or
closeness of objects. Homogeneous objects are more similar and closer.
Choose a clustering strategy: In this step, the clustering algorithm and/or
initial parameters are decided. According to the chosen proximity measure and
method, the whole data set is divided into groups (clusters). The elements
within a group should be as close as possible and the dissimilarity between
groups should be as large as possible.
After the clusters are built, descriptive methods are normally employed to
describe each cluster in order to get a comprehensive overview of the
dissimilarity between clusters.
1. Proximity measure
The commonly used proximity measures include the Jaccard, Tanimoto, Simple
Matching and Kulczynski coefficients and the Minkowski and Euclidean distances.
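Some of these measures can be written down in a few lines of Python. The sketch
below covers the Minkowski distance (which gives the Euclidean distance for
p = 2) for metric data, and the Jaccard and Simple Matching coefficients for
binary data. The function names and the example vectors are chosen for this
illustration only, and the Jaccard function assumes that the two vectors are not
both entirely zero.

```python
def minkowski(x, y, p=2):
    """Minkowski distance between two numeric vectors; p=2 is the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def binary_counts(x, y):
    """Count the four possible combinations of two binary vectors."""
    a1 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # attribute present in both
    a2 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # present in x only
    a3 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # present in y only
    a4 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # absent in both
    return a1, a2, a3, a4


def jaccard(x, y):
    """Jaccard coefficient: joint presences relative to all positions with a presence."""
    a1, a2, a3, _ = binary_counts(x, y)
    return a1 / (a1 + a2 + a3)


def simple_matching(x, y):
    """Simple Matching coefficient: share of positions on which the vectors agree."""
    a1, a2, a3, a4 = binary_counts(x, y)
    return (a1 + a4) / (a1 + a2 + a3 + a4)


print(minkowski((0, 0), (3, 4)))        # Euclidean distance
print(minkowski((0, 0), (3, 4), p=1))   # city-block distance
print(jaccard([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
print(simple_matching([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```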
2. Clustering strategy (method)
The clustering methods generally belong to several major families:47
1. Hierarchical algorithms
2. Iterative partitioning
3. Density search
4. Factor analytic
5. Clumping
6. Graph theoretic
Here we discuss only two basic clustering methods: hierarchical algorithms and
iterative partitioning algorithms.
46 Hardle, W. and Simar, L., P295-313.
47 Aldenderfer M. S. and Blashfield, R. K., P35.
(1) Hierarchical algorithms
Hierarchical clustering is composed of two main types of procedure: the
agglomerative procedure and the splitting procedure.
Agglomerative procedure: starts from the finest partition. It considers each
observation as a cluster, then puts groups together to form new clusters. At
each stage of the procedure, the number of clusters is reduced by one by joining
or fusing the two groups that are considered to be the closest or most similar.
The agglomerative algorithm is a frequently used procedure. It contains the
following steps:48 49
1. Construct the finest partition. Normally each observation is a group.
2. Compute the distance or dissimilarity matrix.
3. Find the closest or most similar groups.
4. Put the two most similar groups together to form a cluster.
5. Compute the distance or dissimilarity between the new groups to obtain a
reduced distance or dissimilarity matrix.
6. Repeat steps 3 to 5 until the optimal clusters are formed.
Splitting procedure: is the opposite of the agglomerative procedure. It starts
with the whole data set as one cluster, then splits the cluster into subgroups
to form new clusters.
The linkage for the agglomerative algorithm: There are many linkages to measure
the proximity or similarity of elements and groups. The frequently used linkages
are:
Single linkage defines the smallest distance between individuals as the distance
between two groups.
Complete linkage is the opposite of single linkage; it defines the largest
distance between individuals as the distance between two groups.
Average linkage (non-weighted and weighted) computes the average distance.
Centroid linkage uses the natural geometrical distance between the group
centroids as the distance between groups.
Median linkage chooses the median of the individual distances as the distance
between groups.
Ward linkage is related to centroid linkage, but uses an inertia distance rather
than a geometric distance.
48 Mardia, K.V., Kent, J.T. and Bibby, J.M., 1979, P360-390.
49 Everitt, B. S. and Dunn, G., 1991, P99-126.
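The agglomerative steps can be sketched in Python for the single, complete and
(non-weighted) average linkages. This is a simplified illustration under several
assumptions: the Euclidean distance is used as the proximity measure, the number
of clusters at which merging stops is supplied by the user instead of being
determined as "optimal", and the group distances are recomputed from the raw
points in every pass rather than by updating a reduced distance matrix.

```python
import math
from itertools import combinations

# Each linkage turns the list of pairwise individual distances into one group distance.
LINKAGES = {
    "single": min,                             # smallest individual distance
    "complete": max,                           # largest individual distance
    "average": lambda ds: sum(ds) / len(ds),   # non-weighted average distance
}


def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest groups repeatedly until n_clusters groups remain."""
    link = LINKAGES[linkage]
    clusters = [[i] for i in range(len(points))]  # step 1: every observation is a group

    def group_distance(pair):
        a, b = pair
        return link([math.dist(points[i], points[j])
                     for i in clusters[a] for j in clusters[b]])

    while len(clusters) > n_clusters:
        # steps 2 and 3: find the pair of groups with the smallest group distance
        a, b = min(combinations(range(len(clusters)), 2), key=group_distance)
        # step 4: fuse the two groups; the loop then repeats with one group fewer
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Hypothetical two-dimensional observations.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerate(points, 3, "single"))
print(agglomerate(points, 3, "complete"))
```

The result lists the indices of the observations in each cluster. For realistic
data sizes a library implementation that maintains the reduced distance matrix
would be preferable to this quadratic recomputation.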
(2) Iterative Partitioning algorithms
Partitioning algorithms start with given groups. Elements are then exchanged
between groups until the highest homogeneity within groups and the highest
heterogeneity between groups, or some other criterion, is reached.
The iterative partitioning algorithms normally proceed according to the
following steps:50
1. Begin with an initial partition into a chosen number of clusters. Compute
the centroids of these clusters.
2. Allocate each data point to the cluster that has the closest centroid.
3. Compute the new centroids of the new clusters. The clusters are not updated
until a complete pass through the data has been made.
4. Repeat steps (2) and (3) until no data points change clusters and the
highest similarity inside the clusters is reached.
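These steps correspond closely to what is commonly called the k-means procedure.
The following Python sketch follows them under a few assumptions: the Euclidean
distance is used, the initial partition is supplied as a list of cluster labels
in which every cluster has at least one member, and a maximum number of passes
(`max_iter`, a name chosen for this illustration) guards against endless loops.

```python
import math


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(members) for coord in zip(*members))


def iterative_partitioning(points, initial_labels, max_iter=100):
    """Reallocate points to the closest centroid until no point changes cluster."""
    labels = list(initial_labels)
    k = max(labels) + 1
    centroids = [None] * k
    for _ in range(max_iter):
        # steps 1 and 3: compute the centroids of the current clusters
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # step 2: one complete pass allocating each point to the closest centroid
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # step 4: stop when no point changes cluster
            break
        labels = new_labels
    return labels, centroids


# Hypothetical observations and a deliberately poor initial partition.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = iterative_partitioning(points, [0, 1, 0, 1, 0, 1])
print(labels, centroids)
```

The final partition generally depends on the initial one, so in practice the
procedure is often run from several starting partitions.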
Association rule discovery
Association rule discovery is an iterative approach, also known as level-wise
search. Association rule methods try to discover interesting relationships
between the items in the data and to identify the customers' behaviour patterns.
A typical example of association rules is market basket analysis. This analysis
tries to find out which kinds of products are more likely to be put into the
shopping basket together when customers go shopping. Through this analysis,
retailers are able to identify which items are frequently purchased together by
the customers.
An association rule is a relationship of the form $X \Rightarrow Y$, where X is
the antecedent item set and Y is the consequent item set. For example: customers
who purchase item X are very likely also to purchase item Y at the same time.51
50 Aldenderfer M. S. and Blashfield, R. K., P45-49.
There are two measures for each rule: support and confidence.52
Support (or prevalence) indicates the occurrence frequency of an itemset:
$s(A \Rightarrow B) = P(A \cup B)$, that is, the proportion of transactions that
contain both A and B.
Confidence (certainty or predictability) measures the validity of the pattern:
$c(A \Rightarrow B) = P(B \mid A)$. It denotes the strength of the relationship
between the items, and to what degree an item depends on the others.
For example: among all customers, only 5% are students who buy a computer. But
if a customer is a student, the probability of his or her buying a computer is
20%. For this rule, 5% is the support and 20% is the confidence.
Two other important measures for association rule discovery are:
Expected confidence - the probability that an item is purchased regardless of
what other items have been bought together with it. For instance, if customers
buy a computer 40% of the time, 40% is the expected confidence.
Lift - the difference between the confidence of a rule and the expected
confidence, expressed either as an absolute difference or as a ratio. When
lift is negative (difference form) or less than one (ratio form), the items of
the rule are unlikely to occur together, i.e. the two products are unlikely to
be purchased at the same time.
The goal of association discovery is to find all the associations with at least
s% support and c% confidence in the transaction data.
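To make the four measures concrete, the following sketch computes them for a rule X => Y on a handful of invented transactions. The function name `rule_measures` and the toy baskets are assumptions for illustration only; lift is computed in its ratio form.

```python
def rule_measures(transactions, antecedent, consequent):
    """Support, confidence, expected confidence and lift (ratio form) of X => Y."""
    x, y = set(antecedent), set(consequent)
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= set(t))         # contain X
    n_y = sum(1 for t in transactions if y <= set(t))         # contain Y
    n_xy = sum(1 for t in transactions if (x | y) <= set(t))  # contain X and Y
    support = n_xy / n
    confidence = n_xy / n_x if n_x else 0.0
    expected_confidence = n_y / n
    lift = confidence / expected_confidence if expected_confidence else 0.0
    return {
        "support": support,
        "confidence": confidence,
        "expected_confidence": expected_confidence,
        "lift": lift,
    }


# Invented toy baskets, not data from this thesis.
baskets = [
    {"computer", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "printer"},
    {"paper"},
]
measures = rule_measures(baskets, {"computer"}, {"printer"})
```

A lift above one in this ratio form indicates that the consequent is bought more often together with the antecedent than it is bought overall.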
1. Data format
Two types of format are used to organise the data for association discovery:
1. Horizontal format: each entry is a row, each attribute is a column.
2. Vertical format: there is only one column for attributes. Different entries
are denoted by different IDs. Attributes belonging to the same entry are
assigned the same ID number.
51 Kotala, P. K, Perera, A., Kai Zhou, J., etc., 2001
52 WWW4
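The two formats carry the same information and can be converted into each other. The sketch below assumes that the horizontal format stores a 0/1 flag per attribute column and that the vertical format is a list of (ID, attribute) pairs; the function names and the toy entries are hypothetical.

```python
def horizontal_to_vertical(rows, columns):
    """rows: {entry_id: [0/1 flag per column]} -> list of (entry_id, attribute)."""
    return [
        (entry_id, col)
        for entry_id, flags in rows.items()
        for col, flag in zip(columns, flags)
        if flag
    ]


def vertical_to_horizontal(pairs, columns):
    """pairs: list of (entry_id, attribute) -> {entry_id: [0/1 flag per column]}."""
    rows = {}
    for entry_id, attribute in pairs:
        # Entries sharing the same ID end up in the same row.
        rows.setdefault(entry_id, [0] * len(columns))[columns.index(attribute)] = 1
    return rows


# Invented toy entries.
columns = ["bread", "milk", "beer"]
horizontal = {1: [1, 1, 0], 2: [0, 1, 1]}
vertical = horizontal_to_vertical(horizontal, columns)
```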
2. Apriori Algorithm
The most often used algorithm for association rule discovery is the Apriori
algorithm. It uses prior knowledge of itemset features to explore their further
associations. The steps are as follows:
Step 1: Set the minimal percentages of support and confidence as s% and c%.
Step 2: Find all the items with a frequency percentage above the set minimal support.
Step 3: Based on the set of frequent items, generate the associations that have the set confidence level or higher.
Step 4: Scan all the items to identify those which have at least s% support. Assign them as $L_1$.
Step 5: Form item pairs from $L_1$; assign this candidate set as $C_2$.
Step 6: Scan all the item pairs to find all the pairs in $C_2$ with at least s% support and c% confidence. Denote these sets as $L_2$.
Step 7: Iteration: repeat Step 5 and Step 6 until there are no more sets satisfying the constraints.
The general description for Step 5 and Step 6 is:
Build sets of k items from $L_{k-1}$; let this be $C_k$.
Scan all transactions and find all frequent sets in $C_k$ with at least s% support and c% confidence level; let this be $L_k$.
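The level-wise search can be sketched as follows. Note one assumption that differs slightly from the step list above: the sketch follows the common formulation in which the frequent itemsets are found using the support threshold only, and the confidence threshold is applied afterwards when rules are generated. The function names and the toy baskets are invented for illustration.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L_1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # C_k: join sets from L_{k-1}; keep candidates whose (k-1)-subsets are all frequent.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # L_k: candidates that reach the minimal support.
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


def generate_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) for rules above the threshold."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                antecedent = frozenset(antecedent)
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), sup, confidence))
    return rules


# Invented toy baskets.
baskets = [{"bread", "milk"}, {"bread", "beer"}, {"bread", "milk", "beer"}, {"milk"}]
frequent_sets = apriori_frequent_itemsets(baskets, min_support=0.5)
strong_rules = generate_rules(frequent_sets, min_confidence=0.9)
```

The pruning of candidates relies on the fact that every subset of a frequent itemset must itself be frequent, which is the prior knowledge the algorithm is named after.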
Sequential pattern discovery
Sequential pattern methods can be seen as an extension of the association rule
method that analyses sequenced data. They extend association by adding time to
the transactions: for each transaction, there is a transaction time. Therefore,
not only the attributes of each transaction but also the time when the
transaction took place should be taken into account. Sequential analysis
searches for temporal links between items, rather than relationships
between items in a single transaction.53
The sequential pattern method can find relationship patterns between
items or itemsets within a time episode. For example, a typical sequence pattern
could be "Six percent of customers who bought a CD player bought a CD within
a week."
1. Data format
To start a sequential pattern discovery, each time series is converted into a
multi-item entry and duplicated items are deleted. Afterwards, the association
rule method can be used. The constraint is that all sequential patterns must
satisfy the user-specified minimal support.
The sequential data is composed of sequences, or customer sequences. Each
sequence is a list of customer orders (transactions), and each transaction
contains a set of items. The length of a sequence is the number of itemsets
contained in it. A sequence of length k is called a k-sequence.
2. Procedure
Sequential pattern discovery can be conducted using the following steps:54
Step 1: Sort phase. Sort the database according to customer id and transaction id.
Step 2: Itemset phase. Find all large sequences of length 1.
Step 3: Transformation phase. Transform each item in the sequence into an integer.
Step 4: Sequence phase. Find all large sequences.
Step 5: Maximal phase. Delete all non-maximal sequences.
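Two building blocks of this procedure, the sort phase and the support count of a candidate sequence, can be sketched as below. The record layout (customer id, transaction time, items), the function names and the toy records are assumptions for illustration; the sketch does not implement the full five-phase procedure.

```python
from collections import defaultdict


def customer_sequences(records):
    """Sort phase: group (customer_id, transaction_time, items) records per customer."""
    sequences = defaultdict(list)
    for customer, _, items in sorted(records, key=lambda r: (r[0], r[1])):
        sequences[customer].append(frozenset(items))
    return dict(sequences)


def contains(sequence, pattern):
    """True if the pattern's itemsets occur in order as subsets of the sequence's itemsets."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def sequence_support(sequences, pattern):
    """Share of customers whose sequence contains the pattern."""
    pattern = [frozenset(p) for p in pattern]
    return sum(contains(s, pattern) for s in sequences.values()) / len(sequences)


# Invented toy records: (customer id, transaction time, items bought).
records = [
    (1, 2, {"CD"}),
    (1, 1, {"CD player"}),
    (2, 1, {"CD player"}),
    (2, 2, {"book"}),
    (3, 1, {"CD"}),
]
seqs = customer_sequences(records)
support_player_then_cd = sequence_support(seqs, [{"CD player"}, {"CD"}])
```

A sequence is called large when this support reaches the specified minimal support.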
53 Wojciechowski, Marek
54 Han, J and Kamber M, 2001, P225-271.
-
3. XploRe user and customer analysis55
3.1 About XploRe
XploRe is a professional statistical software package for high-end statistical
analysis, advanced research and interactive teaching. It was developed in 1999
by Prof. Wolfgang Härdle and his team at Humboldt University of Berlin, Germany.
XploRe is a modular, command-driven software. Its statistical methods are
supported by various libraries. Therefore, users can incorporate their own
methods into XploRe and easily extend the environment. The competitive
advantage of XploRe lies in its rather advanced methods, particularly smoothing.
The purpose of XploRe lies in the exploration and analysis of data. According to
Prof. Härdle (1999), it aims at sophisticated users who are looking for a flexible,
programmable statistical package with emphasis on more advanced procedures.
The Internet is currently the main marketing instrument of XploRe. A free trial
version of XploRe (with limitations) can be downloaded from the Internet.
3.2 XploRe user (2002) and customer descriptive analysis
3.2.1 Data collection
XploRe user data collection
XploRe users refer to the XploRe downloaders, who have downloaded XploRe
from the website. They are the potential customers of XploRe.
The collected raw data of XploRe users consists of 1734 profiles of individuals
who downloaded the statistical software XploRe from October 11, 2001 to
July 22, 2002. The data was collected through an online survey. A free trial
version of XploRe could be downloaded via the homepage http://www.xplore-stat.de.
The trial versions of XploRe (except for the Linux local version) do not include all
functions and commands of XploRe, expire after two months, and are limited
to 1000 observations. The Linux local version has no expiration date and no limit
on the number of observations.
55 User refers to the person who downloaded XploRe from the Internet, while Customer refers to the person who bought XploRe.
Fig. 3.1: Sample of online survey questionnaire.
Before downloading, users are asked to participate in an online survey.
The online questionnaire mainly has two parts. All questions (except
for the e-mail address) are answered by selecting items from a set of possible
responses.
The first part of the questionnaire is Personal information, in which
information about personal identity and preferences is requested. Some questions
in this part, such as e-mail address and country, ask for the identity of the
downloader. We call them Identity questions. The other questions
inquire about the preferences of downloaders, such as the way they learnt about
XploRe, the work place where they use XploRe, the software they currently use,
and the statistical methods they look for in XploRe. The answers to these
questions are important for revealing the preferences of users and play a
prominent role in the user analysis. We call these questions substantive
questions, because they provide the basic factors needed to subdivide the total
user group into small homogeneous groups for our statistical user analysis.
The second part of the questionnaire contains technical questions. The
downloaders are asked to choose the preferred version of XploRe56 and the
operating system on which XploRe will be installed, such as Windows, Linux, Sun
etc. An example questionnaire is attached in the Appendix.
During downloading, the date and IP address are automatically recorded. They
are very helpful in the data cleaning procedure.
XploRe Customer data collection
XploRe customer here refers to someone who has actually bought XploRe. I
also call them actual customers. The data of XploRe customers is collected
through registration forms, which are sent to the customer together with XploRe.
Returning the registration form is not compulsory. The customer data covers the
period from 1 July 2000 to 30 August 2002. Because the registration form was
changed, the data after this period was not used. The new registration form is
attached in the Appendix for reference.
The registration form includes questions about the identity of the customer,
such as country and language, questions about their professional fields, and
questions about their operating systems.
As a result, we get 8 variables of customer data: country, federal state
(Germany), language, title, operating system, profile sector, profile branch and sex.
3.2.2 Data cleaning and preparation
An analysis based on poor-quality or wrong data can deliver erroneous results
no matter how sophisticated the statistical method is. Therefore, the raw data
were thoroughly cleaned before being used for analysis.
XploRe user data cleaning
When people download XploRe, they obviously would like to complete the download
process as quickly as possible and answer the questions as promptly as possible.
If the questionnaire is too tedious or too complicated, downloaders may get
impatient and give wrong or incomplete answers. In addition, it often happens
in surveys that respondents are not very serious about their answers and
do not give accurate information.
56 XploRe has three versions: Local version, Java-Client version and ReX, which is an Excel add-in.
To avoid including false information in the data, I used the personal
questions as indicators of how seriously the questionnaire was taken and of the
likelihood of false answers. Many people gave obviously wrong answers to the
personal questions. I assume that people who gave false answers to the personal
questions would give false answers to the substantive questions as well.
Furthermore, based on the recorded IP addresses, suspicious observations were
inspected and then deleted according to a set of criteria.
The cleaning process was carried out mainly automatically with the Excel Visual Editor.
However, the whole process of data cleaning could hardly be fully
automated. Therefore, manual cleaning was also carried out to delete
false information that the computer program could not identify, for instance
by matching IP addresses and deleting the profiles of members of the XploRe
team. In the end, 1181 profiles remained for analysis after the cleaning.
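The automatic part of such a cleaning step could look roughly like the sketch below. It is not the procedure used in this thesis, which was carried out in Excel and whose exact criteria are not listed here. The three rules (an implausible e-mail address, a team e-mail domain, a repeated IP address), the field names and the domain `xplore-team.example` are all hypothetical.

```python
import re

# Deliberately loose, illustrative plausibility check for an e-mail address.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_profiles(profiles, team_domain="xplore-team.example"):
    """Keep profiles with a plausible e-mail, outside the team domain, first per IP."""
    seen_ips = set()
    kept = []
    for profile in profiles:
        email = profile.get("email", "").strip().lower()
        if not EMAIL_RE.match(email):
            continue  # obviously wrong answer to an identity question
        if email.endswith("@" + team_domain):
            continue  # hypothetical rule: drop profiles of team members
        if profile["ip"] in seen_ips:
            continue  # hypothetical rule: keep only the first profile per IP address
        seen_ips.add(profile["ip"])
        kept.append(profile)
    return kept


# Invented toy profiles.
raw_profiles = [
    {"email": "anna@uni.example", "ip": "10.0.0.1"},
    {"email": "asdf", "ip": "10.0.0.2"},
    {"email": "bob@uni.example", "ip": "10.0.0.1"},
    {"email": "dev@xplore-team.example", "ip": "10.0.0.3"},
]
cleaned_profiles = clean_profiles(raw_profiles)
```

Cases that such simple rules cannot decide would still need the manual inspection described above.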
XploRe customer data cleaning
The cleaning procedure for the customer data is relatively simple. We suppose that
customers know their answers will help XploRe to improve its service and
therefore intend to provide correct information. The cleaning process, therefore,
only includes the deletion of duplicate customer information.
3.2.3 Data descriptive analysis and result
In the first step, the descriptive analysis was conducted with XploRe to give an
overview of the data.
XploRe User descriptive analysis
From the Table in Appendix 1, XploRe user frequency analysis, we can see the
frequency and percentage of each variable.
Concerning the sources through which users got to know XploRe, WWW/Newsgroup is
the main source: 42.9% of the downloaders first learnt about XploRe through the
Internet. The second main source is Publications and Journals; 20% of the users
got to know XploRe through these channels.
49.4% of users work in a university, and 9.1% of users work in a research institute.
Users from a Private, Non-research Company account for 6.6%. Interestingly,
a high percentage of users work at home: with 28.9% of
the users, this group is the second biggest in this category.
Excel is the most popular software, used by 25.1% of all users. Next
are SPSS and MatLab, with 11.2% and 10.4% of users respectively. XploRe
is a command-driven software that is competitive in rather advanced statistical
methods. Software such as S-Plus and GAUSS is more similar to XploRe in
features and scope; their users comprise 5.5% and 4% of the total respectively.
This suggests that most users are more likely to choose more standard software
such as Excel and SPSS, because of the higher programming requirements and
the difficulties of using a programmable, matrix-oriented software like XploRe.
But the relatively high percentage of MatLab users signals an opportunity for
XploRe, because MatLab is also a program-oriented software. There is a chance for
XploRe marketing to win this type of customer.
A large part of XploRe users (24.1%) work in the field of Econometrics. The other
popular work fields are Mathematical Statistics, Finance and actuarial science, and
Physics and engineering. Each accounts for about 10% of users.
The statistical methods most often used in the users' work are
Time series, followed by Basic statistics, Multivariate methods and Linear models.
But regarding the methods that the users look for in XploRe, there are some
differences. The most wanted statistical methods are Time series and Multivariate
methods, while Non- and semiparametric methods and Graphics and exploratory
data analysis are ranked as the third and fourth most wanted methods, respectively.
This difference suggests that the existing statistical software is weak at Non-
and Semiparametric methods and Graphic/Exploratory methods. Therefore, the
users try to discover more powerful instruments for these two methods.
XploRe could emphasise its strength in these two analysis methods and thus expand
its customer base.
86.5% of users downloaded the local version of XploRe, and 9.3% downloaded the ReX
version of XploRe, which is a statistical Microsoft Excel 2000 add-in. Only 4.1%
of users downloaded the XploRe Java-Client version.
Windows NT is the dominant platform of the local version with 84.1% of users. Linux
is also relatively popular; 13.2% of users downloaded the XploRe Linux version.
Concerning the Client version, Windows NT is still the dominant platform. Linux
only accounts for 6.1%. Other platforms account for very small fractions.
Name               Type         Modal Value      Modal Freq.  No. of Values
First Learn        Categorical  WWW, Newsgroup   42.9%        5
Work Place         Categorical  University       49.4%        6
Software           Categorical  Excel            25.1%        17
Work Field         Categorical  Econometrics     24.1%        10
Method Used        Categorical  Time Series      18.7%        12
Method Looked for  Categorical  Time Series      17.3%        12
Xversion           Categorical  Local            86.5%        3
Platform L         Categorical  Windows NT       84.1%        4
Platform C         Categorical  Windows NT       87.8%        4
OS Platform        Categorical  Windows NT       84.2%        4
Country            Categorical  Germany          16.9%        77
Continent          Categorical  Europe           52.7%        4
Tab. 3.1: Summary and description of the variables of the user data (22/07/02)
XploRe users come from various national backgrounds. Users from Germany
(16.9%), USA (15.7%) and Japan (8.6%) together make up about 41% of the population.
More than half of the users, 52.7%, are from Europe, followed by America and
Asia-Pacific with 24.5% and 20.5% respectively. The reason might be that
XploRe originates from Germany, so information and marketing are more active
in Europe than in other areas.
Since the variables are categorical, we can draw a picture of the typical user of
XploRe from the modal values. The modal user of XploRe is someone who is from
Germany, works in a university and learnt about XploRe through the Internet. He
uses Excel as his main statistical software, and he works in the field of
econometrics. Time series are his main analysis method, and he looks for software
that performs better in Time series methods. He downloads the local version of
XploRe, and Windows NT is his platform.
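The three summary figures reported per variable (modal value, modal frequency and number of distinct values) can be obtained for any categorical variable as in the following sketch. The function name `modal_summary` and the toy answers are assumptions for illustration and are not the survey data.

```python
from collections import Counter


def modal_summary(values):
    """Return (modal value, modal frequency as a share, number of distinct values)."""
    counts = Counter(values)
    modal_value, modal_count = counts.most_common(1)[0]
    return modal_value, modal_count / len(values), len(counts)


# Invented toy answers to a "work place" question.
work_place = ["University"] * 5 + ["Home"] * 3 + ["Company"] * 2
summary = modal_summary(work_place)
```

If several values share the highest count, this sketch reports only one of them, so ties need to be checked separately.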
XploRe Customer descriptive analysis
The result of the descriptive analysis of XploRe customer is summarised in the
Table of Appendix 2.
The customers of XploRe come mostly from Germany; they make up 34.4%
of the total customers. Customers from the USA are the second biggest group, with
a share of 25%, followed by Japanese customers with 9.4%. Customers
from Italy make up 6.2% of the total. There are also customers from
Denmark, France, Norway, The Netherlands, UK, China and Taiwan; each of these
countries accounts for 3.1% of the customers. Therefore, Europe is the main
customer market of XploRe, followed by America and Asia.
Name           Type         Modal Value         Modal Freq.  Missing value
State          Categorical  Germany             34.4%        3.1%
Federal State  Categorical  Baden-Württemberg   3.1%         84.4%
Sex            Categorical  Man                 21.9%        0.0%
Language       Categorical  English             18.8%        59.4%
Title          Categorical  Prof.               9.4%         78.1%
OS Platform    Categorical  Windows             31.1%        68.8%
Sector         Categorical  Research Institute  34.4%        62.5%
Branch         Categorical  Economics           9.4%         78.1%
Note: 1. Federal state refers to the states of Germany.
2. Federal state has no modal value, because all the values have the
same percentage (3.1%).
Tab. 3.2: Summary and description of the variables for customer data
78.1% of XploRe customers are men. Women have a relatively lower percentage,
only 21.9%. This corresponds to the findings for the XploRe users.
English is the main language used among the customers, followed by German,
French and Italian.
The customers of XploRe are highly educated; 21.8% of them hold the title of
Prof., Dr. or Prof. Dr.
34.4% of customers work in research institutes. 3.1% of them work in companies.
Windows is the most popular platform. 21.3% of the customers use Windows as
their computing platform.
The professional fields, in which the customers work, are diverse. Econometrics
has a higher percentage of 9.4% among them. The other professional fields indi-
cated in the data are statistics, biostatistics, mathema