fuzzy data mining

Customer Analysis for Software XploRe

From Data Mining to Marketing Strategy

Diploma thesis

submitted for the academic degree of

Master of Science

at the Faculty of Economics and Business Administration (Wirtschaftswissenschaftliche Fakultät)

of Humboldt-Universität zu Berlin

Submitted by

Jianqiu Wang

on 27 May 2003

Matriculation no.: 161426

Examiner: Prof. Dr. Wolfgang Härdle

Contents

Abstract

Introduction

1. Customer analysis

1.1 Customer Behaviour

1.1.1 Customer's Black Box

1.1.2 Consumer buying process

1.1.3 Customer behaviour model

1.1.4 Factors influencing customer buying behaviour

1.2 Market Segmentation and Profiling

1.2.1 Market segmentation

1.2.2 Customer profiling

1.3 Market targeting and Positioning

1.3.1 Market Targeting

1.3.2 Positioning

2. Data Mining

2.1 The process of Data mining

2.1.1 Data Collection and Selection

2.1.2 Data Preparation

2.1.3 Mining

2.1.4 Result Interpretation

2.2 The Aspects of Data Mining

2.2.1 Applications

2.2.2 Operations

2.2.3 Data Mining Techniques

3. XploRe user and customer analysis

3.1 About XploRe

3.2 XploRe user (2002) and customer descriptive analysis

3.2.1 Data collection

3.2.2 Data cleaning and preparation

3.2.3 Data descriptive analysis and result

3.2.4 Comparing the user and customer of XploRe

3.2.5 Measures of Improvement

3.3 Cluster analysis for XploRe user data 2002

3.3.1 Cluster analysis of categorical data

3.3.2 Clustering with IBM Intelligent Miner

3.3.3 Cluster analysis with XploRe

3.3.4 Comparison of Cluster Analysis Results: IBM Intelligent Miner versus XploRe

3.4 Analysis of the latest User data (2003)

3.4.1 Results of analysis of 2003 data

3.4.2 Comparison of historical user data

3.5 Complementary analysis

3.5.1 Analysis of regrouped data

3.5.2 Analysis of high profitable sector

4. Suggested marketing strategy for XploRe

4.1 Marketing Strategy and Marketing mix

4.1.1 Marketing strategy

4.1.2 Marketing Mix

4.2 Develop the marketing strategy for XploRe

4.2.1 Niche market strategy

4.2.2 Target Market

4.2.3 Product position of XploRe

4.2.4 General XploRe marketing strategy pyramids

4.2.5 General Marketing Mix

4.2.6 Special marketing mix for clusters

4.2.7 Marketing research - suggestions for further analysis

References

Appendix

Appendix 1: User 220702 Frequency Analysis

Appendix 2: Customer Frequency Analysis (Nov. 05)

Appendix 3: Customer Registration form

Appendix 4: Characteristics of User220702 Clusters by XploRe

Appendix 5: User 130303 Frequency Analysis

Appendix 6: User 13032003 Intelligent Miner Cluster Analysis

Appendix 7: Comparison of User and Regrouped User Data

Appendix 8: User 130303 (Regrouped) Frequency Analysis

Appendix 9: Regrouped User Intelligent Miner Cluster Analysis

Appendix 10: Institute Users Frequency Analysis

Declaration of authorship (Erklärung zur Urheberschaft)

List of Figures

1.1 The customer's black box.

1.2 A sequential model of the buying process

1.3 Consumer Behaviour model.

1.4 Factors influencing consumer behaviour.

1.5 The process of marketing segmentation.

1.6 Alternative consumer demand categories.

1.7 SAGACITY.

1.8 Targeting strategies.

3.1 Sample of online survey questionnaire.

3.2 Clustering of Users 2002.

3.3 Clustering of user 2003.

3.4 Software used in 2000 and 2003.

3.5 Information resource in 2000 and 2003.

3.6 Clustering of regrouped user data.

4.1 4P of marketing mix

List of Tables

1.1 Broad-based ACORN classifications

1.2 National readership survey socio-economic groups

2.1 The aspects of data mining

3.1 Summary and description of the variables of User 22/07/02 data

3.2 Summary and description of the variables for customer data

3.3 Comparison of XploRe's Users and Customers

3.4 Characteristics of User IBM Intelligent Miner Clusters (2002)

3.5 Comparison of Clustering results with IBM Intelligent Miner and XploRe

3.6 Summary and description of the variables for User data 2003

3.7 Comparison of User 220702 and User 130303

3.8 Comparison of software used in 2000 and 2003

3.9 Comparison of information resources in 2000 and 2003

3.10 Comparison of country in 2000 and 2003

3.11 Comparison of continent in 2000 and 2003

3.12 Comparison of User clusters of 2000 and 2003

3.13 Summary and description of the variables of regrouped User data 2003

3.14 Comparison of Institute user and General user

Abstract

This thesis presents a case study of customer analysis with the purpose of developing a marketing strategy for the statistical software XploRe. The customers analysed include the users, who downloaded the free trial version of XploRe from the web site, and the actual customers, who bought XploRe. Descriptive analysis was conducted for both data sets, which led to the conclusion that research institutes are the highly profitable sector for XploRe. For the user data, the data mining method of clustering was applied to identify the customer segments. Two different clustering methods were tested on the same user data set with different software, IBM Intelligent Miner and XploRe. As a result, the users of XploRe were divided into four clusters by both methods: Internet surfer, Academia, Linux user and Home worker. Through the comparison of the historical user data of 2003 and 2000, more facts and trends of the XploRe market and its customers were discovered regarding the software used, information resources, new markets and the ongoing changes in customer segments. Based on the results of the customer analysis, suggestions for marketing strategy, marketing mix and further analysis were outlined.

Key words: customer analysis, market segmentation, data mining, clustering,

marketing strategy, marketing mix

Introduction

Customer analysis is a crucial step in the development of a marketing strategy. Only when a company has a clear view of its customers can the proper strategy and actions be undertaken to gain competitive advantage in the market.

Today, with the development of digital data management systems, the capability of gathering, storing and accessing information has improved dramatically. This trend creates difficulties for companies when they confront huge amounts of data. Data mining is an important technology that lets companies conduct customer analysis on large data sets. It discovers valuable information that is useful for marketing.

The research presented in this paper tries to segment the customers and to find the trends and facts of the XploRe market, so that suggestions for a marketing strategy can be derived from the results. XploRe is a statistical software package which aims at sophisticated users who are looking for a flexible, programmable statistics package with an emphasis on more advanced procedures.[1] It is important for the XploRe marketer to understand its customers and market. The customer data studied here include the data of XploRe users (the potential customers) and actual customers (the buyers). The user data were collected through an online questionnaire preceding the download of the XploRe trial version, while the customer data were gathered through the returned registration forms. For the purpose of comparison, two sets of user data were analysed and two clustering methods were tested with two software packages, IBM Intelligent Miner and XploRe. The user data 2002 cover October 11, 2001 to July 22, 2002 and contain 1734 profiles. The raw user data 2003 contain 2593 profiles collected from October 11, 2002 to March 13, 2003. The customer data include 32 profiles from July 1, 2000 to August 30, 2002.

[1] Härdle, Klinke and Müller, 1999, p. 17.

Only descriptive analysis was carried out for the customer data because of the small number of records. For the user data, the data mining process of clustering was conducted to segment the market. The mining run for the user data consists of several steps: cleaning the raw data with MS Excel, transferring the data to IBM Intelligent Miner or XploRe, and performing cluster analysis. The clustering identified four groups of XploRe customers, namely Internet surfer, Academia, Linux user and Home worker. Each cluster has its own distinguishing features.
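The clustering step of the mining run described above can be illustrated with a minimal Python sketch. It is not the algorithm used by IBM Intelligent Miner or XploRe; it swaps in a simple k-modes style procedure with a matching dissimilarity, which is one common way to cluster categorical survey data. The profile fields (occupation, operating system, information source), the sample records and the function names `mismatch` and `k_modes` are hypothetical and serve only as illustration.

```python
from collections import Counter


def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(records, k, max_iter=20):
    """Cluster categorical records around k modes (illustrative stand-in only)."""
    # Deterministic start: the first k distinct records serve as initial modes.
    modes = []
    for record in records:
        if record not in modes:
            modes.append(record)
        if len(modes) == k:
            break

    labels = []
    for _ in range(max_iter):
        # Assignment step: each record joins the mode it differs from least.
        new_labels = [
            min(range(len(modes)), key=lambda j: mismatch(record, modes[j]))
            for record in records
        ]
        if new_labels == labels:
            break  # assignments are stable, so the procedure has converged
        labels = new_labels

        # Update step: each mode takes the most frequent value per attribute.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = tuple(
                    Counter(column).most_common(1)[0][0]
                    for column in zip(*members)
                )
    return labels, modes


# Hypothetical survey profiles: (occupation, operating system, information source).
profiles = [
    ("student", "windows", "search engine"),
    ("professor", "linux", "colleague"),
    ("student", "windows", "colleague"),
    ("professor", "linux", "journal"),
    ("student", "windows", "search engine"),
    ("professor", "linux", "colleague"),
]

labels, modes = k_modes(profiles, k=2)
```

With real survey data, the number of clusters and the starting modes would have to be chosen with care, since the result of such a procedure depends on both.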

The comparison of the customer data and the user data 2002 led to the discovery of the highly profitable sector, research institutes. XploRe and IBM Intelligent Miner (IM) delivered similar clustering results for the user data, but IM performed better in visualisation and computational efficiency. Comparing the user data 2003 with the historical user data 2000 revealed some trends. More professional users switched to command-driven software. XploRe made progress in its communication channels. Asia, especially Japan, emerged as a new market. In terms of segments, Internet surfer is a brand-new group in 2003, which indicates the arrival of the Internet age. The appearance of Home worker in 2003 in place of Researcher in 2002 hints at a problem in the survey questionnaire. More members of Academia use non-personal channels to get information, which again confirms the improvement made by XploRe in its communication channels. Linux users were very stable during the period.

Based on the findings of the analysis, some suggestions for marketing strategy and further analysis were made for the XploRe marketer.

This paper consists of four main parts. The first two sections following the introduction lay the theoretical foundation for customer analysis and data mining. Section three presents the analysis and results. Marketing strategy and suggestions are developed in the fourth section. At the end, the summary gives a brief overview of the whole paper.

1. Customer analysis

In today's marketplace, competition is intense. The market abounds with all kinds of products. To win customers over to their products, companies need a deep insight into what the customers really need and how to influence their purchasing decisions. Therefore, companies should now conduct business with a customer focus, with the emphasis on understanding the customers and the market.

Customer analysis is the study of customers and their behaviour, which is central to achieving a customer focus.[2] The purpose of conducting customer analysis is to achieve marketing goals such as the following:[3]

Customer acquisition: finding new customers

Customer cross-sell: further sales of different products to the same customer

Customer up-sell: the customer makes greater use of the same product or service

Customer retention: keeping the customer loyal

1.1 Customer Behaviour

In order to understand the customer buying behaviour, we should first understand

the customer behaviour.

1.1.1 Customer's Black Box

Customer behaviour here means the behaviour of individuals who purchase for private or household consumption. These customers buy goods which are not part of a value chain, and the purpose of purchasing is not to generate profit. Buying behaviour depends on the individual's reaction to internal and external stimuli; therefore, it is difficult to predict. The black box is the term that describes the customer's purchasing decision process, which is difficult to access but is crucial for the purchase decision.

[2] WWW14

[3] Heygate, Richard, 1998.

In order to develop appropriate products that are attractive to customers, firms need insight into what happens in the black box. Figure 1.1 presents the customer's black box. Inside the black box, the customer gathers information, evaluates and compares, and then comes to a decision; this is called the consumer buying process.

[Figure: diagram with the labels Black box (identification of needs; evaluation of offers that satisfy need; comparison of substitute products and brands; purchase; post-purchase evaluation), Consumer (aspirations, motivation, education, personality, beliefs), External stimuli (social pressure, legal requirements, physical factors, economic cycle) and Marketer 7Ps (people, place, promotion, product, price, process, physical environment)]

Fig. 1.1: The customer's black box.

1.1.2 Consumer buying process

Buying decision process

The buying process starts with the customer's desire for a product. This want might be the result of internal stimuli, like hunger and thirst, or of external stimuli, such as advertisements.

The next step is the search for information. Consumers may collect information consciously or unconsciously from various sources. There are four kinds of information sources:

1. Personal sources such as family, friends, colleagues and neighbours;

2. Public sources such as the mass media and consumer organisations;

3. Commercial sources such as advertising, sales staff and brochures;

4. Experiential sources such as handling or trying the product.

[3] Bannes, E., McClelland, B., etc., 1997, p. 139.

[Figure: flow chart of five stages: Recognition of the problem -> The search for information -> Evaluation of the alternatives -> The purchase decision -> Post-purchase behaviour]

Fig. 1.2: A sequential model of the buying process

Through information gathering, customers become aware of the various products and brands in the market; they then evaluate the alternatives and finally make the purchase decision.

After a major purchase or expenditure, many people experience cognitive dissonance, also called post-purchase anxiety. They wonder whether they have made the correct purchasing decision. To reduce this anxiety, they look for confirmation. For example, they might ask friends to confirm that their purchase was the right choice.

Figure 1.2 summarises the stages of the consumer buying process: recognition of the problem, the search for information, evaluation of the alternatives, the purchase decision and post-purchase behaviour.

Companies should be present at each stage of the buying process and try to stand out from the products and brands of all competitors. For a brand or product to become the customer's final choice, companies need a clear understanding of the evaluative criteria that consumers use in comparing products (see Section 1.1.3).

[3] Wilson, R. W. S. and Gilligan, C., p. 170.


Five buying roles

The purchase process normally involves several persons, each with a distinct role. The roles do not necessarily have to be filled by different persons; one person can play several roles in a purchasing process.

The five roles in a purchasing process are:

The initiator: the person who suggests buying the product or service.

The influencer: the person whose comments can affect the purchasing decision.

The decider: the person who decides whether to buy and which product to buy.

The buyer: the person who executes the purchase.

The user: the final consumer of the product or service.

For example, a mother buys ice cream for her child. The child is the user; the mother is the decider and buyer. The company should understand the function that each role plays in the buying process in order to influence the customer's buying decision effectively through proper action.

1.1.3 Customer behaviour model

The customer behaviour model describes the procedure and basic elements of what happens inside the customer's black box, that is, the consumer buying process.

The most basic, simplest and best-known model of buyer behaviour is AIDA, which stands for Awareness, Interest, Desire and Action.[4]

The model introduced here is composed of six interrelated components.[5]

1. Information or facts refers to the percept caused by a stimulus.

2. Product recognition is the extent to which the buyer knows enough about the product to distinguish it from other products.

3. Attitude towards the product refers to what the customer expects from the product in satisfying his or her particular needs.

4. Confidence in judging the product is the customer's degree of certainty that his or her evaluative judgement of a product is correct.

5. Intention to buy is the mental state that reflects the customer's plan to buy some specific number of products from a particular brand in some specified time period.

6. Purchase is caused by the intention to buy. It is defined as the point at which the customer has paid for a product or has made some financial commitment to buy some specified amount during some specified time period.

[Figure: diagram linking the six components F, R, A, C, I and P. Legend: F = Information, R = Product recognition, C = Confidence, A = Attitude, I = Intention, P = Purchase]

Fig. 1.3: Consumer Behaviour model.

[4] Baker, M. and Hart, S., 1999, p. 63.

[5] Howard, J. A., 1994, pp. 31-56.

When consumers evaluate a product, they also employ certain evaluative criteria, which have several aspects:

1. The product's attributes, such as its price, performance, quality and styling.

2. Their relative importance to the consumer.

3. The consumer's perception of each brand's image.

4. The consumer's utility function for each of the attributes.

These evaluative criteria overlap with the elements of the consumer behaviour model. For instance, product recognition, attitude towards the product and confidence in judgement are the three parts of the buyer's image of a product. They all have a vital impact on the consumer's buying decision.
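How the four aspects of the evaluative criteria could combine into a single judgement can be sketched as follows. The sketch assumes a weighted additive model, in which the utility of each attribute is weighted by its relative importance; this combination rule, the attribute scales, the utility functions and the two brands are illustrative assumptions and are not taken from the model above.

```python
def brand_score(perceived, importance, utility):
    """Weighted additive score of one brand (assumed combination rule)."""
    total = sum(importance.values())
    return sum(
        importance[attr] / total * utility[attr](perceived[attr])
        for attr in importance
    )


# Hypothetical relative importance of each attribute to one consumer.
importance = {"price": 2, "performance": 3, "quality": 5}

# Hypothetical utility functions mapping a perceived value to the range 0..1.
utility = {
    "price": lambda p: max(0.0, 1 - p / 1000),  # cheaper is better
    "performance": lambda r: r / 10,            # rating on a 0-10 scale
    "quality": lambda r: r / 10,                # rating on a 0-10 scale
}

# Hypothetical perceptions of two brands by the same consumer.
brand_a = {"price": 400, "performance": 8, "quality": 6}
brand_b = {"price": 200, "performance": 5, "quality": 5}

score_a = brand_score(brand_a, importance, utility)
score_b = brand_score(brand_b, importance, utility)
preferred = "A" if score_a > score_b else "B"
```

In this made-up case brand B is cheaper, but the consumer weights quality and performance more heavily, so brand A receives the higher score.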


[Figure: the buyer surrounded by five groups of factors: Cultural (culture, sub-culture, social class), Environmental (economic cycle, social pressure, legal requirement, new technology), Social (reference groups, family, roles and status), Psychological (motivation, learning, perception, beliefs and attitudes) and Personal (age and life cycle stage, occupation, economic circumstance, lifestyle and personality)]

Fig. 1.4: Factors influencing consumer behaviour.

1.1.4 Factors influencing customer buying behaviour

Various factors influence customer buying behaviour. Generally, they can be put into five categories: psychological factors, cultural factors, social factors, personal factors and environmental factors.[6][7][8]

1. Psychological factors

Human needs include basic needs, like shelter, food and drink, and higher-level needs, such as friendship and achievement. People purchase goods to satisfy their needs. Purchasing behaviour can be considered the result of internal and external stimuli.

Maslow (1943) suggested that behaviour can be explained by a hierarchy of needs. He grouped people's needs into five levels and argued that when a person has satisfied one level of needs, he or she will strive for the next level. Maslow's five levels of needs are physiological needs, safety needs, social needs, esteem needs and self-actualisation needs.[9]

Physiological needs are the basic needs for human beings to survive, such as food and drink. Only after these needs are satisfied will the other levels of needs be desired.

[6] WWW11

[7] Bannes, E., etc., 1997, pp. 139-149.

[8] Environmental factors are external factors, while the other four factor categories are internal factors that influence consumer buying behaviour.

[9] Bannes, E., McClelland, B., etc., 1997, pp. 139-184.

Safety needs refer to people's needs for security, stability and predictability. Services such as insurance and guarantees are products that satisfy human safety needs.

Social needs describe the human desire for love and a sense of belonging. At this level, people seek to join associations and clubs.

Esteem needs show themselves in the search for status, esteem, achievement and recognition. To satisfy this level of needs, people turn to luxury products, like perfumes, high-tech products, cars, etc.

Self-actualisation is the highest level of needs. Only after people have satisfied all the other levels of needs will they turn to the realisation of their potential, which is expressed in concern for external issues, like volunteer work.

2. Personal factors

Personal factors are the set of the buyer's personal characteristics, including age, occupation, lifestyle, personality and economic circumstances.

3. Cultural factors

Cultural factors include culture, sub-culture and social class.

Culture is a set of shared values which define people's behaviour. Language is the best example of cultural difference: using a language incorrectly will cause misunderstanding. There are also differences in attitude between Eastern and Western cultures towards family and the individual.

A large society or culture is normally divided into subculture groups, which define more subtle behavioural norms. Subculture groups include ethnic, religious, racial and geographical groups. They differ in cultural preferences, ethnic tastes, attitudes, lifestyles and taboos.

Social class is also called socio-economic group. It is determined by income level, education and occupation. The often-used social class model divides society into upper class, upper middle class, lower class, upper working class, working class and others.

4. Social factors

Social factors include reference groups, family, social role and status.

Reference groups are defined as all groups that have a direct (face-to-face) or indirect influence on the person's attitude or behaviour.[10] Reference groups can be divided into four types.

1. Primary membership groups are generally informal groups whose members interact with one another, such as family, neighbours, colleagues and friends.

2. Secondary membership groups are more formal than primary membership groups, and there is less interaction between members. These include religious groups, professional groups and trade unions.

3. Aspirational groups are groups that one would like to belong to.

4. Dissociative groups are groups whose values and behaviour are rejected by the individual.

5. Environmental factors

Environmental factors consist of economic, social, political and technological aspects. The economic cycle, social pressure, legal requirements and new technology all influence the consumer's decision on which product to buy and how to buy it.

1.2 Market Segmentation and Profiling

When firms try to sell their products in consumer markets, they should not only try to identify the factors that influence the customer's black box, but also estimate whether there are enough customers who need their offer. It is important for companies to compare their capabilities with the objectives of customers, so that they can decide whether they are able to serve the market profitably with appropriate products. Therefore, firms must identify market needs, segment the total market into potential customer groups that are likely and able to purchase the offer, and position the product or service as an attractive alternative to the other offers available to the target groups.

[10] Wilson, Gilligan and Person, 1994, p. 160.


1.2.1 Market segmentation

Market segmentation is the subdivision of a market into distinct subsets of customers, where any subset may conceivably be selected as a target market to be reached with a distinct marketing mix.[11]

Market segmentation is inspired by Kotler's target marketing. As Kotler put it, in target marketing the seller distinguishes the major market segments, targets one or more of these segments, and develops products and services tailored to each selected segment.[12]

Because individuals differ in preferences, characteristics, tastes and interests, their buying behaviour patterns are varied and heterogeneous, and it is almost impossible, or at least unprofitable, for a company or a single product to serve all needs. Furthermore, communicating a marketing mix to a non-homogeneous group is inefficient. Therefore, companies search for groups with attractive attributes and concentrate on them, developing specific products and services and using specific marketing resources to gain the maximum market return.

Segmentation identifies the subsets of buyers who share similar needs and demonstrate similar buying behaviour. It subdivides a heterogeneous total customer market into smaller, manageable and homogeneous clusters according to chosen criteria. Within each cluster, buyers show similar patterns of needs and buying behaviour that are identifiable and relevant to the buying decision.

Customer segmentation brings major benefits to companies:[13]

Efficiency: Because the customers are subdivided, companies can focus only on the markets of interest. They can therefore allocate and use their resources more efficiently.

Effectiveness: Through segmentation, the needs of each customer segment can be better identified and examined, so the understanding and awareness of customer needs are enhanced. Companies can tailor their products and marketing measures to meet customer needs more effectively. With improved marketing effectiveness, the customer response rate increases, and with it the return and profit from marketing investment.

New market: Segmentation can help companies identify new market opportunities. The needs and characteristics of the total market are so diverse that the unique features of a small group are not distinguishable. After segmentation, a company can discover the markets with unique features, which offer valuable opportunities to enter new markets.

[Figure: flow chart of three steps: Defining the market -> Selecting the base for segmentation -> Dividing the market and profiling]

Fig. 1.5: The process of marketing segmentation.

[11] Kotler, 1995, p. 286.

[12] Kotler, 1991, p. 262.

[13] WWW29.

The process of market segmentation[14]

The process of market segmentation is composed of three steps.

1. Defining the market

The total market for a product or service comprises all of the consumers who desire or potentially desire it, and who are willing and able to buy it. It is necessary to analyse the market in terms of its size and pattern of demand.

[Figure: three panels showing homogeneous demand, diffused demand and clustered demand]

Fig. 1.6: Alternative consumer demand categories.

[14] Bannes, E., McClelland, B., and Meyer, R., 1997, pp. 181-185.

There are three patterns of demand:[15]

1. Homogeneous demand

All consumers in a market have relatively similar needs and wants.

2. Diffused demand

Consumers' needs are so diverse that no clear segments can be identified. This suggests the need for customisation.

3. Clustered demand

Consumers' needs and desires can be grouped into two or more identifiable segments, each with its own set of purchase criteria.

2. Selecting the approach and bases for segmentation

Market segments can be identified through detailed market research or through basic analysis of customer data held within a company. Many companies keep customer records detailing information such as age and gender.

[15] Bannes, E., McClelland, B., etc., pp. 181-183.


There are generally two types of methods of market segmentation.[16][17]

1. A Priori methods:

In an a priori approach, the basis for segmentation is set in advance. Primary market research is not necessary. Instead, the analysis of secondary data sources, the customer information at hand, manager intuition and other methods are employed to set the segmentation basis for the buyers according to their usage patterns (heavy, medium, light and non-user), demographic characteristics (age, sex, income) or psychographic profiles (personality). Once the basis is set, research is conducted to identify the size, location and potential of each segment. The marketing decision then concerns which segment the marketing efforts should concentrate on. Classification, for example, is an a priori approach.

2. Post hoc methods:

The post hoc approach segments the market on the basis of the research findings rather than deciding the segmentation basis in advance. Primary market research is conducted to collect the classification and descriptor variables. Segments are defined only after all the relevant information has been collected and analysed. The research might highlight particular attributes, attitudes or benefits with which particular groups of customers are concerned. The result then becomes the basis for dividing the market.
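The difference between the two approaches can be sketched in a few lines of Python. In the a priori function the usage bands are fixed before any data are seen; in the post hoc function the cut point is derived from the data themselves, here by the deliberately simple rule of splitting at the widest gap between sorted values. The thresholds, the sample figures and the function names are hypothetical, and the gap rule merely stands in for a proper clustering method.

```python
def a_priori_segment(uses_per_month):
    """Segmentation basis fixed in advance: hypothetical usage bands."""
    if uses_per_month == 0:
        return "non-user"
    if uses_per_month < 5:
        return "light"
    if uses_per_month < 15:
        return "medium"
    return "heavy"


def post_hoc_split(values):
    """Segmentation basis derived from the data: split at the widest gap."""
    ordered = sorted(values)
    # Pair every gap between neighbouring values with its position.
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    _, i = max(gaps)
    cut = (ordered[i] + ordered[i + 1]) / 2
    return cut, ["low" if v < cut else "high" for v in values]


# Hypothetical number of uses per month for seven customers.
usage = [0, 2, 3, 4, 18, 20, 25]

preset = [a_priori_segment(u) for u in usage]
cut, derived = post_hoc_split(usage)
```

The a priori bands exist whether or not any customer falls into them (no customer here is "medium"), whereas the post hoc split only produces groups that the data actually support.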

3. Dividing the market and profiling the segments

Based on the data gathered, the market is divided into identifiable market segments. The information obtained gives details about the nature of the customer segments. This is called segment profiling. Profiling associates each segment with certain characteristics, aggregates customers with similar characteristics into a group and separates them from those with different characteristics.

Criteria of customer segmentation

A market could be segmented in various ways. There are problems with segmentation, such as the relevance and quality of the data, intuition, continuous process and over-segmentation. A good segmentation should be relevant for buying behaviour and satisfy the following requirements:18 19

16 WWW31
17 Han, J. and Kamber, M., 2001, P281-319.

Size: The market should be big enough to guarantee a good segmentation. It is dangerous to over-segment an already very small market.

Difference: Differences between the members of different segments should exist and be measurable through data collection.

Measurability: The company is able to collect information that measures the nature of buying behaviour for the segment.

Substantiality: The selected segment should be profitable relative to the marketing mix resources designed especially for it.

Accessibility: The extent to which the marketing effort can reach the segment.

Stability over time: The segment should last for a certain period without dramatic change in its major features.

Responsiveness to communication means: The segment should be sensitive to the marketing mix and communication means.

Variables for customer segmentation

Almost all factors that affect customers' buying process and decisions can be used as variables for customer segmentation. Generally, these variables can be put into five categories: Demographic, Geographic and Geo-demographic, Socio-economic Grade, Psychographic and lifestyle, and Behavioural.20 21

1. Demographic variables

Demographic variables categorise the market according to population characteristics and population profiles. Customers are subdivided into groups based on one or more demographic variables such as age, sex, religion, race, nationality, family size and stage of the family life cycle. For example, the custom seller groups

18 WWW20
19 Wilson, R. and Gilligan, C., 1997, P275.
20 Kalakota, R. and Whinston, A. B.
21 McDonald, M. and Dunbar, I., P85-91.


ACORN group                                   1981 population      %

A  Agricultural areas                              1,811,485     3.4
B  Modern family housing, higher incomes           8,667,137    16.2
C  Older housing of intermediate status            9,420,477    17.6
D  Older terraced housing                          2,320,846     4.3
E  Better-off council estates                      6,976,570    13.0
F  Less well-off council estates                   5,032,657     9.4
G  Poorest council estates                         4,048,658     7.6
H  Multi-racial areas                              2,086,026     3.9
I  High-status non-family areas                    2,248,207     4.2
J  Affluent suburban housing                       8,514,878    15.9
K  Better-off retirement areas                     2,041,338     3.8
U  Unclassified                                      388,632     0.7

Tab. 1.1: Broad-based ACORN classifications 23

customers by age: those aged 20-30, for instance, are the customers who are more likely to purchase trendy items.

2. Geographic and Geo-demographics

Geographic segmentation divides the market into different geographic units such as countries, regions, counties, cities and postcodes. Geo-demographic systems are based on the proposition that the neighbourhood in which you live is reflected in your professional status, income, life stage and behaviour. The neighbourhood types are initially identified using national census data.

ACORN (A Classification of Residential Neighbourhoods) is an example of such a system. ACORN classifies consumers into 43 demographically and behaviourally distinct clusters. The clusters are based on the type of neighbourhood, socio-economic status, and buying behaviour and preferences.22 A broad-based ACORN classification was conducted in Great Britain in 1981. It segments the residents of Great Britain into 12 categories.

3. Socio-economic Grade

Buying behaviour is often influenced by a person's social class. The factors include income, status, education etc. The National Readership Survey scale

22 Kurs, M., Ryan, B., Lamb, G. etc., 2001.
23 Bannes, E., McClelland, etc., 1997, P201.


Grade  Social classification   Occupation

A      Upper middle class      Higher managerial, professional or administrative jobs
B      Middle class            Middle managerial, professional or
C1     Lower middle class      Supervisory or clerical jobs, junior management
C2     Skilled working class   Skilled manual workers
D      Working class           Unskilled and semi-skilled manual workers
E      Subsistence level       Pensioners, unemployed, casual or low grade workers

Tab. 1.2: National Readership Survey socio-economic groups 24

is one of the popular classifications and is based on the occupation of the main wage earner of the household.

A further development of the socio-economic grade model is SAGACITY, developed by Research Services Ltd. This model combines life stages with income and social class.

4. Psychographic variables

Psychographics attempts to classify individuals by their attitudes, personality

and life styles.

(1)Personality

Personality is used as a variable to segment the market. The earliest such segmentation was conducted by Riesman et al. (1950). It identified three distinct types of social characterisation and behaviour: 25

1. Tradition-directed behaviour, which changes little over time and, as a result, is easy to predict and to use as a basis for segmentation.

2. Other-directedness, in which the individual attempts to fit in and adapt to the behaviour of the peer group.

3. Inner-directedness, where the individual is seemingly indifferent to the behaviour of others.

(2) Attitude

Attitude includes the customers' attitudes towards risk, degree of loyalty, the

24 Kurs, M., Ryan, B., Lamb, G. etc., 2001.
24 Blois Keith, 2000, P389.
25 Wilson, Gilligan and Pearson, 1994, P291.


[Figure: tree diagram combining life cycle (Dependent, Pre-family, Family, Late), income (Better off, Worse off) and occupation (White-collar, Blue-collar).]

Fig. 1.7: SAGACITY.


likelihood of adopting new products, etc. Many of the personality variables can also be used as descriptors of attitude.

(3) Lifestyle

Consumers' behaviour is also determined by the way they live their lives. It arises from a complex relationship between their aspirations, current situation, perception of self, income and attitudes. Lifestyle market segmentation offers a detailed view of buyers because it comprises numerous characteristics related to their activities, interests and opinions. Lifestyle consists mainly of three dimensions: 26

1. Activities: Work, hobbies, social events, vacations, entertainment, club membership, community, shopping, sports.

2. Interests: Family, home, job, community, recreation, fashion, food, media,

and achievements.

3. Opinions: Selves, social issues, politics, business, economics, education,

products, future, culture.

5. Behavioural variables

(1) Benefit sought variables

This group of variables for segmenting customers considers the motive for a purchase. It groups consumers according to the specific benefits that they seek in a product. Even if two customers buy exactly the same product, the benefits they expect may vary. Benefit segmentation is therefore based on behavioural processes, involving thought and action, as opposed to age and socio-economic class, which are defined according to individual characteristics. It closely identifies the customers' needs and represents a powerful method of understanding and influencing behaviour.

When applying this approach, a company should begin by attempting to measure consumers' value systems and their perceptions of various brands within a given product class. The information gathered is then used as the basis of market segmentation. Benefit segmentation begins by determining the principal benefits that customers are seeking in the product, the kinds of people who look for each benefit and the benefits delivered by each brand. For example, in the toothpaste market, four segments are identified according to the benefit sought: economy, decay prevention, cosmetic and taste benefits.

26 McDonald, M. and Dunbar, I., 2000, P89.

(2) User status

The market can be divided into five segments according to user status: non-users, ex-users, potential users, first-time users and regular users. First-time users and potential users can be further subdivided on the basis of usage rate.

(3) Loyalty Status and Brand Enthusiasm

Loyalty status categorises customers on the basis of the extent and depth of their loyalty to particular brands or products. Most typically there are four categories: hard-core loyals, soft-core loyals, shifting loyals and switchers.27

1. Hard-core loyals are customers who consistently buy the same brand or product.

2. Soft-core loyals are those who are willing to choose from a limited brand set. Their loyalty is divided among the limited brands or products.

3. Shifting loyals are consumers who shift their loyalty from one brand to another. After they shift, they no longer buy the former brand.

4. Switchers are those who show no loyalty to any single brand. Their buying pattern is typically determined either by the special offers available or by their search for variety.

(4) Critical events

Major or critical events generate needs that can be satisfied by the provision of a special collection of products and/or services. Typical examples are marriage, a death in the family, unemployment, illness, retirement and moving house.

1.2.2 Customer profiling

Customer segmentation and customer profiling are two elements of Customer Relationship Management (CRM). Customer profiling is performed after customer segmentation. Its aim is to locate clusters within the customer file that outperform the average.28 It creates a customer segment profile, which labels

27 Wilson, Gilligan and Pearson, 1994, P291.
28 WWW18


the customers with their attributes.

Identifying the characteristics of customers helps the company decide which segments will respond best to its marketing effort. When companies have a clearer overview of the attributes and demands of the customer segments, they can decide what action should be taken and what resources should be allocated to the selected segments. Furthermore, based on pre-built models, customer profiling can also be used to find potential customers and to remove inactive or bad customers.

The profiling attributes are similar to the segmentation attributes. For example, the profiling attributes include: Geographic, Cultural and ethnic, Economic conditions (income and/or purchasing power), Age, Values, attributes, beliefs, Knowledge and awareness, Lifestyle, Media, Recruitment method. For acquired customers, variables of customer behaviour can also be employed as profiling variables, such as shopping frequency, complaint frequency, degree of satisfaction and preferences.

1.3 Market targeting and Positioning

1.3.1 Market Targeting

The next task after customer segmentation and profiling is market targeting. Companies choose one or several segments as the target market. The target market is the market that the company decides to serve. A specific marketing mix and resources are developed to serve the target market.

Companies normally adopt one of three targeting strategies:29

Undifferentiated strategy: The company ignores the differences between customer segments and regards the whole market as a single market. A single marketing mix is adopted for the whole market. This is so-called mass marketing.

Differentiated strategy: The whole market is divided into several segments. The company develops a different marketing mix for each segment.

28 Keith Blois, 2000, P398.
29 Armstrong, G. and Kotler, P., 2002, P255-258.


[Figure: diagram of the undifferentiated, concentrated and differentiated strategies, each linking the organisation through its marketing mix (or marketing mixes 1 to 3) to the entire market or to segments 1 to 3.]

Fig. 1.8: Targeting strategies.

Concentrated strategy: The company chooses one or several market segments but applies only a single marketing mix. Under this strategy, the company tries to win a high market share in one or several niche markets instead of struggling for a small share of the whole market. For firms with limited resources, this strategy is very appealing.

1.3.2 Positioning

The purpose of target marketing is to focus on the selected target market and fine-tune the marketing mix to provide a group of potential customers with superior value, thereby building up a unique position for the product in the customers' view. A product's position is the complex set of perceptions, impressions and feelings that it induces in consumers, compared with competing products.30 Positioning refers to how customers think about proposed and/or present brands in a market.31 The fundamental idea of positioning is competitive advantage.32 Through

30 Bannes, McClelland, Meyer and Wiesehofer, 1997, P230.
31 WWW33
32 WWW30


the differentiated marketing mix, the special needs and demands of customers can be satisfied. Thus, customers will view the product or brand as superior to the others and assign it a distinct position. To position a product, the marketer must appeal strongly to the target customers with its strengths and differences, using a proper marketing mix.

2. Data Mining

Data mining, also known as Knowledge Discovery in Databases (KDD),33 is a powerful new technology that helps companies identify the important information in a sea of data. Data mining technology is commonly used for customer analysis.

Fayyad defined data mining as a non-trivial process aimed at identifying valid, novel, potentially useful and ultimately understandable patterns in data.34 Grameier and Rudolph, by contrast, consider data mining in terms of all methods and techniques that allow very large data sets to be analysed in order to extract and discover previously unknown structures and relations from such huge heaps of details. This information is filtered, prepared and classified so that it becomes a valuable aid for decisions and strategies.35

Data mining extracts implicit, previously unknown and potentially useful information from the data in order to automate the discovery of significant patterns and trends.

2.1 The process of Data mining

The process of data mining can be summarised in four stages: data collection and selection, data preparation, data mining, and result interpretation.36 37

2.1.1 Data Collection and Selection

The ways of collecting data include:

In-house customer database: Companies normally keep records of customers. Customer information can be gathered from mailing lists, receipts, memberships, warranty registrations, etc.

33 Kotala, P., Perera, A., Kai Zhou, J., etc.
34 Fayyad, U., Piatetsky-Shapiro, G. et al., P6.
35 Grameier, J. and Rudolph, A.
36 IBM's Data Mining Technology, 1996.
37 Bounsaythip, C. and Rinta-Runsala, E., 2001.


External resources: There are external sources from which one can obtain information such as demographic data.

Research survey: A frequently used way to collect particular information is to conduct a survey. The survey can be conducted through face-to-face interviews, telephone interviews, postal questionnaires or via the Internet.

During the collection of data, two types of variables should be collected:38

Classification variables classify the data set into groups. Most demographic, geographic, psychographic or behavioural variables can be used to classify customers into segments.

Demographic variables: Age, gender, income, ethnicity, marital status, education, occupation, household size, length of residence, type of residence, etc.

Geographic variables: City, state, zip code, census tract, county, region, metropolitan or rural location, population density, climate, etc.

Psychographic variables: Attitudes, lifestyle, hobbies, risk aversion, personality traits, leadership traits, magazines read, television programmes watched, etc.

Behavioural variables: Brand loyalty, usage level, benefits sought, distribution channels used, reaction to marketing factors, etc.

Descriptor variables are used to describe each sub-group in a data set and distinguish it from the others. The descriptor variables thus stand for the characteristics of the represented data set. Descriptor variables must be easily obtainable variables that already exist in, or can be appended to, the customer files. Many classification variables can also be used as descriptor variables.

The data is normally stored in a data warehouse. Because the data warehouse contains many diverse types of data, the data to be used in the analysis must first be selected before data mining is conducted.

38WWW7


2.1.2 Data Preparation

Before data can be analysed, the collected data must first be prepared in order to make it suitable for the analysis. Data preparation consists of the following stages:

1. Data cleaning:

Check for abnormal, out-of-bounds or ambiguous items.

Strip out unwanted fields or items. Some attributes are useless for analysis purposes, such as version numbers, email addresses, etc.

Resolve inconsistent data formats, data encoding, geographical spellings, abbreviations and punctuation.

2. Data description

Supply meta data such as row or value counts or variables

3. Data Transformation:

Convert string variables into numeric or categorical variables, or interpret codes and replace them with text.

Check for missing values. Delete them or replace them with default values.

Add computed fields as input or target.

Combine data from multiple sources under a common code.

Identify fields that are used multiple times.

Convert continuous variables into categorical variables for some methods.

Convert nominal data into metric data.


4. Data Sampling39

Required for training or model building

5. Data pruning

Identify dependent, independent and correlated columns or variables
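Some of the cleaning and transformation steps listed above can be illustrated with a minimal Python sketch. The records, field names, age bounds and the age threshold of 40 are purely hypothetical and are not taken from the XploRe customer data; the sketch only shows the kind of operation each step involves.

```python
from statistics import mean

# Hypothetical raw customer records; all fields and values are illustrative.
raw = [
    {"id": 1, "gender": "F", "age": 34, "city": "berlin ", "version": "4.2"},
    {"id": 2, "gender": "m", "age": None, "city": "Berlin", "version": "4.1"},
    {"id": 3, "gender": "M", "age": 230, "city": "BERLIN", "version": "4.2"},
    {"id": 4, "gender": "F", "age": 52, "city": "Potsdam", "version": "4.0"},
]


def prepare(records, drop_fields=("version",), age_bounds=(0, 120)):
    """Clean and transform records; returns new dictionaries."""
    cleaned = []
    for rec in records:
        # Data cleaning: strip out unwanted fields.
        row = {k: v for k, v in rec.items() if k not in drop_fields}
        # Data cleaning: resolve inconsistent spellings and encodings.
        row["city"] = row["city"].strip().title()
        row["gender"] = row["gender"].upper()
        # Data cleaning: treat out-of-bounds ages as missing.
        age = row["age"]
        if age is not None and not age_bounds[0] <= age <= age_bounds[1]:
            row["age"] = None
        cleaned.append(row)

    # Data transformation: replace missing values by a default (here the mean).
    default_age = mean(r["age"] for r in cleaned if r["age"] is not None)
    for row in cleaned:
        if row["age"] is None:
            row["age"] = default_age
        # Convert a string variable into a numeric code.
        row["gender_code"] = {"F": 0, "M": 1}[row["gender"]]
        # Convert a continuous variable into a categorical variable.
        row["age_group"] = "under 40" if row["age"] < 40 else "40 and over"
    return cleaned


prepared = prepare(raw)
for row in prepared:
    print(row)
```

Replacing missing ages by the mean is only one possible default; depending on the method used later, deleting such records may be preferable.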

2.1.3 Mining

At the mining stage, various techniques can be used to extract valuable information from the prepared data. For example, suppose the aim is to create an accurate, symbolic classification model that predicts whether a reader will continue to subscribe to a newspaper. First, a clustering technique is applied to segment the subscriber database; then rule induction is used to create a classification model automatically for each desired cluster, through which one can predict the behaviour of a customer.

2.1.4 Result Interpretation

Result interpretation is not only to visualise (graphically or logically) the output

of data mining, but also to filter the information and identify the most valuable

and proper result, which will help in the decision making. If the interpreted result

is not satisfactory, the data mining stage or even the whole data mining procedure

should be repeated. The final extracted information must be comprehensible.

2.2 The Aspects of Data Mining

Data mining can be described in terms of its applications, operations, techniques and algorithms.40 41

39 Ferguson, Mike
40 WWW4
41 IBM's Data Mining Technology, 1996


Applications   Database marketing
               Customer segmentation
               Customer retention
               Fraud detection
               Credit checking
               Web site analysis

Operations     Prediction and classification modelling
               Link analysis
               Database segmentation
               Deviation detection

Techniques     Supervised induction
               Clustering
               Association discovery
               Sequence discovery

Tab. 2.1: The aspects of data mining

2.2.1 Applications

Data mining is widely used in customer analysis and marketing. The following

areas cover the main application of data mining.42

Customer segmentation: Data mining tools automate the process of finding predictive information in large databases. Companies, especially retailers and banks, are interested in knowing whether there are sub-groups of customers who exhibit certain characteristics. They can use data mining to cluster the customers and discover groups of interest. For example, companies use data mining to analyse historical mailing lists in order to find the groups with a high return on investment, so that they can determine the target groups for new mailings. Banks and credit companies classify credit scores to identify the customer segments that carry lower risk.

Relationship management: Data mining discovers and identifies previously unknown relationships hidden in the data. The buying patterns of customers are of interest to retailers and advertisers. Combined with customer segmentation, data mining can help them to find the relationship between the product items purchased and customer types, or to improve the running of an advertising campaign in particular media for a specific group of customers.

42Carbone, Patricia L.


2.2.2 Operations

Predictive and classification modelling: A predictive model uses the contents of the database, which reflect historical data, to automatically generate a model that can predict future behaviour. Classification sub-divides a data set according to a number of specified outcomes. The goal of the modelling operation is to create a generalised description of the characteristics of the data. For instance, a marketing executive may be interested in predicting whether a particular consumer will switch to a new product.

Link analysis: The goal of link analysis is to establish relationships between the records in a database. Retailers want to know which items a customer will purchase together in order to make decisions about item layout and goods purchasing. For instance, if it is found that customers buy a CD after purchasing a CD player, the store manager may decide to put the CD counter close to the CD player counter.

Database segmentation: A database often contains various types of data, so it is often necessary to segment the data into small groups of related records. The purpose can be either to obtain a general description of each group or to prepare for further analysis, such as model creation or link analysis. Suppose the store manager wants to know the combination of goods purchased by customers during a particular period. The database could first be segmented according to a time period attribute, such as "Christmas sale". Link analysis could then be conducted to find the relationships between the goods bought together.

Deviation detection: The aim of deviation detection is to identify the outliers in a particular data set and to determine whether their presence is due to noise, impurities or a causal reason. This operation is the opposite of database segmentation and is often carried out together with segmentation. Because outliers express deviation from some known expectation or norm, deviation detection is often the source of true discovery.
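As a simple illustration of deviation detection, the following Python sketch flags values that lie far from the mean of a data set. The z-score rule, the threshold of 2 and the spending figures are assumptions chosen for the example; they are one of many possible statistical techniques and are not prescribed by the sources cited here.

```python
from statistics import mean, pstdev


def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:  # all values identical: nothing deviates
        return []
    return [v for v in values if abs(v - centre) / spread > threshold]


# Hypothetical monthly spending of seven customers.
monthly_spend = [20, 22, 19, 21, 23, 20, 95]
outliers = find_deviations(monthly_spend)
print(outliers)
```

Whether a flagged record is noise or a genuine discovery still has to be judged by the analyst.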

2.2.3 Data Mining Techniques

Numerous techniques support the operations of data mining to find the desired

groups or relationships.


Classification and predictive modelling is supported by supervised induction tech-

niques. Clustering supports database segmentation. Association discovery and

sequence discovery are used for the link analysis. The deviation detection is

supported by statistical techniques.

The desired relationships to be discovered by data mining are:43

Classes: data items are assigned to predetermined groups.

Clusters: data items are grouped by logical relationships.

Associations: data is mined to identify associations.

Sequential patterns: data is mined to anticipate behaviour patterns and trends.

Supervised Induction

Supervised induction is the process of automatically creating a classification model from a set of records (examples),44 which is called the training set. The records in the training set must belong to a set of pre-defined classes. Each class has a distinguishable pattern, which is generated from the existing records. Once the model has been induced, a new record can automatically be put into a class according to its pattern.

Supervised induction contains steps of classification and prediction that put elements into predetermined groups according to some criterion. The number of subgroups and the features of each subgroup are defined at the beginning. The features of an observation are then compared with the criterion and the observation is put into the corresponding group.45 This is usually done in two steps:

Step 1: Build a model to describe the predetermined groups or classes of the data set. The model contains a set of classification rules (labels).

Step 2: If the accuracy of the model or classifier is acceptable, the model can be used to classify new unlabelled data groups or elements.
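The two steps can be sketched in Python with a very simple rule learner that picks the single attribute whose values best predict the class (in the spirit of one-rule induction). This learner is chosen here only for brevity and is an assumption of the example, not the technique used by any tool discussed in this thesis; the subscriber records and attribute names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical labelled records: (attributes, pre-defined class).
training = [
    ({"age_group": "young", "usage": "light"}, "cancel"),
    ({"age_group": "young", "usage": "heavy"}, "renew"),
    ({"age_group": "older", "usage": "heavy"}, "renew"),
    ({"age_group": "older", "usage": "light"}, "cancel"),
    ({"age_group": "young", "usage": "heavy"}, "renew"),
    ({"age_group": "older", "usage": "light"}, "cancel"),
]
validation = [
    ({"age_group": "young", "usage": "light"}, "cancel"),
    ({"age_group": "older", "usage": "heavy"}, "renew"),
]


def induce_rules(records):
    """Step 1: build the model as a set of rules on the best single attribute."""
    best = None
    for attr in records[0][0]:
        counts = defaultdict(Counter)
        for attrs, label in records:
            counts[attrs[attr]][label] += 1
        # One rule per attribute value: predict the majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rules[attrs[attr]] != label for attrs, label in records)
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best[0], best[1]


def classify(model, attrs):
    """Put a record into a class according to the induced rules."""
    attr, rules = model
    return rules.get(attrs[attr])


def accuracy(model, records):
    """Share of records whose predicted class equals the known class."""
    return sum(classify(model, a) == y for a, y in records) / len(records)


model = induce_rules(training)

# Step 2: use the model on unlabelled records only if its accuracy is acceptable.
if accuracy(model, validation) >= 0.8:
    prediction = classify(model, {"age_group": "young", "usage": "heavy"})
    print(model, prediction)
```

The accuracy threshold of 0.8 is arbitrary; what counts as acceptable depends on the application.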

Clustering

Clustering is a method of grouping data elements into homogeneous groups. It divides a heterogeneous data set into disjoint sub-groups, so that the elements in any one cluster are highly similar, while the elements in different

43 Chung, H. M., Gray, P. and Manino, M., 1998.
44 IBM's Data Mining Technology, 1996.
45 Han, J. and Kamber, M., 2001, P279-325.


clusters are highly dissimilar according to some criterion. Clustering is an unsupervised technique and is employed when you want to find groups of similar records without any preconditions. The difference between clustering and classification is that in clustering, the number of subgroups and the features (labels) of each subgroup are unknown in advance, while in classification, they are defined at the beginning.

Cluster analysis has two steps:46

Choose a proximity measure: A proximity measure determines the similarity or closeness of objects. Homogeneous objects are more similar and closer.

Choose a clustering strategy: In this step, the clustering algorithm and/or initial parameters are decided. According to the chosen proximity measure and method, the whole data set is divided into groups (clusters). The elements within a group should be as close as possible and the dissimilarity between groups should be as large as possible.

After the clusters are built, descriptive methods are normally employed to describe each cluster in order to get a comprehensive overview of the dissimilarity between clusters.

1. Proximity measure

The commonly used proximity measures include the Jaccard, Tanimoto, Simple Matching and Kulczynski coefficients and the Minkowski and Euclidean distances.
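A few of these measures can be written down directly. The sketch below uses the usual textbook definitions: Jaccard and simple matching as similarity coefficients for binary vectors, and the Minkowski distance for metric data with the Euclidean distance as its special case p = 2. The function names and example vectors are illustrative only.

```python
def jaccard(a, b):
    """Jaccard coefficient of two binary vectors: joint presences / presences in either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


def simple_matching(a, b):
    """Simple matching coefficient: share of positions on which the vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def minkowski(a, b, p=2):
    """Minkowski distance of order p between two metric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def euclidean(a, b):
    """Euclidean distance, the Minkowski distance with p = 2."""
    return minkowski(a, b, p=2)


# Two customers described by five binary attributes (e.g. products owned).
customer_a = [1, 1, 0, 0, 1]
customer_b = [1, 0, 0, 1, 1]
print(jaccard(customer_a, customer_b), simple_matching(customer_a, customer_b))

# Two customers described by two metric attributes.
print(euclidean((0, 0), (3, 4)), minkowski((0, 0), (3, 4), p=1))
```

Note that the Jaccard coefficient ignores joint absences, whereas simple matching counts them as agreement, so the two can rank the same pairs differently.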

2. Clustering strategy (method)

The clustering methods generally belong to several major families:47

1. Hierarchical algorithms

2. Iterative partitioning

3. Density search

46 Hardle, W. and Simar, L., P295-313.
47 Aldenderfer, M. S. and Blashfield, R. K., P35.


4. Factor analytic

5. Clumping

6. Graph theoretic

Here we discuss only two basic clustering methods: hierarchical algorithms and iterative partitioning algorithms.

(1) Hierarchical algorithms

Hierarchical clustering algorithms comprise two main types of procedure: the agglomerative procedure and the splitting procedure.

Agglomerative procedure starts from the finest partition. It considers each observation as a cluster, then puts groups together to form new clusters. At each stage of the procedure, the number of clusters is reduced by one by joining or fusing the two groups considered to be the closest or most similar. The agglomerative algorithm is a frequently used procedure. It contains the following steps:48 49

1. Construct the finest partition. Normally each observation is a group.

2. Compute the distance or dissimilarity matrix.

3. Find out the closest or most similar groups.

4. Put the two most similar groups together to form a cluster.

5. Compute the distances or dissimilarities between the new cluster and the remaining groups to obtain a reduced distance or dissimilarity matrix.

6. Repeat steps 3 to 5 until the optimal clusters are formed.
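The six steps can be sketched in a few lines of Python. The sketch makes two simplifying assumptions that are not part of the description above: it uses single linkage with Euclidean distance, and it stops when a requested number of clusters is reached instead of searching for an optimal partition. For clarity it recomputes the group distances in every pass rather than maintaining a reduced distance matrix.

```python
from itertools import combinations
from math import dist


def agglomerate(points, n_clusters):
    """Agglomerative clustering; returns clusters as sorted lists of point indices."""
    # Step 1: the finest partition, each observation is its own group.
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Single linkage: smallest distance between individual members.
        return min(dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # Steps 2 and 3: compute the group distances and find the closest pair.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Step 4: fuse the two most similar groups into one cluster.
        merged = clusters[a] + clusters[b]
        # Steps 5 and 6: continue with the reduced set of groups.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Hypothetical two-dimensional observations.
points = [(1, 1), (1.5, 1), (5, 5), (5, 5.5), (9, 1)]
result = agglomerate(points, n_clusters=3)
print(sorted(result))
```

Replacing `min` in `linkage` by `max` would turn the sketch into complete linkage.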

Splitting procedure is the opposite of the agglomerative procedure. It starts by considering the whole data set as one cluster, then splits the cluster into sub-groups to form new clusters.

The linkage for the agglomerative algorithm: There are many linkages to measure the proximity or similarity of elements and groups. The frequently used linkages are:

48 Mardia, K.V., Kent, J.T. and Bibby, J.M., 1979, P360-390.
49 Everitt, B. S. and Dunn, G., 1991, P99-126.


Single linkage defines the distance between two groups as the smallest distance between their individual members.

Complete linkage is the opposite of single linkage: it defines the distance between two groups as the largest distance between their individual members.

Average linkage (non-weighted and weighted) computes the average distance.

Centroid linkage uses the natural geometrical distance as the distance between groups.

Median linkage chooses the median of individual distances as the distance between groups.

Ward linkage is related to centroid linkage, but it uses an inertia distance rather than a geometric distance.
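Four of these linkages can be expressed as short Python functions, assuming Euclidean distance between individual observations. The function names and the two example groups are illustrative; median and Ward linkage are omitted because they are normally defined through update formulas on the distance matrix rather than directly on the members.

```python
from itertools import product
from math import dist


def pairwise_distances(group_a, group_b):
    """Euclidean distances between every member of one group and every member of the other."""
    return [dist(p, q) for p, q in product(group_a, group_b)]


def single_linkage(group_a, group_b):
    return min(pairwise_distances(group_a, group_b))


def complete_linkage(group_a, group_b):
    return max(pairwise_distances(group_a, group_b))


def average_linkage(group_a, group_b):
    distances = pairwise_distances(group_a, group_b)
    return sum(distances) / len(distances)


def centroid_linkage(group_a, group_b):
    """Distance between the group centroids (coordinate-wise means)."""
    centre_a = [sum(c) / len(group_a) for c in zip(*group_a)]
    centre_b = [sum(c) / len(group_b) for c in zip(*group_b)]
    return dist(centre_a, centre_b)


left = [(0, 0), (0, 2)]
right = [(3, 0), (3, 2)]
for linkage in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(linkage.__name__, linkage(left, right))
```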

(2) Iterative Partitioning algorithms

Partitioning algorithms start with given groups. Elements are then exchanged between groups until the highest homogeneity within groups and the highest heterogeneity between groups, or some other criterion, is reached.

The iterative partitioning algorithms are normally undertaken according to the

following steps :50

1. Begin with an initial partition into a chosen number of clusters. Compute the centroids of these clusters.

2. Allocate each data point to the cluster that has the closest centroid.

3. Compute the new centroids of the new clusters. The clusters are not updated until a complete pass through the data has been made.

4. Iterate steps (2) and (3) until no data points change clusters and the highest similarity inside the clusters is reached.
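These steps correspond closely to the well-known k-means procedure, which the following Python sketch implements under a few assumptions: Euclidean distance is used, the initial centroids are supplied directly instead of being computed from an initial partition, and a maximum number of iterations guards against endless loops. The data points are hypothetical.

```python
from math import dist


def centroid(members):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(members) for c in zip(*members))


def k_means(points, initial_centroids, max_iter=100):
    """Iterative partitioning; returns (cluster index per point, final centroids)."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Step 2: allocate each point to the cluster with the closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Step 4: stop when no data point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute the centroids after the complete pass.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster became empty
                centroids[k] = centroid(members)
    return assignment, centroids


points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
labels, centres = k_means(points, initial_centroids=[(1, 1), (9, 9)])
print(labels, centres)
```

The result generally depends on the initial centroids, so in practice the procedure is often repeated with several starting partitions.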

Association rule discovery

Association rule discovery is an iterative approach, also known as level-wise search. Association rule methods try to discover interesting relationships between the items in the data and identify customers' behaviour patterns. A typical example of association rules is market basket analysis. This analysis tries to find out, when customers go shopping, what kinds of products are

50Aldenderfer M. S. and Blashfield, R. K., P45-49.


more likely to be put into the shopping basket together. Through this analysis,

retailers are able to identify which items are frequently purchased together by the

customers.

An association rule is a relationship of the form $X \Rightarrow Y$, where X is the antecedent item set and Y is the consequent item set. For example: customers who purchase item X are very likely also to purchase item Y at the same time.51

There are two measures for each rule: support and confidence.52

Support (or prevalence) indicates the occurrence frequency of an itemset: $s(A \Rightarrow B) = P(A \cup B)$, that is, the proportion of transactions that contain both A and B.

Confidence (certainty or predictability) measures the validity of the pattern. It denotes the strength of the relationship between the items, that is, to what degree an item depends on the others.

For example: among all customers, only 5% are students who buy a computer. But if a customer is a student, the probability of his buying a computer is 20%. For the rule student $\Rightarrow$ computer, 5% is the support and 20% is the confidence.
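The example can be written out as a short calculation. The confidence formula is not stated explicitly in the text, so the sketch below assumes the usual definition of confidence as a conditional probability; $P(\text{student})$ is a symbol introduced here only for illustration, and $\cup$ denotes the union of the item sets, so the probability refers to transactions containing both.

```latex
\begin{align*}
s(\text{student} \Rightarrow \text{computer})
  &= P(\text{student} \cup \text{computer}) = 0.05 \\
c(\text{student} \Rightarrow \text{computer})
  &= P(\text{computer} \mid \text{student})
   = \frac{s(\text{student} \Rightarrow \text{computer})}{P(\text{student})} = 0.20 \\
\Rightarrow \quad P(\text{student})
  &= \frac{0.05}{0.20} = 0.25
\end{align*}
```

Under these assumptions, the figures in the example imply that a quarter of all customers are students.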

Two other important measures for association rule discovery are expected confidence and lift.

Expected confidence - the probability that an item is purchased regardless of what other items have been bought together with it. For instance, if customers buy a computer 40% of the time, then 40% is the expected confidence.

Lift - the difference between the confidence of a rule and the expected confidence, expressed either as an absolute difference or as a ratio. When the lift is negative (as a difference) or less than one (as a ratio), the items of the rule occur together less often than expected, i.e. the two products are unlikely to be purchased at the same time.

The goal of association discovery is to find all associations with at least s% support and c% confidence in the transaction data.
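The four measures can be computed directly from a list of transactions. The following sketch is illustrative only: the function name `rule_measures` and the toy transactions are hypothetical, and the data is constructed so that it reproduces the 5% support, 20% confidence and 40% expected confidence used in the examples above (with "student" treated as an item for simplicity).

```python
def rule_measures(transactions, antecedent, consequent):
    """Support, confidence, expected confidence and lift of antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_consequent = sum(1 for t in transactions if consequent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    support = n_both / n                 # share of transactions containing both sides
    confidence = n_both / n_antecedent   # share of antecedent transactions with consequent
    expected = n_consequent / n          # share of all transactions with consequent
    return {
        "support": support,
        "confidence": confidence,
        "expected_confidence": expected,
        "lift_ratio": confidence / expected,
        "lift_difference": confidence - expected,
    }


# Hypothetical data: 20 transactions.
transactions = (
    [{"student", "computer"}]
    + [{"student", "printer"}] * 4
    + [{"computer"}] * 7
    + [{"printer"}] * 8
)
print(rule_measures(transactions, {"student"}, {"computer"}))
```

Here the lift is below one as a ratio and negative as a difference, so in this toy data students buy a computer less often than customers in general.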

1. Data format

Two types of format are used to organise the data for association discovery:

1. Horizontal format: each entry is a row, and each attribute is a column.

51 Kotala, P. K., Perera, A., Kai Zhou, J., et al., 2001

52 WWW4


2. Vertical format: only one column is used for attributes. Different entries are denoted by different IDs. Attributes belonging to the same entry are assigned the same ID number.
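The two formats carry the same information and can be converted into each other. The sketch below assumes that the horizontal format stores one 0/1 flag per attribute column and that the vertical format is a list of (ID, attribute) pairs; both function names and the sample data are hypothetical.

```python
def horizontal_to_vertical(columns, rows):
    """rows: list of (entry_id, flags) with one 0/1 flag per column."""
    pairs = []
    for entry_id, flags in rows:
        for attribute, flag in zip(columns, flags):
            if flag:
                # One line per attribute; the shared ID ties them to one entry.
                pairs.append((entry_id, attribute))
    return pairs


def vertical_to_horizontal(pairs):
    """pairs: list of (entry_id, attribute); returns (columns, rows)."""
    columns = sorted({attribute for _, attribute in pairs})
    by_id = {}
    for entry_id, attribute in pairs:
        by_id.setdefault(entry_id, set()).add(attribute)
    rows = [
        (entry_id, [1 if c in attributes else 0 for c in columns])
        for entry_id, attributes in sorted(by_id.items())
    ]
    return columns, rows


# Hypothetical market baskets in horizontal format.
columns = ["bread", "milk", "beer"]
rows = [(1, [1, 1, 0]), (2, [0, 1, 1])]
vertical = horizontal_to_vertical(columns, rows)
print(vertical)
print(vertical_to_horizontal(vertical))
```

The vertical format is compact when each entry contains only a few of many possible attributes, whereas the horizontal format is convenient for column-wise counting.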

2. Apriori Algorithm

The most frequently used algorithm for association rules is the Apriori algorithm. It uses prior knowledge of itemset properties to explore their further associations. The steps are as follows:

Step 1: Set the minimal percentages of support and confidence as s% and c%.

Step 2: Find all the items with a frequency percentage above the set minimal support.

Step 3: Based on the set of frequent items, generate the associations that reach or exceed the set confidence level.

Step 4: Scan all the items to identify those that have at least s% support. Assign them as $L_1$.

Step 5: Form item pairs from $L_1$ and assign these candidate sets as $C_2$.

Step 6: Scan all the item pairs to find all the pairs in $C_2$ with at least s% support and c% confidence. Denote these sets as $L_2$.

Step 7: Iteration: repeat Step 5 and Step 6 until there are no more sets satisfying the constraints.

The general description for Step 5 and Step 6 is:

Build sets of $k$ items from $L_{k-1}$ and let them be $C_k$.

Scan all transactions and find all frequent sets in $C_k$ with at least s% support and c% confidence level; let them be $L_k$.
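The level-wise search can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name and the toy baskets are hypothetical, and, as a simplifying assumption, the frequent sets at each level are filtered by support only, while the confidence threshold is applied afterwards when rules are generated from the frequent sets.

```python
from itertools import combinations


def apriori(transactions, min_support, min_confidence):
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # First level: single items with at least the minimal support.
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        previous = set(level)
        # Candidates with k items, built from the frequent sets with k-1 items.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune candidates that have an infrequent subset of size k-1.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        level = [c for c in candidates if support(c) >= min_support]

    # Rule generation: keep rules that reach the minimal confidence.
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                antecedent = frozenset(antecedent)
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, confidence))
    return frequent, rules


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
frequent, rules = apriori(baskets, min_support=0.6, min_confidence=0.8)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), sup, conf)
```

The pruning step relies on the fact that every subset of a frequent itemset must itself be frequent, which is the prior knowledge the algorithm is named after.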


Sequential pattern discovery

Sequential pattern methods can be seen as an extended association rule method that analyses sequenced data. They extend association by adding time to the transactions: each transaction has a transaction time. Therefore, not only the attributes of each transaction but also the time at which the transaction took place should be taken into account. Sequential analysis searches for temporal links between items, rather than relationships between items in a single transaction.53

The sequential pattern method can find relationship patterns between items or itemsets within a time episode. For example, a typical sequential pattern could be "Six percent of customers who bought a CD player bought a CD within a week."

1. Data format

To start a sequential pattern discovery, each time series is converted into a multi-item entry and duplicated items are deleted. Afterwards, the association rule method can be used. The constraint is that all sequential patterns must satisfy the user-specified minimal support.

The sequential data is composed of sequences, or customer sequences. Each sequence is a list of one customer's orders (transactions). Each transaction contains a set of items. The length of a sequence is the number of itemsets that are contained in it. A sequence of length k is called a k-sequence.

2. Procedure

Sequential pattern discovery could be conducted by using the following steps: 54

Step 1: Sort phase. Sort the database according to customer id and transaction id.

Step 2: Itemset phase. Find all large sequences of length 1.

Step 3: Transformation phase. Transform each item in the sequence into an integer.

Step 4: Sequence phase. Find all large sequences.

Step 5: Maximal phase. Delete all non-maximal sequences.
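A reduced sketch of the first phases is given below. It only covers the sort phase and the counting of support for 1-sequences and 2-sequences made of single items; the transformation and maximal phases, and any time window such as "within a week", are left out. All function names and the sample records are hypothetical.

```python
from itertools import groupby


def customer_sequences(records):
    """records: (customer_id, transaction_time, items). Returns id -> list of itemsets."""
    # Sort phase: order by customer id, then by transaction time.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    return {
        customer: [frozenset(r[2]) for r in group]
        for customer, group in groupby(ordered, key=lambda r: r[0])
    }


def contains(sequence, pattern):
    """True if the itemsets of pattern occur in order in distinct transactions."""
    position = 0
    for itemset in sequence:
        if position < len(pattern) and pattern[position] <= itemset:
            position += 1
    return position == len(pattern)


def support(sequences, pattern):
    """Share of customers whose sequence contains the pattern."""
    return sum(contains(s, pattern) for s in sequences.values()) / len(sequences)


def large_two_sequences(sequences, min_support):
    # Large 1-sequences: single items supported by enough customers.
    items = sorted({i for s in sequences.values() for t in s for i in t})
    large_items = [
        i for i in items if support(sequences, [frozenset([i])]) >= min_support
    ]
    # Large 2-sequences: item a in one transaction, item b in a later one.
    result = {}
    for a in large_items:
        for b in large_items:
            sup = support(sequences, [frozenset([a]), frozenset([b])])
            if sup >= min_support:
                result[(a, b)] = sup
    return result


# Hypothetical purchase records, deliberately unsorted.
records = [
    (2, 3, {"cd", "headphones"}),
    (1, 1, {"cd_player"}),
    (3, 1, {"cd"}),
    (1, 5, {"cd"}),
    (2, 2, {"cd_player"}),
    (4, 4, {"cd_player"}),
]
sequences = customer_sequences(records)
print(large_two_sequences(sequences, min_support=0.5))
```

Longer sequences would be found by extending the large 2-sequences level by level, in the same way as candidate sets are extended in the Apriori algorithm.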

53 Wojciechowski, Marek

54 Han, J. and Kamber, M., 2001, pp. 225-271.

3. XploRe user and customer analysis55

3.1 About XploRe

XploRe is a professional statistical software package for high-end statistical analysis, advanced research and interactive teaching. It was developed in 1999 by Prof. Wolfgang Härdle and his team at Humboldt University of Berlin, Germany. XploRe is a module-structured, command-driven software. The statistical methods of XploRe are supported by various libraries. Therefore, users can incorporate their own methods in XploRe and easily extend the environment. The competitive advantage of XploRe lies in its rather advanced methods, particularly smoothing.

The purpose of XploRe lies in the exploration and analysis of data. According to Prof. Härdle (1999), it aims at sophisticated users who are looking for a flexible, programmable statistical package with emphasis on more advanced procedures.

The Internet is currently the main marketing instrument of XploRe. A free trial version of XploRe (with limitations) can be downloaded from the Internet.

3.2 XploRe user (2002) and customer descriptive analysis

3.2.1 Data collection

XploRe user data collection

XploRe users are the XploRe downloaders, i.e. those who have downloaded XploRe from the website. They are the potential customers of XploRe.

The collected raw data of XploRe users consists of 1734 profiles of individuals who downloaded the statistical software XploRe between October 11, 2001 and July 22, 2002. The data was collected through an online survey. A free trial version of XploRe could be downloaded via the homepage http://www.xplore-stat.de.

55 User refers to the person who downloaded XploRe from the Internet, while Customer refers to the person who bought XploRe.


All trial versions of XploRe (except for the Linux local version) lack some of the functions and commands of XploRe, expire after two months, and are limited to 1000 observations. The Linux local version has no expiration date and no limit on the number of observations.

Fig. 3.1: Sample of online survey questionnaire.

Before downloading, users are asked to participate in an online survey. The online questionnaire mainly has two parts. All questions (except for the e-mail address) are answered by selecting from a set of possible responses.

The first part of the questionnaire is Personal information, which inquires about personal identity and preferences. Some questions in this part, such as e-mail address and country, ask for the identity of the downloaders. We call them Identity questions. The other questions inquire about the preferences of downloaders, such as the way they learnt about XploRe, the work place where they use XploRe, the software they currently use, and the statistical methods they look for in XploRe. The answers to these questions are important for revealing the preferences of users and play a prominent role in user analysis. We call these questions substantive questions, because they provide the basic factors needed to subdivide the total user group into small homogeneous groups for our statistical user analysis.

The second part of the questionnaire contains technical questions. The downloaders are asked to choose the preferred version of XploRe56 and the operating system on which XploRe will be installed, such as Windows, Linux or Sun. An example questionnaire is attached in the Appendix.

During downloading, the date and IP address are automatically recorded. They are very helpful in the data cleaning procedure.

XploRe Customer data collection

XploRe customers here are those who have actually bought XploRe. I also call them actual customers. The XploRe customer data is collected through registration forms, which are sent to customers together with XploRe. Returning the registration form is not compulsory. The customer data covers the period from 1 July 2000 to 30 August 2002. Because the registration form was changed, data after this period was not used. The new registration form is attached in the Appendix for reference.

The registration form includes questions about the identity of the customer, such as country and language, questions about their professional fields, and questions about their operating systems.

As a result, we get 8 variables of customer data: country, federal state (Germany), language, title, operating system, profile sector, profile branch and sex.

3.2.2 Data cleaning and preparation

An analysis based on poor-quality or wrong data can deliver erroneous results no matter how sophisticated the statistical method is. Therefore, the raw data was thoroughly cleaned before being used for analysis.

XploRe user data cleaning

When people download XploRe, they obviously would like to complete the download process as quickly as possible and answer the questions as promptly as possible. If the questionnaire is too tedious or too complicated, downloaders may get impatient and give wrong or incomplete answers. In addition, in surveys

56 XploRe has three versions: Local version, Java-Client version and ReX, which is an Excel add-in.


it often happens that respondents are not very serious about their answers and do not give actual information.

To avoid including false information in the data, I used the personal questions as indicators of how seriously the questionnaire was taken and of the possibility of false answers. Many people gave obviously wrong answers to the personal questions. I assume that people who gave false answers to the personal questions would give false answers to the substantive questions as well. Furthermore, suspicious observations were identified by their recorded IP addresses, inspected, and then deleted according to a set of criteria.

The cleaning process was carried out mainly automatically with the Excel Visual Editor. However, the whole process of data cleaning could hardly be fully automated. Therefore, manual cleaning was also carried out to delete false information that the computer program could not identify, for instance by matching IP addresses and by deleting the profiles of members of the XploRe team. In the end, 1181 profiles remained for analysis after the cleaning.

XploRe customer data cleaning

The cleaning procedure for customer data is relatively simple. We suppose that customers know their answers will help XploRe to improve its service and therefore intend to provide correct information. The cleaning process therefore only includes the deletion of duplicated customer information.

3.2.3 Data descriptive analysis and result

In the first step, a descriptive analysis was conducted with XploRe to give an overview of the data.

XploRe User descriptive analysis

From the table in Appendix 1, XploRe user frequency analysis, we can see the frequency and percentage of each variable.

Concerning the sources through which users got to know XploRe, WWW/Newsgroup is the main source: 42.9% of the downloaders first learnt about XploRe through the Internet. The second main source is Publications and Journals; 20% of users learnt about XploRe through these channels.


49.4% of users work in a university, and 9.1% of users work in a research institute. Users from private, non-research companies account for 6.6%. Interestingly, a high percentage of users work at home: with 28.9% of the users, this group is the second biggest in this category.

Excel is the most popular software, used by 25.1% of all users. Next are SPSS and MatLab, with 11.2% and 10.4% of users respectively. XploRe is a command-driven software that is competitive in rather advanced statistical methods. Software such as S-Plus and GAUSS is more similar to XploRe in features and scope; their users comprise 5.5% and 4% of the total respectively. This fact shows that most users are more likely to choose more standard software such as Excel and SPSS, because of the higher programming requirements and the difficulties in using programmable, matrix-oriented software like XploRe. But the relatively high percentage of MatLab users signals an opportunity for XploRe, because MatLab is also programming-oriented software. There is a chance for XploRe marketing to win this type of customer.

A large share of XploRe users (24.1%) work in the field of Econometrics. The other popular work fields are Mathematical Statistics, Finance and actuarial science, and Physics and engineering, each of which accounts for about 10% of users.

The statistical methods most often used in the users' work are Time series, followed by Basic statistics, Multivariate methods and Linear models. But regarding the methods that the users look for in XploRe, there are some differences. The most wanted statistical methods are Time series and Multivariate methods, while Non- and semiparametric methods and Graphics and exploratory data analysis are ranked as the third and fourth most wanted methods, respectively. This difference suggests that the existing statistical software is weak at non- and semiparametric methods and at graphic/exploratory methods. Therefore, the users try to discover more powerful instruments for these two groups of methods. XploRe could emphasise its strength in them and thus expand its customer base.

86.5% of users downloaded the local version of XploRe, and 9.3% downloaded the ReX version, which is a statistical Microsoft Excel 2000 add-in. Only 4.1% of users downloaded the XploRe Java-Client version.

Windows NT is the dominant platform of the local version with 84.1% of users. Linux is also relatively popular: 13.2% of users downloaded the XploRe Linux version. For the Client version, Windows NT is still the dominant platform, and Linux accounts for only 6.1%. Other platforms account for very small fractions.


Name               Type         Modal Value      Modal Freq.  No. of Values
First Learn        Categorical  WWW, Newsgroup   42.9%        5
Work Place         Categorical  University       49.4%        6
Software           Categorical  Excel            25.1%        17
Work Field         Categorical  Econometrics     24.1%        10
Method Used        Categorical  Time Series      18.7%        12
Method Looked for  Categorical  Time Series      17.3%        12
Xversion           Categorical  Local            86.5%        3
Platform L         Categorical  Windows NT       84.1%        4
Platform C         Categorical  Windows NT       87.8%        4
OS Platform        Categorical  Windows NT       84.2%        4
Country            Categorical  Germany          16.9%        77
Continent          Categorical  Europe           52.7%        4

Tab. 3.1: Summary and description of the variables of the User 22/07/02 data

XploRe users have various national backgrounds. Users from Germany (16.9%), USA (15.7%) and Japan (8.6%) together make up about two fifths (41.2%) of the population.

More than half of the users (52.7%) are from Europe, followed by America and Asia-Pacific with 24.5% and 20.5% respectively. The reason might be that XploRe originates from Germany, and information and marketing are more active in Europe than in other areas.

Since the variables are categorical, we can draw a picture of the typical user of XploRe from the modal values. The modal user of XploRe is someone who is from Germany, works in a university and learnt about XploRe through the Internet. He uses Excel as his main statistical software, and he works in the field of econometrics. Time series are his main analysis method, and he looks for software that performs better in time series methods. He downloads the local version of XploRe, and Windows NT is his platform.
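The modal summary used for this profile can be reproduced with a short script. The analysis in this thesis was conducted with XploRe; the Python sketch below is only an illustration, the function names and the three sample profiles are hypothetical, and frequencies are taken relative to all profiles, including those with a missing answer.

```python
from collections import Counter


def modal_summary(records, variable):
    """Modal value, its relative frequency, number of distinct values, missing share."""
    total = len(records)
    values = [r[variable] for r in records if r.get(variable) is not None]
    counts = Counter(values)
    modal_value, modal_count = counts.most_common(1)[0]
    return {
        "modal_value": modal_value,
        "modal_freq": modal_count / total,
        "n_values": len(counts),
        "missing": (total - len(values)) / total,
    }


def modal_profile(records, variables):
    """The 'typical' profile: the modal value of every variable."""
    return {v: modal_summary(records, v)["modal_value"] for v in variables}


# Hypothetical cleaned profiles; None marks a missing answer.
profiles = [
    {"work_place": "University", "software": "Excel"},
    {"work_place": "University", "software": "SPSS"},
    {"work_place": "Home", "software": "Excel"},
    {"work_place": None, "software": "Excel"},
]
print(modal_summary(profiles, "work_place"))
print(modal_profile(profiles, ["work_place", "software"]))
```

Because the modal values are computed variable by variable, the resulting profile describes a typical combination of answers and need not correspond to any single respondent.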

XploRe Customer descriptive analysis

The result of the descriptive analysis of XploRe customers is summarised in the table of Appendix 2.

The customers of XploRe come mostly from Germany, which accounts for 34.4% of the total customers. Customers from the USA are the second biggest group, with


Name           Type         Modal Value         Modal Freq.  Missing Value
State          Categorical  Germany             34.4%        3.1%
Federal State  Categorical  Baden-Württemberg   3.1%         84.4%
Sex            Categorical  Man                 21.9%        0.0%
Language       Categorical  English             18.8%        59.4%
Title          Categorical  Prof.               9.4%         78.1%
OS Platform    Categorical  Windows             31.1%        68.8%
Sector         Categorical  Research Institute  34.4%        62.5%
Branch         Categorical  Economics           9.4%         78.1%

Note: 1. Federal state refers to the states of Germany.
      2. Federal state has no modal value, because all the values have the same percentage (3.1%).

Tab. 3.2: Summary and description of the variables for customer data

a share of 25%. Japanese customers follow with 9.4%. Customers from Italy make up 6.2% of the total. There are also customers from Denmark, France, Norway, The Netherlands, UK, China and Taiwan, each accounting for 3.1% of the customers. Therefore, Europe is the main customer market of XploRe, followed by America and Asia.

78.1% of XploRe customers are men. Women have a relatively lower percentage, only 21.9%. This corresponds to the findings for the XploRe users.

English is the main language used among the customers, followed by German,

French and Italian.

The customers of XploRe are highly educated: 21.8% of them hold the title of Prof., Dr. or Prof. Dr.

34.4% of customers work in research institutes. 3.1% of them work in companies.

Windows is the most popular platform. 21.3% of the customers use Windows as

their computing platform.

The professional fields in which the customers work are diverse. Econometrics has the highest share among them, 9.4%. The other professional fields indicated in the data are statistics, biostatistics, mathema
