
Seminar: New Approaches in AI

Presentation topic:

SPRINT: A Scalable Parallel Classifier for Data Mining

Athina Poppi, Uni Dortmund, 4.6.2002

Contents

1. Classification

2. Decision tree

3. SPRINT

4. Conclusion

5. References


1.1 Classification

Goal: build a classification model that can predict which class a data record belongs to.

Several methods exist. The most popular: decision trees (they can be constructed relatively quickly, are easy to interpret, and achieve similar, often better, accuracy than other methods)

Applications: target marketing, fraud detection, and medical diagnosis


1.2 Classification model

• Training set: the data set used to build the classification model.

• Training sample: a single record of the training set.

• Attributes: continuous (e.g. income, age) or categorical (e.g. car type, sport).

• Continuous vs. categorical: ordered vs. unordered domain.

• Classifying attribute: the attribute that holds the class label of each record.


2.1 Decision tree

• Consists of several nodes

• Each node is either a leaf or a decision node (split point)

• Leaf: represents a class

• Split point: this is where the test is performed


2.2 Example: car insurance

Training set

Tid  Age  Car Type  Risk
0    23   Family    High
1    17   Sports    High
2    43   Sports    High
3    68   Family    Low
4    32   Truck     Low
5    20   Family    High

Decision tree

Age < 25
    yes: High
    no:  CarType in {Sports}
             yes: High
             no:  Low
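The example tree can be written as a small function and checked against the training set. This is only an illustrative sketch; the names `TRAINING_SET` and `classify` are assumptions and do not come from the slides.

```python
# Training set from the car-insurance example: (tid, age, car_type, risk).
TRAINING_SET = [
    (0, 23, "Family", "High"),
    (1, 17, "Sports", "High"),
    (2, 43, "Sports", "High"),
    (3, 68, "Family", "Low"),
    (4, 32, "Truck", "Low"),
    (5, 20, "Family", "High"),
]


def classify(age: int, car_type: str) -> str:
    """Walk the example tree: test Age first, then CarType."""
    if age < 25:                  # root split point
        return "High"
    if car_type in {"Sports"}:    # second split point
        return "High"
    return "Low"                  # remaining leaf


# Records whose predicted class differs from the recorded risk (none expected).
misclassified = [tid for tid, age, car, risk in TRAINING_SET
                 if classify(age, car) != risk]
```

Every leaf is reached by at most two tests, which is one reason such trees are easy to interpret.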

3. SPRINT

• Scalable PaRallelizable INduction of decision Trees

• Developed at: IBM Almaden

• Decision-tree-based classification algorithm

• Serial algorithm

• Excellent scaleup, speedup and sizeup properties


3.1 Serial algorithm

• 2 phases: growth phase and prune phase.

• Growth phase: the tree is built by recursively partitioning the data.

• Prune phase: the tree is pruned, i.e. generalized, to prevent overfitting caused by outliers or erroneous records in the training data. Time required: only about 1% of the total run time of tree construction.


3.2 Recursive tree-growth algorithm

Partition (Data S)
    if (all points in S are from the same class) then
        return;
    for each attribute A do
        evaluate splits on attribute A;
    Use best split found to partition S into S1 and S2;
    Partition (S1);
    Partition (S2);

Initial call: Partition(TrainingData)
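The recursion can be sketched in Python as follows. This is a simplified illustration, not SPRINT itself: it works on in-memory records instead of attribute lists, and it assumes the gini index as the split criterion (the criterion used in the SPRINT paper). All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

TRAINING_SET = [
    {"Age": 23, "CarType": "Family", "Risk": "High"},
    {"Age": 17, "CarType": "Sports", "Risk": "High"},
    {"Age": 43, "CarType": "Sports", "Risk": "High"},
    {"Age": 68, "CarType": "Family", "Risk": "Low"},
    {"Age": 32, "CarType": "Truck", "Risk": "Low"},
    {"Age": 20, "CarType": "Family", "Risk": "High"},
]


def gini(records):
    """Gini index of a non-empty record set: 1 - sum of squared class shares."""
    n = len(records)
    counts = Counter(r["Risk"] for r in records)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def candidate_splits(records):
    """Yield (label, test) pairs: Age < x and CarType in S."""
    ages = sorted({r["Age"] for r in records})
    for lo, hi in zip(ages, ages[1:]):
        x = (lo + hi) / 2  # midpoint between two consecutive values
        yield f"Age < {x}", (lambda r, x=x: r["Age"] < x)
    types = sorted({r["CarType"] for r in records})
    for k in range(1, len(types)):  # proper, non-empty subsets only
        for subset in combinations(types, k):
            s = frozenset(subset)
            yield f"CarType in {sorted(s)}", (lambda r, s=s: r["CarType"] in s)


def partition(records):
    classes = {r["Risk"] for r in records}
    if len(classes) == 1:  # all points are from the same class
        return {"leaf": classes.pop()}
    best = None
    for label, test in candidate_splits(records):
        s1 = [r for r in records if test(r)]
        s2 = [r for r in records if not test(r)]
        score = (len(s1) * gini(s1) + len(s2) * gini(s2)) / len(records)
        if best is None or score < best[0]:
            best = (score, label, s1, s2)
    if best is None:  # identical attribute values but mixed classes
        majority = Counter(r["Risk"] for r in records).most_common(1)[0][0]
        return {"leaf": majority}
    _, label, s1, s2 = best
    return {"test": label, "yes": partition(s1), "no": partition(s2)}


tree = partition(TRAINING_SET)  # initial call
```

On the six example records this sketch picks Age < 27.5 at the root and a CarType test below it.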

3.3 Data structures

• Attribute lists: each entry consists of an attribute value, the class label, and a key (tuple identifier, Tid).

• Histograms: for continuous attributes, 2 histograms (Cbelow and Cabove) are associated with each decision-tree node. Categorical attributes need only 1 histogram (the count matrix).
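A minimal sketch of how attribute lists could be built from the example training set. The `Entry` type and `build_attribute_lists` are illustrative names, not taken from the slides; the list of a continuous attribute is sorted by value once, while the categorical list stays in Tid order.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    value: object  # attribute value
    cls: str       # class label
    tid: int       # tuple identifier


# (tid, age, car_type, risk)
TRAINING_SET = [
    (0, 23, "Family", "High"),
    (1, 17, "Sports", "High"),
    (2, 43, "Sports", "High"),
    (3, 68, "Family", "Low"),
    (4, 32, "Truck", "Low"),
    (5, 20, "Family", "High"),
]


def build_attribute_lists(rows):
    """Return one attribute list per attribute: (age_list, car_list)."""
    age_list = sorted(
        (Entry(age, risk, tid) for tid, age, _, risk in rows),
        key=lambda e: e.value,  # continuous attribute: sort by value
    )
    car_list = [Entry(car, risk, tid) for tid, _, car, risk in rows]
    return age_list, car_list


age_list, car_list = build_attribute_lists(TRAINING_SET)
```

Because every entry carries its Tid, the lists of different attributes can later be split consistently without joining them back into full records.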


3.4 Splitting a node's attribute lists

Attribute lists for node 0

Age  Class  Tid
17   High   1
20   High   5
23   High   0
32   Low    4
43   High   2
68   Low    3

Car Type  Class  Tid
Family    High   0
Sports    High   1
Sports    High   2
Family    Low    3
Truck     Low    4
Family    High   5

Split at node 0: Age < 27.5 (child nodes 1 and 2)

Attribute lists for node 1

Age  Class  Tid
17   High   1
20   High   5
23   High   0

Car Type  Class  Tid
Family    High   0
Sports    High   1
Family    High   5
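The split can be sketched as two passes: the list of the splitting attribute is divided by the test, and the Tids are recorded in a hash table that is then probed to divide the other list. The code below is an illustration with hypothetical names (`split_lists`, `goes_left`), operating on the node 0 lists of the example.

```python
# Attribute lists for node 0 as (value, class, tid).
AGE_LIST = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
CAR_LIST = [("Family", "High", 0), ("Sports", "High", 1), ("Sports", "High", 2),
            ("Family", "Low", 3), ("Truck", "Low", 4), ("Family", "High", 5)]


def split_lists(split_list, other_list, test):
    """Split both lists so that each child gets matching Tids."""
    # Pass 1: split the winning attribute's list and remember,
    # per Tid, which child the record went to.
    goes_left = {}
    left, right = [], []
    for value, cls, tid in split_list:
        goes_left[tid] = test(value)
        (left if goes_left[tid] else right).append((value, cls, tid))
    # Pass 2: probe the hash table to split the other attribute's list.
    other_left, other_right = [], []
    for entry in other_list:
        (other_left if goes_left[entry[2]] else other_right).append(entry)
    return (left, other_left), (right, other_right)


node1, node2 = split_lists(AGE_LIST, CAR_LIST, lambda age: age < 27.5)
```

Both passes keep the original order, so the sorted Age list never has to be sorted again in the child nodes.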

3.5 Evaluating continuous split points

Attribute list

Age  Class  Tid
17   High   1
20   High   5
23   High   0
32   Low    4
43   High   2
68   Low    3

State of the class histograms at each cursor position in the scan

Cursor position  Histogram  H  L
Position 0       Cbelow     0  0
                 Cabove     4  2
Position 3       Cbelow     3  0
                 Cabove     1  2
Position 6       Cbelow     4  2
                 Cabove     0  0
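The scan can be sketched as a single pass over the sorted list that moves one record at a time from Cabove to Cbelow and scores the split at each position. The gini index as the score and midpoints as candidate split values follow the SPRINT paper; the function names are illustrative assumptions.

```python
from collections import Counter

# Sorted Age attribute list as (value, class, tid).
AGE_LIST = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]


def gini(hist):
    """Gini index of a class histogram; 0.0 for an empty histogram."""
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in hist.values())


def best_continuous_split(attr_list):
    """Return (split_value, gini_split) of the best test 'A < split_value'."""
    c_below = Counter()                                # nothing scanned yet
    c_above = Counter(cls for _, cls, _ in attr_list)  # all records
    n = len(attr_list)
    best_point, best_score = None, float("inf")
    for i, (value, cls, _) in enumerate(attr_list):
        # Advance the cursor past record i.
        c_below[cls] += 1
        c_above[cls] -= 1
        # No candidate after the last record or between equal values.
        if i + 1 == n or attr_list[i + 1][0] == value:
            continue
        n_below = i + 1
        score = (n_below * gini(c_below) + (n - n_below) * gini(c_above)) / n
        if score < best_score:
            best_score = score
            best_point = (value + attr_list[i + 1][0]) / 2
    return best_point, best_score


split_value, split_score = best_continuous_split(AGE_LIST)
```

For the example list the best candidate lies at cursor position 3, between the values 23 and 32.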

3.6 Evaluating categorical split points

Attribute list

Car Type  Class  Tid
Family    High   0
Sports    High   1
Sports    High   2
Family    Low    3
Truck     Low    4
Family    High   5

Count matrix

          H  L
Family    2  1
Sports    2  0
Truck     0  1
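A minimal sketch of building the count matrix in one pass over the categorical attribute list. `count_matrix` is an illustrative name, and the nested `Counter` layout is just one possible representation.

```python
from collections import Counter, defaultdict

# Car Type attribute list as (value, class, tid).
CAR_LIST = [("Family", "High", 0), ("Sports", "High", 1), ("Sports", "High", 2),
            ("Family", "Low", 3), ("Truck", "Low", 4), ("Family", "High", 5)]


def count_matrix(attr_list):
    """Map each attribute value to a histogram of class labels."""
    matrix = defaultdict(Counter)
    for value, cls, _ in attr_list:
        matrix[value][cls] += 1
    return matrix


matrix = count_matrix(CAR_LIST)
```

The matrix holds everything needed to score any subset test on Car Type, so the attribute list itself is scanned only once.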

3.7 Finding split points

• The split test depends on the type of the attribute.

• Continuous: A < x, where x is a value in the domain of A.

• Categorical: B ∈ S, where S is a subset of the domain of B.

• Best split point: the one that best partitions the training data associated with this node.

• The quality of a split depends on how well it separates the different classes from one another.
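The slides do not name the quality measure; the SPRINT paper uses the gini index for it. As a worked example under that assumption, with $p_j$ the relative frequency of class $j$ in $S$, and $n_1$, $n_2$ the sizes of the two partitions of the $n$ records:

```latex
\begin{align*}
\mathit{gini}(S) &= 1 - \sum_{j} p_j^2 \\
\mathit{gini}_{\mathit{split}}(S) &= \frac{n_1}{n}\,\mathit{gini}(S_1) + \frac{n_2}{n}\,\mathit{gini}(S_2) \\[6pt]
\text{Age} < 27.5:\quad
\mathit{gini}(S_1) &= 1 - 1^2 = 0 \\
\mathit{gini}(S_2) &= 1 - \left(\tfrac{1}{3}\right)^2 - \left(\tfrac{2}{3}\right)^2 = \tfrac{4}{9} \\
\mathit{gini}_{\mathit{split}}(S) &= \tfrac{3}{6}\cdot 0 + \tfrac{3}{6}\cdot\tfrac{4}{9} = \tfrac{2}{9} \approx 0.22
\end{align*}
```

Here $S_1$ holds the three High records with Age below 27.5 and $S_2$ holds one High and two Low records. A lower value means a purer partition; a split that separates the classes perfectly scores 0.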


3.8 Parallelizing classification

• Growth phase: the primary problems remain finding good split points and partitioning the data using the discovered split points.

• SPRINT: parallelizes quite naturally and efficiently (by design).

• Each of the N processors works on only 1/N of the total data.

• Finding split points: similar to the serial version. Differences appear only before and after the attribute-list partitions are scanned.

• Continuous: the difference lies in how Cbelow and Cabove are initialized.

• Categorical: the local count matrices are combined into a global count matrix.
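The different initialization for continuous attributes can be sketched as follows: before its scan, each processor needs Cbelow to cover all sections held by lower-rank processors and Cabove to cover its own and all higher-rank sections. The code simulates two processors sequentially in one process; no real message passing is used, and `initial_histograms` is an illustrative name.

```python
from collections import Counter

# Sections of the sorted Age list, one per processor: (value, class, tid).
SECTIONS = [
    [(17, "High", 1), (20, "High", 5), (23, "High", 0)],  # processor 0
    [(32, "Low", 4), (43, "High", 2), (68, "Low", 3)],    # processor 1
]


def initial_histograms(sections):
    """Return the starting (Cbelow, Cabove) pair for each processor."""
    # Each processor counts the classes in its own section.
    local = [Counter(cls for _, cls, _ in section) for section in sections]
    # In a real run these local counts would be exchanged between processors.
    total = sum(local, Counter())
    result = []
    below = Counter()
    for counts in local:
        above = total - below  # own section plus all higher-rank sections
        result.append((Counter(below), above))
        below = below + counts
    return result


hists = initial_histograms(SECTIONS)
```

After this step each processor can scan its section exactly as in the serial algorithm; for categorical attributes the analogous step is summing the local count matrices.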

3.9 Parallel data placement

Processor 0

Age  Class  Tid
17   High   1
20   High   5
23   High   0

Car Type  Class  Tid
Family    High   0
Sports    High   1
Sports    High   2

Processor 1

Age  Class  Tid
32   Low    4
43   High   2
68   Low    3

Car Type  Class  Tid
Family    Low    3
Truck     Low    4
Family    High   5

3.10 Speedup of SPRINT


3.11 Performance

• The parallel version was implemented on a 16-node IBM SP2 Model 9076 using standard MPI communication primitives.

• Each node has one processor running at 62.5 MHz with 128 MB of memory.

• All processors run AIX level 4.1.

• Although SPRINT is slower than other algorithms, it exhibits almost linear scaleup.


3.12 Uniprocessor performance


4. Conclusion

SPRINT is a classification algorithm that exhibits excellent scalability and can handle large data sets that other algorithms cannot. BUT:

1. It incurs significant communication overhead per processor.

2. The probe structure (implemented as a hash table) is memory-intensive; its size is of the same order as the size of the initial training set.

Improved version of SPRINT: ScalParC


5. References

• SPRINT: A Scalable Parallel Classifier for Data Mining, John Shafer, Rakesh Agrawal, Manish Mehta, Proceedings of the 22nd VLDB Conference, Mumbai (Bombay), India, 1996

• Parallele Data Mining Algorithmen, author: Rudi Husser, supervisor: Ralf Rantzau, examiner: Prof. Bernhard Mitschang, date: 21.02.02, Uni Stuttgart

• Lecture on KDD, Ludwig-Maximilians-Universität München, winter semester 2000/2001
