Comptage probabiliste entre mathematique et informatique
40 Pages
English
Gain access to the library to view online
Learn more

Comptage probabiliste entre mathematique et informatique

-

Gain access to the library to view online
Learn more
40 Pages
English

Description

LIPN NOV 2006 Comptage probabiliste : entre mathematique et informatique Philippe Flajolet, INRIA, Rocquencourt 1

  • outgoing flows

  • comptage probabiliste

  • estimate n?

  • ghfffghfghgghggggghghheehfhfhhgghghghhfgffffhhhiigfhhffgfiihfhhh igigighfgihfffghigihghigfhhgeegeghgghhhgghhfhidiigihighihehhhfgg

  • loglog counting

  • distinct words

  • very little

  • end estimate


Subjects

Informations

Published by
Published 01 November 2006
Reads 20
Language English

Exrait

LIPN
NOV 2006
Comptage probabiliste : entre
´
mathematique et informatique
Philippe Flajolet, INRIA, Rocquencourt
http://algo.inria.fr/flajolet
114
Routers in the rangeofTerabits/sec (10 b/s).
Google indexes 6 billion pages & prepares to
17
index 100 Petabytes ofdata (10 B).
What aboutcontent?
Message: Can get a few key characteristics, QUICK and EASY
Combinatorics +algorithms +probabilities +analysis areuseful!
2FromEstan-Varghese-Fisk: tracesofattacks
Neednumberofactiveconnectionsin time slices.
Incoming/Outgoingflows at 40Gbits/second.
Code RedWorm: 0.5GBytesof compressed data perhour (2001).
CISCO: in11 minutes,a worm infected500,000,000 machines.
3ThesituationislikelisteningtoaplayofShakespeareandatthe
end estimate the number of different words.
Rules: Very little computation per element scanned, very little
auxiliarymemory.
FromDurand-Flajolet,LOGLOG Counting(ESA-2003):
WholeofShakespeare,m = 256small“bytes”of4bitseach=128bytes
ghfffghfghgghggggghghheehfhfhhgghghghhfgffffhhhiigfhhffgfiihfhhh
igigighfgihfffghigihghigfhhgeegeghgghhhgghhfhidiigihighihehhhfgg
hfgighigffghdieghhhggghhfghhfiiheffghghihifgggffihgihfggighgiiif
fjgfgjhhjiifhjgehgghfhhfhjhiggghghihigghhihihgiighgfhlgjfgjjjmfl

Estimaten ≈ 30,897vsn = 28,239distinct words. Error: +9.4% w/128 bytes!
4Uses:
— Routers: intrusion, flowmonitoring & control
0
— Databases: Query optimization, cfM∪M for multisets; Esti-
mating the size of queries & “sketches”.
— Statistics gathering: on the fly, fast and with little memory
evenon“unclean” data' layer 0 of “data mining”.
5This talk:
• Estimatingcharacteristics of large data streams
— sampling;size & cardinality& nonuniformityindex [F ,F ,F ]
1 0 2
; powerofrandomizationvia hashing
Gains bya factorof>400 [Palmer et al.]
• Analysis of algorithms
— generatingfunctions,complex asymptotics,Mellin transforms
Nice problemsfortheoreticians.
• Theory and Practice
— Interplayofanalysisand design; super-optimizedalgorithms.
6Problems on Streams
Given: S = a large streamS = (r ,r ,...,r ) withduplicates
1 2 `
—||S| =length orsize: total # ofrecords (`)
—|S| = cardinality: # of distinct records (c)
♦ Howto estimatesize, cardinality,etc?
X
p
Moregenerally,iff is frequencyofvaluev: F := (f ) .
v p v
v∈D
CardinalityisF ;sizeisF ;F isindicatorofnonuniformityofdistribution;
0 1 2
“F ” is most frequentelement[Alon,Matias, Szegedy, STOC96]

♦ Howto sample?
— withor withoutmultipicity
♦ Howto find icebergs,mice, elephants?
71 ICEBERGS—COMBINATORICS HELPS!
Definition: A k–iceberg is present
in proportion> 1/k.
Onepassdetectionoficebergsfork = 2 using 1 registersis possible.
— Trigger agang war: equipeach individual with agun.
—Each guy shootsa guy froma differentgang,thencommitssuicide:
Majoritygang survives.
—Implementsequentially&adapttok≥ 2withk−1registers. [Karp
et al. 2003]
82 APPROXIMATE COUNTING—THE DYADIC PARADIGM
How to find an integer while posing fewquestions?
— Ask if in [1—2], [2—4], [4—8], [8–16], etc?
— Conclude by binary search: cost is 2log n.
2
A paradigmfor unboundedsearch:
• Ethernet proceeds by period doubling + randomization.
+
• Wake up process for mobile communication [OCAD: Lavault ]
• Adaptive data structures: e.g., extendible hashing tables.
♥Approximate Counting
9Approximate counting—probabilities help!
Theoldest algorithm [Morris CACM:1977], analysis[F, 1985].
MaintainF , i.e., countersubject toC :=C +1.
1
3/4
1/2 7/8
1
1/8
1/2 1/4
ALG: ApproximateCouting
Initialize: X := 1;
−X
Increment: doX :=X +1 with probability 2 ;
X
Output: 2 −2.
Theorem: Counttilln probabilistically using log logn+δ bits, withac-
2
−δ/2
curacy about 0.59·2 .
16
Beatsinformation theory(!?): 8 bitsfor counts≤ 2 w/ accuracy≈ 15%.
10