Abstract:
|
Association rule discovery is one of the prototypical problems in
data mining. In this problem, the input database is assumed to be very
large and most of the algorithms are designed to minimize the number
of scans of the database. Enumerating association rules is usually an
expensive task due to the size of the input database. A proposed
approach for reducing the running time of this process is random
sampling. Of course, any implementation of
an algorithm that uses sampling must solve the problem of determining
which sample size is appropriate. Previous research of sampling for
association rule mining has approached this problem concluding that,
in general, the theoretically obtained sample size bounds are far from
what is observed in practice. In this paper, we try to reduce this
gap between theory and practice. We propose two on-line sampling
algorithms for association rule mining. Our algorithms maintain the
same theoretical guarantees of previous approaches while using a much
smaller number of transactions in most of the cases. In the experiments
we report, this improvement is often by an order of magnitude. |