Abstract:
|
One of the biggest research challenges in KDD and Data Mining
is to develop methods that scale up well to large amounts of data.
One approach to achieving scalability is to draw a random sample and perform data mining on it. In this paper, we propose an adaptive sampling method for solving a variety of data mining tasks that arise in practice on very large data sets.
Our algorithms are adaptive in the sense that they determine
from the data whether they have already seen enough
data to reach a reliable conclusion.
We prove the correctness of our method,
estimate its efficiency theoretically, and show its
efficiency experimentally on a concrete task requiring sampling. |