2
$\begingroup$

I have a large dataset (hundreds of millions of records, tens of GB) and would like to run LOF (Local Outlier Factor) on it for anomaly detection. I am comparing different methods for academic purposes: the plan is to train on this dataset and then test on a smaller labeled dataset to check each method's accuracy. Since it is hard to fit all the data at once, is there an implementation that allows training in batches? How would you approach it?

$\endgroup$

1 Answer

2
$\begingroup$

How about sampling a subset of that data, and using that 'exploration set' for your initial comparisons? That's what I would first consider. You are undertaking an exploratory analysis that will require a lot of iteration - having a good sample that you can efficiently run locally is very important in my experience.

I would use stratified sampling so that your exploration set stays representative of the original data's distribution.
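
As a rough sketch of what stratified subsampling could look like with scikit-learn's `train_test_split`: the data below are synthetic, and `segment` is a made-up categorical column (an assumption, not something from your data) that serves only as the stratification key.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the full dataset (assumption: 4 numeric features).
n = 10_000
X = rng.normal(size=(n, 4))
# Hypothetical categorical column used as the stratification key.
segment = rng.choice(["a", "b", "c"], size=n, p=[0.7, 0.25, 0.05])

# Keep 10% of the rows while preserving the segment proportions.
X_explore, _, seg_explore, _ = train_test_split(
    X, segment, train_size=0.1, stratify=segment, random_state=0
)

def proportions(labels):
    """Return the share of each distinct label as a dict."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values, counts / counts.sum()))

print(proportions(segment))
print(proportions(seg_explore))
```

The two printed dictionaries should be close to each other, which is the point of stratifying: even the rare segment keeps roughly its original share in the exploration set.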

Since your ultimate goal is to apply an algorithm to big data, you might want to rule out algorithms that scale poorly, even if they perform well. For example, if an algorithm doesn't have a batched/online implementation (`.partial_fit()` in sklearn), a parallelised implementation, or a GPU implementation, then it might not be worth spending much time on it. I would still include it in an initial comparison on the exploration set to get a feel for the data and how it interacts with different algorithms.
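
To make this concrete for LOF: scikit-learn's `LocalOutlierFactor` has no `partial_fit()`, so one workaround is to fit it on the exploration set with `novelty=True` and then score new data chunk by chunk. This is only a sketch on synthetic data; the sample sizes, `n_neighbors`, `chunk_size`, and the helper `score_in_chunks` are illustrative assumptions. Note that it batches the scoring step, not the training step.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic stand-ins: a sample of "normal" data to fit on, and a test set
# with 10 obvious outliers appended at the end.
X_sample = rng.normal(size=(2000, 3))
X_test = np.vstack([
    rng.normal(size=(500, 3)),
    rng.normal(loc=8.0, size=(10, 3)),
])

# novelty=True enables score_samples/predict on data not seen during fit.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(X_sample)

def score_in_chunks(model, X, chunk_size=128):
    """Score X piece by piece so only one chunk is processed at a time."""
    parts = [
        model.score_samples(X[start:start + chunk_size])
        for start in range(0, len(X), chunk_size)
    ]
    return np.concatenate(parts)

# score_samples returns the negated LOF: lower means more abnormal.
scores = score_in_chunks(lof, X_test)
print(scores[:500].mean(), scores[-10:].mean())
```

In a real run each chunk would be read from disk rather than sliced from an in-memory array, but the fitted model still has to hold the exploration set in memory.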

I think a strong candidate here would also be neural nets coded up in PyTorch or similar, since they can efficiently handle and learn from large datasets.

$\endgroup$
2
  • $\begingroup$ I guess you are right. The only question is how you would perform stratified sampling when records are dependent (the data consist of transactions, which often originate from the same person). $\endgroup$ Commented Jan 6 at 17:56
  • $\begingroup$ sklearn has various splitters, including StratifiedGroupKFold which will stratify whilst ensuring that groups (e.g. people) that appear in the training set will not appear in the validation set. $\endgroup$ Commented Jan 6 at 18:17
