Very Fast Machine Learning Toolkit

Pedro Domingos and Geoff Hulten developed VFML in 2003 to experiment with applying machine learning to streaming data whose scale makes traditional techniques impractical. Their original work is described in:

Hulten, G. and Domingos, P. "VFML -- A toolkit for mining high-speed time-changing data streams" http://www.cs.washington.edu/dm/vfml/. 2003.

Very Fast Decision Tree

JVFML is a Java implementation of Hulten and Domingos' Very Fast Decision Tree (VFDT) algorithm for building decision trees from streaming data using a statistical result known as the Hoeffding Bound. The Hoeffding Bound is used to decide when enough data instances have been processed to split a tree node and be confident that a traditional batch learner with all the data available would have made the same decision.
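
The split test described above can be sketched in a few lines of Java. This is an illustration only, not JVFML's actual code: the class and method names are hypothetical, and the tie-breaking threshold and grace period of the full VFDT algorithm are omitted. It assumes the standard form of the Hoeffding Bound, epsilon = sqrt(R^2 * ln(1/delta) / (2n)), where R is the range of the split measure (log2 of the number of classes for information gain), delta is the allowed probability of choosing the wrong attribute, and n is the number of instances seen at the node.

```java
final class HoeffdingBound {

    /** Hoeffding Bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)). */
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /**
     * Split when the observed gap between the best and second-best
     * attribute exceeds epsilon, so the ranking is unlikely to change
     * as more instances arrive.
     */
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return bestGain - secondBestGain > epsilon(range, delta, n);
    }

    public static void main(String[] args) {
        double range = 1.0;  // log2(2): information gain on a two-class problem
        double delta = 1e-7; // illustrative confidence parameter
        for (long n : new long[] {100, 1000, 10000}) {
            System.out.printf("n=%d epsilon=%.4f split=%b%n",
                    n, epsilon(range, delta, n),
                    shouldSplit(0.30, 0.15, range, delta, n));
        }
    }
}
```

Because epsilon shrinks as n grows, a fixed gap between the two best attributes that is too small to justify a split after 100 instances can be enough after 1000.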

Weka Integration

JVFML is designed to interface with Weka. Although using Weka eliminates a major advantage of VFDT (its ability to process streaming data sets one data instance at a time without ever loading the entire data set into memory), the Weka implementation is potentially a useful tool for experimenting with the algorithm.
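
To illustrate what processing one instance at a time means, here is a minimal sketch of single-pass counting. It is not taken from JVFML or Weka: the class name and the comma-separated input format (nominal values, class label last) are assumptions made for the example. Each row updates a table of (attribute, value, class) counts and is then discarded, so memory use depends on the number of distinct values rather than on the number of instances.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class StreamingCounts {
    // Counts keyed by "attributeIndex|value|classLabel".
    private final Map<String, Integer> counts = new HashMap<>();
    private long seen = 0;

    /** Update the counts from one instance; the row is not stored. */
    void update(String[] row) {
        String label = row[row.length - 1];
        for (int i = 0; i < row.length - 1; i++) {
            counts.merge(i + "|" + row[i] + "|" + label, 1, Integer::sum);
        }
        seen++;
    }

    /** Read the source line by line, never holding more than one row. */
    void consume(Reader source) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) continue;
                update(line.trim().split(","));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int count(int attribute, String value, String label) {
        return counts.getOrDefault(attribute + "|" + value + "|" + label, 0);
    }

    long seen() {
        return seen;
    }

    public static void main(String[] args) {
        StreamingCounts stats = new StreamingCounts();
        stats.consume(new StringReader("sunny,yes\nrainy,no\nsunny,yes\n"));
        System.out.println(stats.seen() + " instances, sunny/yes count = "
                + stats.count(0, "sunny", "yes"));
    }
}
```

Counts of this kind are sufficient to compute a split measure such as information gain for nominal attributes, which is why the raw instances do not need to be kept.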

Other Implementations

The developers of Weka have also developed a streaming machine learning toolkit called MOA. This software contains an implementation of VFDT as well as a number of other stream classifiers.

Domingos and Hulten also have an implementation of VFML in C. Their original source code can be downloaded from the VFML SourceForge repository. However, on Ubuntu 12.04 LTS (and possibly other modern Linux distros), the project does not build as-is. A slightly modified version of Domingos and Hulten's original C code that compiles under Ubuntu 12.04 LTS is packaged with JVFML.

Installation Instructions

Download Weka 3.6 from the project's Downloads page and install it. Note: JVFML currently does not work with Weka version 3.7.
Download the compiled JVFML jar: vfml-weka-1.0.0.jar.
Copy vfml-weka-1.0.0.jar into your Weka application home directory.
Open a terminal (cmd.exe on Windows).
Include vfml-weka-1.0.0.jar on the Java classpath when launching Weka:
java -classpath weka.jar;vfml-weka-1.0.0.jar weka.gui.GUIChooser

Note: Linux and macOS users should use : instead of ; as the classpath separator when typing the previous command into the terminal.

Note: To work with large data sets, the Java heap space allocated to Weka may need to be increased. For example, to give Weka 2GB of memory, add the option `-Xmx2g` to the command above.
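
The platform difference in the classpath separator is what Java's `File.pathSeparator` constant encodes, so the launch command can be assembled without hard-coding `;` or `:`. The helper below is a hypothetical sketch, not part of JVFML; it assumes both jars are in the current directory and only builds and prints the command.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class WekaLaunchCommand {

    /** Build the launch command; pass null for maxHeap to omit -Xmx. */
    static List<String> build(String pathSeparator, String maxHeap) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        if (maxHeap != null) {
            cmd.add("-Xmx" + maxHeap); // e.g. "2g" for a 2GB heap
        }
        cmd.add("-classpath");
        cmd.add("weka.jar" + pathSeparator + "vfml-weka-1.0.0.jar");
        cmd.add("weka.gui.GUIChooser");
        return cmd;
    }

    public static void main(String[] args) {
        // File.pathSeparator is ";" on Windows and ":" on Linux and macOS.
        System.out.println(String.join(" ", build(File.pathSeparator, "2g")));
    }
}
```

The resulting list could also be passed to `ProcessBuilder` to start Weka from another Java program.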

Usage Instructions

Once Weka has been launched with the JVFML jar on the Java classpath, JVFML can be used like any other Weka classifier. There should be two new classifiers available under weka/classifiers/trees: VFDT and CVFDT (CVFDT is an extension which supports adapting to concept drift). Note that both classifiers currently only support nominal attributes and do not support missing values.
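
Since both restrictions can be checked before training, a small pre-flight scan of an ARFF file may save a failed run. The sketch below is a hypothetical helper, not part of JVFML or Weka. It relies on the ARFF conventions that nominal attributes are declared with a `{...}` value list and that missing values are written as `?`, and it makes simplifying assumptions: attribute names contain no spaces, and data values contain no quoted commas.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ArffPreflight {

    /** Return one message per non-nominal attribute or row with a missing value. */
    static List<String> check(Reader source) {
        List<String> problems = new ArrayList<>();
        boolean inData = false;
        int lineNo = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                String t = line.trim();
                if (t.isEmpty() || t.startsWith("%")) continue; // blank or comment
                String lower = t.toLowerCase(Locale.ROOT);
                if (!inData && lower.startsWith("@attribute")) {
                    // Expected shape: @attribute <name> <type>
                    String[] parts = t.split("\\s+", 3);
                    if (parts.length < 3 || !parts[2].startsWith("{")) {
                        problems.add("line " + lineNo + ": attribute is not nominal");
                    }
                } else if (lower.startsWith("@data")) {
                    inData = true;
                } else if (inData) {
                    for (String v : t.split(",")) {
                        if (v.trim().equals("?")) {
                            problems.add("line " + lineNo + ": missing value");
                            break;
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        String arff = "@relation demo\n@attribute a {x,y}\n@attribute b numeric\n"
                + "@data\nx,1\n?,2\n";
        check(new StringReader(arff)).forEach(System.out::println);
    }
}
```

Rows flagged for missing values can be handled with a filter such as ReplaceMissingValues before the classifier is run.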

Follow these steps as a quick way to get started running VFML using the default Weka data sets:

Open the Weka Explorer.
Open the breast-cancer.arff data file.
Choose Filter > weka/filters/unsupervised/attribute/ReplaceMissingValues then click Apply.
Click on the Classify tab.
Choose Classifier > weka/classifiers/trees/VFDT.
Click Start.

Compilation Instructions

JVFML uses the build tool Maven. We suggest viewing and editing the code using the Eclipse IDE.

Download Eclipse IDE For Java Developers (the version appropriate for your platform).
Download and extract the JVFML source.
Open Eclipse and select File > Import...
Click Browse... and select the vfml-master/weka directory from the extracted source.
Select the class Test3.java and press Ctrl+F11 to run the test code.