This tutorial demonstrates how to use simMachines’ Leliel Angel to predict future outcomes by comparing a queried event against the known outcomes of previous events. If you have a simMachines account and wish to follow the tutorial step by step, the data file can be found at this link.
The tutorial shows Leliel’s classification capabilities in a business environment. Leliel will predict whether an individual’s income is greater than $50,000 based on other information known about them. We will first cover how to create and use a Leliel Angel, then move on to testing and optimizing that Angel.
Create a new folder for this tutorial and upload the data. If you are unfamiliar with this process please see the Platform Navigation Tutorial.
From the “Create Angel” window select Leliel, then choose the folder in which the data is saved.
Once selected you will see the first few rows of the data. Directly above the data is the columns’ header and a column type selection dropdown.
Next we tell Leliel about the structure of our data. This ensures that similarities are identified in the correct way. Leliel accepts the following specifications:
ID: A mandatory field which uniquely identifies each object.
CLASS: A mandatory field which specifies the field to be classified.
REAL: Numerical values.
NOMINAL: Values that do not bear a quantitative relationship with each other (i.e., strings and numbers which represent non-numerical information).
MULTI_PLAIN: Multiple NOMINAL values separated by spaces. Non-language specific.
MULTI_ENGLISH: Multiple NOMINAL values separated by spaces. The text is English language.
MULTI_SPANISH: Multiple NOMINAL values separated by spaces. The text is Spanish language.
MULTI_JAPANESE: Multiple NOMINAL values separated by spaces. The text is Japanese language.
ITEM_SET: A series of values with weights. (Formatted as item1:weight1;item2:weight2;item3:weight3)
IGNORE: The column shall be ignored by the program.
META: This column is for metadata and shall be ignored by the program, but information will be retained in the output.
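The ITEM_SET format described above reads as a semicolon-separated list of item:weight pairs. As a rough illustration of that layout, here is a minimal parsing sketch. The function name `parse_item_set` is hypothetical and not part of the simMachines platform, and the sketch assumes that weights are numeric and that stray spaces or a trailing semicolon should be tolerated.

```python
def parse_item_set(cell: str) -> dict[str, float]:
    """Parse 'item1:weight1;item2:weight2' into {item: weight}.

    Hypothetical helper for illustration; assumes numeric weights.
    """
    weights: dict[str, float] = {}
    for entry in cell.split(";"):
        entry = entry.strip()
        if not entry:  # tolerate a trailing ';' or stray spaces
            continue
        # Split on the last ':' so item names may themselves contain ':'.
        item, sep, weight = entry.rpartition(":")
        if not sep or not item:
            raise ValueError(f"expected 'item:weight', got {entry!r}")
        weights[item] = float(weight)
    return weights


print(parse_item_set("milk:2;bread:0.5; eggs:1"))
```

Checking a column this way before upload can catch malformed cells early.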
Giving Leliel the correct specifications to work with is critical for ensuring that you get ideal responses. Let’s look at a sample of the data and use it to assign the correct specifications.
This file contains data about individuals’ personal, educational, and economic attributes. The table shows the first fifteen lines of the file. (Due to the number of columns you may have to scroll right to see the entire table.)
| IDENTIFICATION | AGE | WORKCLASS | EDUCATION | MARITAL-STATUS | OCCUPATION | RELATIONSHIP | RACE | SEX | HOURS-PER-WEEK | COUNTRY | INCOME |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 50 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | 13 | United-States | <=50K |
| 3 | 38 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | 40 | United-States | <=50K |
| 4 | 53 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 40 | United-States | <=50K |
| 5 | 28 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 40 | Cuba | <=50K |
| 6 | 37 | Private | Masters | Married-civ-spouse | Exec-managerial | Wife | White | Female | 40 | United-States | <=50K |
| 7 | 49 | Private | 9th | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 16 | Jamaica | <=50K |
| 8 | 52 | Self-emp-not-inc | HS-grad | Married-civ-spouse | Exec-managerial | Husband | White | Male | 45 | United-States | >50K |
| 9 | 31 | Private | Masters | Never-married | Prof-specialty | Not-in-family | White | Female | 50 | United-States | >50K |
| 10 | 42 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | 40 | United-States | >50K |
| 11 | 37 | Private | Some-college | Married-civ-spouse | Exec-managerial | Husband | Black | Male | 80 | United-States | >50K |
| 12 | 30 | State-gov | Bachelors | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 40 | India | >50K |
| 13 | 23 | Private | Bachelors | Never-married | Adm-clerical | Own-child | White | Female | 30 | United-States | <=50K |
| 14 | 32 | Private | Assoc-acdm | Never-married | Sales | Not-in-family | Black | Male | 50 | United-States | <=50K |
| 15 | 40 | Private | Assoc-voc | Married-civ-spouse | Craft-repair | Husband | Asian-Pac-Islander | Male | 40 | Jamaica | >50K |
This file has 12 columns. The first column is the unique identification number. The next ten columns specify information about the individual such as their age, relationship status, and type of employment. The final column specifies whether or not they make more than $50,000.
Leliel will be used to determine values for the final column, INCOME. Let’s assign the following specifications:
Identification: Uniquely identifies the individual – ID
Age: Ages are numbers which can be directly compared – REAL
WorkClass: Describes the type of employer the individual has – NOMINAL
Education: Corresponds to the highest level of education obtained by the individual – NOMINAL
Marital-Status: Describes the marital status of the individual – NOMINAL
Occupation: Describes the type of work conducted by the individual – NOMINAL
Relationship: The individual’s relationship to the owner of their residence – NOMINAL
Race: The individual’s race – NOMINAL
Sex: The individual’s sex – NOMINAL
Hours-Per-Week: The number of hours per week worked – REAL
Country: The individual’s place of birth – NOMINAL
Income: This is what we intend to determine with Leliel – CLASS
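Before uploading, it can help to check the assignments above against the file itself. The following is a minimal local sketch, not a simMachines API. It assumes the data file is comma-separated with a header row, and that exactly one ID column and one CLASS column are expected (the source only says both are mandatory). The helper name `check_assignments` is hypothetical.

```python
import csv
import io

# Column types as assigned in this tutorial.
COLUMN_TYPES = {
    "IDENTIFICATION": "ID", "AGE": "REAL", "WORKCLASS": "NOMINAL",
    "EDUCATION": "NOMINAL", "MARITAL-STATUS": "NOMINAL",
    "OCCUPATION": "NOMINAL", "RELATIONSHIP": "NOMINAL", "RACE": "NOMINAL",
    "SEX": "NOMINAL", "HOURS-PER-WEEK": "REAL", "COUNTRY": "NOMINAL",
    "INCOME": "CLASS",
}

# Two rows from the sample table, written as CSV (assumed file layout).
SAMPLE = (
    "IDENTIFICATION,AGE,WORKCLASS,EDUCATION,MARITAL-STATUS,OCCUPATION,"
    "RELATIONSHIP,RACE,SEX,HOURS-PER-WEEK,COUNTRY,INCOME\n"
    "2,50,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,"
    "Husband,White,Male,13,United-States,<=50K\n"
    "3,38,Private,HS-grad,Divorced,Handlers-cleaners,"
    "Not-in-family,White,Male,40,United-States,<=50K\n"
)


def check_assignments(csv_text: str, column_types: dict[str, str]) -> int:
    """Validate the type assignments against the data; return rows checked."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    unassigned = [c for c in header if c not in column_types]
    if unassigned:
        raise ValueError(f"no type assigned to: {unassigned}")
    for required in ("ID", "CLASS"):
        matches = [c for c in header if column_types[c] == required]
        if len(matches) != 1:  # assumption: exactly one of each
            raise ValueError(f"expected one {required} column, found {matches}")
    seen_ids: set[str] = set()
    rows = 0
    for row in reader:
        for column in header:
            kind = column_types[column]
            if kind == "REAL":
                float(row[column])  # raises ValueError if not numeric
            elif kind == "ID":
                if row[column] in seen_ids:
                    raise ValueError(f"duplicate ID: {row[column]}")
                seen_ids.add(row[column])
        rows += 1
    return rows


print(check_assignments(SAMPLE, COLUMN_TYPES), "rows checked")
```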
Angel Specification and Creation
Angels also accept specifications, which can modify their behavior and access to server resources. Setting the correct specifications can significantly impact your results. For now we will use the default parameters, but later we will go over using Grids to identify the ideal specifications.
T(op columns): The number of columns to consider. Note that columns with strings, such as MULTI_ENGLISH, can be divided into multiple columns for this purpose.
L(ength): The total number of classes to consider.
Energy Weight: Useful if one classification is expected to be a significantly larger proportion of the results. Accepts boolean values. (Default TRUE)
Dense Mode: Changes how the various weights are considered. Accepts DEFAULT, SMART, MARQ3 or EXCEEDS.
Bins: Specifies the number of ranges to be used in calculating the similarity of REAL columns.
Parallelism: Specifies the number of servers redundantly running the Angel. (Default 2)
Ramiel K: Specifies the number of results for the nearest neighbor search. (Default 10)
Pivots: The number of primary search points in the engine. (Range 256 to 1024, default 256)
Probability: Minimum accepted probability for the results; any result with a lower probability will be discarded. (Range 0 to 1, default .95)
Accepted Error: Maximum accepted difference in distance between returned objects and the query object. (Minimum 1, default 1.2)
CK: Classifier K, the number of nearest neighbors used in making the classification.
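The ranges and defaults listed above can be captured in a small local check before typing values into the form. This is only a sketch: the key names (`pivots`, `ramiel_k`, and so on) and the function `validate_spec` are hypothetical and do not correspond to a documented simMachines API. Only the defaults and ranges stated in the list are encoded.

```python
# Allowed values and defaults taken from the specification list above.
DENSE_MODES = {"DEFAULT", "SMART", "MARQ3", "EXCEEDS"}
DOCUMENTED_DEFAULTS = {
    "energy_weight": True,
    "parallelism": 2,
    "ramiel_k": 10,
    "pivots": 256,
    "probability": 0.95,
    "accepted_error": 1.2,
}


def validate_spec(spec: dict) -> list[str]:
    """Return a list of problems; an empty list means the spec looks valid."""
    merged = {**DOCUMENTED_DEFAULTS, **spec}
    problems = []
    if not isinstance(merged["energy_weight"], bool):
        problems.append("energy_weight must be a boolean")
    if "dense_mode" in merged and merged["dense_mode"] not in DENSE_MODES:
        problems.append(f"dense_mode must be one of {sorted(DENSE_MODES)}")
    if not 256 <= merged["pivots"] <= 1024:
        problems.append("pivots must be between 256 and 1024")
    if not 0 <= merged["probability"] <= 1:
        problems.append("probability must be between 0 and 1")
    if merged["accepted_error"] < 1:
        problems.append("accepted_error must be at least 1")
    return problems


print(validate_spec({"pivots": 512, "dense_mode": "SMART"}))
print(validate_spec({"pivots": 128, "probability": 1.5}))
```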
Once all desired specifications have been entered and you have given your Angel a name, click “Create”.
You will be taken to the “Your Angels” page where you can see the status of the Angels that you have created. Once your Angel’s status is “running” it is ready to answer queries. Depending on the file size it may take a few seconds to a few minutes for an Angel to complete initialization.
Querying Your Angel
You now have a functional Leliel Angel capable of answering queries. Let’s try it out!
When querying a Leliel Angel you have two options, Execute Query and Batch Query. Both are accessed from the “Angel Actions” dropdown on the “Your Angels” screen.
You can now submit queries to identify the nearest neighbor of any given object. You can query a new object by entering its values into the available fields, or select an existing object by selecting the folder with your data in the “Choose your folder” dropdown and clicking the “Fill Query Fields” button next to the object.
Let’s generate a query and see Leliel in action!
Populate the fields with the object whose nearest neighbors you want to find and click “Execute Query”.
The results will immediately appear in the area above the “Execute Query” button.
The first value is the predicted classification.
The second value is the probability that the specified class is correct amongst the returned classes. If there are multiple likely classes Leliel will show each and present their respective probabilities.
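One simple way to picture a probability "amongst the returned classes" is as the share of nearest neighbors carrying each class. The sketch below illustrates that idea only. The source does not describe how Leliel actually computes its probabilities, so the plain vote share and the function name `class_probabilities` are assumptions.

```python
from collections import Counter


def class_probabilities(neighbor_labels: list[str]) -> list[tuple[str, float]]:
    """Return (class, vote share) pairs, most likely class first."""
    counts = Counter(neighbor_labels)
    total = sum(counts.values())
    shares = [(label, n / total) for label, n in counts.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Ten neighbors, seven of which are labelled <=50K.
labels = ["<=50K"] * 7 + [">50K"] * 3
for label, share in class_probabilities(labels):
    print(f"{label}: {share:.2f}")
```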
Leliel doesn’t just provide results in the form of a prediction. It also shows you the basis for that prediction by displaying the nearest neighbors to your query and the dominant factors used in the prediction.
This information is presented in the tabs in the Query Results screen.
You can see the underlying factors of this prediction in the Justification screen. Leliel presents the factors that were most important to determining the classification. You can see the factors of other possible classifications by selecting from the dropdown menu.
In the Hypothesis tab you can see the strength of the factors for each possible classification. This information is presented as both a Radar and Bar graph.
In the Neighbors tab you can see the objects that were identified as the nearest neighbors to the query and how they compare. In this case we see that there was an exact match.
Testing and Optimizing Leliel
So how accurate is the Angel that we made? And how can we optimize an Angel to get the most accurate results? To answer these questions we use Fold Experiments and Grids.
Fold Experiments are a method of testing the accuracy of your Angel. They test portions of your data and report the Angel’s accuracy at various confidence levels. Let’s run one on our Angel to see how accurate it is.
To create a Fold Experiment for an Angel that has already been created, go to the “Your Angels” page, and under “Angel Actions” select “Fold Experiment”.
You will be brought to the Fold Experiments folder for the Angel. A default 10 fold experiment is sufficient for our purposes. Name your experiment and click “Execute Experiment”.
Fold experiments can take a few minutes to run as your data is run through multiple times. You can track the status of the experiment by selecting “View” under the “Fold Progress” column.
When your Fold Experiment is complete you can download the results in csv format by clicking “Download Results” or you can view them on the simMachines platform by clicking “Visualization”.
Let’s look at the visualization to understand what the Fold Experiment is telling us.
The Fold Experiment works by stripping the information from the column to be classified for a portion of your data and having your Leliel Angel attempt to classify that data. The experiment then compares Leliel’s results to the known classifications and determines the accuracy.
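The procedure described here matches the general idea of k-fold cross-validation. The sketch below shows that general idea under stated assumptions and is not simMachines’ implementation. The function names are hypothetical, rows are shuffled with a fixed seed, and a trivial majority-class predictor stands in for the Angel.

```python
import random
from collections import Counter


def fold_indices(n_rows: int, n_folds: int = 10, seed: int = 0) -> list[list[int]]:
    """Shuffle row indices and deal them into n_folds disjoint folds."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    return [order[i::n_folds] for i in range(n_folds)]


def fold_experiment(rows, labels, train_and_predict, n_folds=10, seed=0) -> float:
    """Hide each fold's labels in turn, predict them, and return accuracy."""
    correct = 0
    for held_out in fold_indices(len(rows), n_folds, seed):
        held = set(held_out)
        # Training data: every row outside the held-out fold, with its label.
        train = [(rows[i], labels[i]) for i in range(len(rows)) if i not in held]
        for i in held_out:
            predicted = train_and_predict(train, rows[i])  # label stays hidden
            correct += predicted == labels[i]
    return correct / len(rows)


def majority_baseline(train, _query):
    """Stand-in classifier: always predict the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


rows = list(range(10))
labels = ["<=50K"] * 8 + [">50K"] * 2
print(fold_experiment(rows, labels, majority_baseline, n_folds=5))
```

Every row is held out exactly once, so the reported accuracy covers the entire file.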
The first section is a graph detailing the accuracy of the Angel at various confidence levels.
P90 is the accuracy on queries for which Leliel is at least 90% confident, P80 is the accuracy on queries for which it is at least 80% confident, and so on down to P0, which is Leliel’s accuracy for the entire file regardless of confidence level.
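To make the P-levels concrete, the sketch below computes accuracy and coverage at each threshold from a list of (confidence, was the prediction correct) pairs. It assumes each level includes every query at or above that confidence. The function name and the four example results are made up for illustration.

```python
def accuracy_at_levels(results, levels=(90, 80, 70, 60, 50, 40, 30, 20, 10, 0)):
    """Map 'P<level>' to (accuracy, number of queries covered).

    results: iterable of (confidence in [0, 1], prediction_was_correct).
    Accuracy is None when no query reaches the level.
    """
    results = list(results)
    report = {}
    for level in levels:
        covered = [ok for confidence, ok in results if confidence >= level / 100]
        accuracy = sum(covered) / len(covered) if covered else None
        report[f"P{level}"] = (accuracy, len(covered))
    return report


# Illustrative results only, not output from a real Fold Experiment.
example = [(0.95, True), (0.92, False), (0.85, True), (0.40, False)]
for name, (accuracy, covered) in accuracy_at_levels(example).items():
    print(name, accuracy, covered)
```

Reporting the covered count alongside each accuracy shows how many queries a high-confidence figure actually describes.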
We can see that at P0 our Angel is around 78% accurate, while at P90 it is around 88% accurate.
The confusion matrix tells us what type of mistakes Leliel was making. We can see that at P0 Leliel correctly classified 90% of the <=50K objects as <=50K, but incorrectly classified 45% of >50K as <=50K.
At P90 this improved to incorrectly classifying less than 25% of the >50K objects as <=50K. And Leliel achieved 93% accuracy on the <=50K objects!
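The percentages in a confusion matrix like this one are row-normalized: each cell is the share of objects with a given actual class that received a given predicted class. A minimal sketch of that calculation follows, using a hypothetical function name and made-up labels chosen only to echo the 90% and 45% figures above.

```python
from collections import Counter


def confusion_rates(actual, predicted):
    """Return {(actual, predicted): share of that actual class}."""
    pair_counts = Counter(zip(actual, predicted))
    class_totals = Counter(actual)
    return {
        (a, p): count / class_totals[a]
        for (a, p), count in pair_counts.items()
    }


# Made-up labels: 10 objects are actually <=50K, 20 are actually >50K.
actual = ["<=50K"] * 10 + [">50K"] * 20
predicted = ["<=50K"] * 9 + [">50K"] * 1 + ["<=50K"] * 9 + [">50K"] * 11

for (a, p), rate in sorted(confusion_rates(actual, predicted).items()):
    print(f"actual {a:>5} predicted {p:>5}: {rate:.0%}")
```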
At the bottom of the page you can see how the Angel classified each object in each fold.
Accuracy of 78% at P0 is pretty good, but I bet with a little tweaking of our Angel Specifications we could do better. To find the ideal specifications let’s make a Grid.
Creating a Grid
Grids are multiple Fold Experiments, each run with a varied set of Angel Specifications. You can use a Grid to find the ideal Angel Specifications for your classifier.
Making a Grid is as easy as making a new Leliel Angel; we simply change one step. Go through the steps to create a new Leliel Angel, choose the folder with your data, and assign the column types as normal. Now, to make this a Grid rather than a typical Angel, click “Create Grid” at the top of the Angel Specification section.
The different Angel Specifications that the Grid will test are denoted with semicolons between the values. You can adjust these to fine-tune your Grid. Be aware that the more values you list, the longer your Grid will take to run. For now, let’s tell the Grid that we want to stick with a Lengths List of 2, since we know that we want two classifications, but set CK to 1;3;5 to see if having access to more neighbors helps the results.
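A Grid of semicolon-separated values can be thought of as the Cartesian product of the options for each specification, which is why run time grows quickly as values are added. The sketch below illustrates that counting. It is not how the platform is known to work internally, and `expand_grid` and the dictionary keys are hypothetical names.

```python
from itertools import product


def expand_grid(spec: dict[str, str]) -> list[dict[str, str]]:
    """Expand {'CK': '1;3;5', ...} into one settings dict per combination."""
    names = list(spec)
    options = [
        [value.strip() for value in spec[name].split(";") if value.strip()]
        for name in names
    ]
    return [dict(zip(names, combo)) for combo in product(*options)]


grid = expand_grid({"L": "2", "CK": "1;3;5"})
print(len(grid), "combinations")
for settings in grid:
    print(settings)
```

Adding a second specification with three values would triple the number of combinations, and so on for each further list.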
Name your Grid and confirm the specifications, then click “Create”.
You can see the status of your Grid on the “Your Angels” page. A Grid can take some time to complete since it is creating hundreds of Fold Experiments.
A more detailed view of the status of your Grid can be seen by selecting “Grid Results” from the Angel Action dropdown.
The Grid is a collection of Fold Experiments. You can see the status of each Fold Experiment being run, and can see a detailed status of the experiment by selecting “View” under the “Fold Progress” column.
Once your Grid is complete we can use it to identify the most accurate Angel Specifications. This Grid ran 108 Fold Experiments, each with different settings. You can see the settings used in each experiment by scrolling right.
Columns can be sorted by clicking on the up/down arrows next to the column name. Let’s sort by Mean P0 score.
Grid 77 has an 81% P0 accuracy, a nice step up from the 78% we had before. And it has a 90% P90.
Grids provide breakdowns of the mean and minimum accuracy for each section of the Fold Experiment, along with information on how many queries were covered by that experiment. For Grid 77 we can see that 1951 of the 3255 items were covered by P90.
Deciding which test provides the optimal results may not always be straightforward. For example, you may find that the most accurate P90 doesn’t include as many results as slightly less accurate versions.
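One way to handle this trade-off is to set a minimum coverage first and then rank the remaining candidates by accuracy. The sketch below shows that approach. Only the Grid 77 figures come from this tutorial. The other two rows are invented purely to illustrate the trade-off, and the function and field names are hypothetical.

```python
def rank_candidates(candidates, min_coverage):
    """Keep candidates whose P90 coverage meets the minimum, best P90 first."""
    eligible = [
        c for c in candidates
        if c["covered"] / c["total"] >= min_coverage
    ]
    return sorted(eligible, key=lambda c: c["p90"], reverse=True)


candidates = [
    # From the tutorial: 90% P90 accuracy, 1951 of 3255 items covered.
    {"name": "Grid 77", "p90": 0.90, "covered": 1951, "total": 3255},
    # Invented rows for illustration only.
    {"name": "hypothetical A", "p90": 0.93, "covered": 900, "total": 3255},
    {"name": "hypothetical B", "p90": 0.88, "covered": 2400, "total": 3255},
]

for c in rank_candidates(candidates, min_coverage=0.5):
    print(c["name"], c["p90"], f"{c['covered'] / c['total']:.0%} covered")
```

With a 50% coverage floor, the most accurate but least covered candidate drops out.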
Once you decide which Angel provides the best results, simply click the “Create Angel” button in the leftmost column and you will be brought to the “Create Angel” page with the Angel Specifications already filled out.
Leliel also accepts batch queries. A Leliel batch query takes a csv file containing unclassified objects and returns a csv file with those objects classified in the same result, score, confidence, class probability format as a standard query.
To run a batch query, select “Batch Query” from the “Angel Actions” dropdown next to the Angel on the “Your Angels” page.
Select the folder containing the file you wish to run a batch query on, name your output file, and specify an output folder. Then select “Execute Query”.
The status of your batch execution is visible at the bottom of the page. When the status is marked “Completed”, click the “Download Results” button to download the results.
Leliel is simMachines’ classification Angel; it assigns tags or labels to an object based on its characteristics. Built on the Ramiel k-nearest neighbor engine, Leliel predicts the value of an unknown vector in a queried object by identifying the most similar objects with known values for that vector. In addition to its unmatched performance, Leliel provides powerful optimization and accountability tools unseen on other platforms.
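To illustrate the general idea of finding the most similar known objects, here is a minimal nearest-neighbor search over mixed REAL and NOMINAL columns. It is a generic sketch, not the Ramiel engine. The distance function (range-scaled difference for REAL columns, a 0/1 mismatch for NOMINAL columns) and all function names are assumptions. The rows are a three-column subset of five rows from the sample table.

```python
def real_ranges(known, real_columns):
    """Compute (min, max) for each REAL column from the known objects."""
    return {
        col: (min(obj[col] for obj, _ in known), max(obj[col] for obj, _ in known))
        for col in real_columns
    }


def distance(a, b, ranges):
    """Sum of per-column differences; assumed metric for illustration."""
    total = 0.0
    for column, value in a.items():
        if column in ranges:  # REAL: absolute difference scaled by the range
            low, high = ranges[column]
            total += abs(value - b[column]) / (high - low) if high > low else 0.0
        else:  # NOMINAL: 0 if equal, 1 otherwise
            total += 0.0 if value == b[column] else 1.0
    return total


def nearest_neighbors(query, known, ranges, k=3):
    """Return the k (object, label) pairs closest to the query."""
    return sorted(known, key=lambda item: distance(query, item[0], ranges))[:k]


known = [
    ({"AGE": 50, "EDUCATION": "Bachelors", "HOURS-PER-WEEK": 13}, "<=50K"),
    ({"AGE": 38, "EDUCATION": "HS-grad", "HOURS-PER-WEEK": 40}, "<=50K"),
    ({"AGE": 52, "EDUCATION": "HS-grad", "HOURS-PER-WEEK": 45}, ">50K"),
    ({"AGE": 31, "EDUCATION": "Masters", "HOURS-PER-WEEK": 50}, ">50K"),
    ({"AGE": 42, "EDUCATION": "Bachelors", "HOURS-PER-WEEK": 40}, ">50K"),
]
ranges = real_ranges(known, ["AGE", "HOURS-PER-WEEK"])
query = {"AGE": 42, "EDUCATION": "Bachelors", "HOURS-PER-WEEK": 40}

for obj, label in nearest_neighbors(query, known, ranges, k=3):
    print(label, obj)
```

The labels of the returned neighbors are what a classifier would then tally to reach a prediction.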