This tutorial demonstrates how to use simMachines’ Leliel Angel to predict future outcomes by comparing a queried event against the known outcomes of previous events. If you have a simMachines account and wish to follow the tutorial step by step, the data file can be found at this link.
Here, Leliel’s classification capabilities are applied in a retail setting: Leliel will predict whether a customer is likely to return a product that they purchased. Such information can play a critical role in improving supply chain efficiency and in targeting sales efforts.
Create a new folder for this tutorial and upload the data. If you are unfamiliar with this process, please see the Platform Navigation Tutorial.
From the “Create Angel” window select Leliel, then choose the folder in which the data is saved.
Once the folder is selected, you will see the first few rows of the data. Directly above the data are the column headers, each with a column-type selection dropdown.
Next we tell Leliel about the structure of our data. This ensures that similarities are identified in the correct way. Leliel accepts the following specifications:
ID: A mandatory field which uniquely identifies each object.
CLASS: A mandatory field which specifies the field to be classified.
REAL: Numerical values.
NOMINAL: Values that do not bear a quantitative relationship with each other (i.e., strings and numbers which represent non-numerical information).
MULTI_PLAIN: Multiple NOMINAL values separated by spaces. Non-language specific.
MULTI_ENGLISH: Multiple NOMINAL values separated by spaces. The text is English language.
MULTI_SPANISH: Multiple NOMINAL values separated by spaces. The text is Spanish language.
ITEM_SET: A series of values with weights. (Formatted as item1:weight1;item2:weight2;item3:weight3)
IGNORE: The column shall be ignored by the program.
META: This column is for metadata and shall be ignored by the program, but information will be retained in the output.
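To make the ITEM_SET format concrete, here is a minimal Python sketch that parses a cell written as item1:weight1;item2:weight2 into a dictionary. It only illustrates the text format described above; the function name parse_item_set and the sample items are hypothetical, and this is not how Leliel itself reads the column.

```python
def parse_item_set(cell: str) -> dict[str, float]:
    """Parse 'item1:weight1;item2:weight2' into {item: weight}."""
    weights: dict[str, float] = {}
    for pair in cell.split(";"):
        pair = pair.strip()
        if not pair:
            # Tolerate a trailing ';' or stray whitespace between pairs.
            continue
        item, sep, weight = pair.rpartition(":")
        if not sep or not item:
            raise ValueError(f"expected 'item:weight', got {pair!r}")
        weights[item.strip()] = float(weight)
    return weights


# Hypothetical sample cell, not taken from the tutorial data file.
example = parse_item_set("shoes:0.7;belt:0.2; hat:0.1")
print(example)
```

Splitting on the last colon (rpartition) means an item name that itself contains a colon would still parse, which is a design choice of this sketch rather than a documented rule of the format.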
Giving Leliel the correct specifications to work with is critical for ensuring that you get ideal responses. Let’s look at a sample of the data and use it to assign the correct specifications.
This file contains data about customers, their purchases, and whether or not each purchase was returned. The table shows the first fifteen records of the file. (Due to the number of columns you may have to scroll right to see the entire file.)
| orderItemID | orderDate | deliveryDate | itemID | size | color | manufacturerID | price | customerID | salutation | dateOfBirth | state | creationDate | returnShipment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2012-04-01 | 2012-04-03 | 186 | m | denim | 25 | 69.90 | 794 | Mrs | 1965-01-06 | Baden-Wuerttemberg | 2011-04-25 | 0 |
| 2 | 2012-04-01 | 2012-04-03 | 71 | 9+ | ocher | 21 | 69.95 | 794 | Mrs | 1965-01-06 | Baden-Wuerttemberg | 2011-04-25 | 1 |
| 3 | 2012-04-01 | 2012-04-03 | 71 | 9+ | curry | 21 | 69.95 | 794 | Mrs | 1965-01-06 | Baden-Wuerttemberg | 2011-04-25 | 1 |
| 4 | 2012-04-02 | ? | 22 | m | green | 14 | 39.90 | 808 | Mrs | 1959-11-09 | Saxony | 2012-01-04 | 0 |
| 5 | 2012-04-02 | 1990-12-31 | 151 | 39 | black | 53 | 29.90 | 825 | Mrs | 1964-07-11 | Rhineland-Palatinate | 2011-02-16 | 0 |
| 6 | 2012-04-02 | 1990-12-31 | 598 | xxl | brown | 87 | 89.90 | 825 | Mrs | 1964-07-11 | Rhineland-Palatinate | 2011-02-16 | 0 |
| 7 | 2012-04-02 | 1990-12-31 | 15 | 39 | black | 1 | 129.90 | 825 | Mrs | 1964-07-11 | Rhineland-Palatinate | 2011-02-16 | 0 |
| 8 | 2012-04-02 | 2012-04-03 | 32 | xxl | brown | 3 | 21.90 | 850 | Mrs | 1948-04-08 | North Rhine-Westphalia | 2011-02-16 | 1 |
| 9 | 2012-04-02 | 2012-04-03 | 32 | xxl | red | 3 | 21.90 | 850 | Mrs | 1948-04-08 | North Rhine-Westphalia | 2011-02-16 | 1 |
| 10 | 2012-04-02 | 2012-04-03 | 57 | xxl | green | 3 | 39.90 | 850 | Mrs | 1948-04-08 | North Rhine-Westphalia | 2011-02-16 | 1 |
| 11 | 2012-04-02 | 2012-04-03 | 2 | xxl | mocca | 2 | 39.90 | 850 | Mrs | 1948-04-08 | North Rhine-Westphalia | 2011-02-16 | 1 |
| 12 | 2012-04-02 | 2012-04-03 | 259 | 39 | black | 1 | 119.90 | 850 | Mrs | 1948-04-08 | North Rhine-Westphalia | 2011-02-16 | 1 |
| 13 | 2012-04-02 | 2012-04-03 | 603 | 39 | black | 55 | 169.90 | 850 | Mrs | 1948-04-08 | North Rhine-Westphalia | 2011-02-16 | 1 |
| 14 | 2012-04-02 | 2012-04-10 | 259 | 39 | ocher | 1 | 119.90 | 850 | Mrs | 1948-04-08 | North Rhine-Westphalia | 2011-02-16 | 1 |
| 15 | 2012-04-02 | 2012-04-03 | 165 | 37 | mocca | 47 | 89.90 | 858 | Mrs | ? | Berlin | 2012-03-29 | 1 |
This file has fourteen columns. The first column is the unique order ID. The next seven columns specify information about a product that has been bought. The next five columns show information about the customer who made the purchase. The last column specifies if the product was returned or not, where 1 means “returned” and 0 means “not returned”.
Leliel will be used to determine values for column 14. Let’s assign the following specifications:
orderItemID: Uniquely identifies the order – ID
orderDate: For now let’s assume that the date the order was placed does not impact the chance of it being returned – IGNORE
deliveryDate: Corresponds to orderDate – IGNORE
itemID: A number that identifies the item purchased – NOMINAL
size: A code (a number or letters, such as 39 or xxl) that corresponds to the item’s size – NOMINAL
color: A string which describes the item’s color – NOMINAL
manufacturerID: A number that identifies the item’s manufacturer – NOMINAL
price: A number which measures the cost of the item – REAL
customerID: A number that identifies the customer – NOMINAL
salutation: A string which corresponds to the customer’s gender – NOMINAL
dateOfBirth: For now let’s assume that the age of the customer does not significantly impact the chance that they will return an item – IGNORE
state: The name of the geographical location in which the transaction occurred – NOMINAL
creationDate: Corresponds to the date on which the record was produced – IGNORE
returnShipment: Whether or not the item was returned. This is what we intend to determine with Leliel – CLASS
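The assignments above can also be written down as a plain mapping, which makes it easy to double-check them before clicking through the dropdowns. The Python sketch below is only a bookkeeping aid, not a simMachines file format; the names COLUMN_SPECS, check_specs, and active_features are hypothetical, and the "exactly one ID and one CLASS" check is an assumption based on both fields being described as mandatory.

```python
# Column name -> specification, copied from the list above.
COLUMN_SPECS = {
    "orderItemID": "ID",
    "orderDate": "IGNORE",
    "deliveryDate": "IGNORE",
    "itemID": "NOMINAL",
    "size": "NOMINAL",
    "color": "NOMINAL",
    "manufacturerID": "NOMINAL",
    "price": "REAL",
    "customerID": "NOMINAL",
    "salutation": "NOMINAL",
    "dateOfBirth": "IGNORE",
    "state": "NOMINAL",
    "creationDate": "IGNORE",
    "returnShipment": "CLASS",
}


def check_specs(specs: dict[str, str]) -> None:
    """Raise ValueError unless exactly one ID and one CLASS column exist.

    Assumption: 'mandatory' is read here as 'exactly one of each'.
    """
    kinds = list(specs.values())
    for required in ("ID", "CLASS"):
        if kinds.count(required) != 1:
            raise ValueError(f"expected exactly one {required} column")


def active_features(specs: dict[str, str]) -> list[str]:
    """Return the columns that would take part in the similarity search."""
    skipped = {"ID", "CLASS", "IGNORE", "META"}
    return [name for name, kind in specs.items() if kind not in skipped]


check_specs(COLUMN_SPECS)
print(active_features(COLUMN_SPECS))
```

With the assignments in this tutorial, eight of the fourteen columns remain as features: seven NOMINAL columns and the REAL price column.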
Angel Specification and Creation
Angels also accept specifications, which can modify their behavior and access to server resources. Setting the correct specifications can significantly impact your results. For now we will use the default parameters, but later we will go over using Grids to identify the ideal specifications.
T(op columns): The number of columns to consider. Note that columns with strings, such as MULTI_ENGLISH, can be divided into multiple columns for this purpose.
L(ength): The total number of classes to consider.
Energy Weight: Useful if one classification is expected to make up a significantly larger proportion of the results. Accepts boolean values. (Default TRUE)
Dense Mode: Affects how the various weights are considered. Accepts DEFAULT or EXCEEDS.
Bins: Specifies the number of ranges to be used in organizing real numbers.
Parallelism: Specifies the number of servers redundantly running the Angel. (Default 2)
Ramiel K: Specifies the number of results for the nearest neighbor search. (Default 10)
Pivots: The number of primary search points in the engine. (Range 256 to 1024, default 256)
Probability: Minimum accepted probability for the results; any result with a lower probability will be discarded. (Range 0 to 1, default 0.95)
Accepted Error: Maximum accepted difference in distance between returned objects and the query object. (Minimum 1, default 1.2)
CK: Classifier K, the number of nearest neighbors used in making the classification.
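As a quick reference, the documented defaults and ranges can be collected in one place and sanity-checked before filling in the form. This Python sketch covers only the parameters whose defaults or ranges are stated above; the class AngelParams and its field names are hypothetical and do not correspond to any simMachines API.

```python
from dataclasses import dataclass


@dataclass
class AngelParams:
    """Documented defaults only; field names are illustrative."""
    energy_weight: bool = True
    parallelism: int = 2
    ramiel_k: int = 10
    pivots: int = 256
    probability: float = 0.95
    accepted_error: float = 1.2

    def problems(self) -> list[str]:
        """Return a message for each value outside its documented range."""
        issues = []
        if not 256 <= self.pivots <= 1024:
            issues.append("Pivots must be between 256 and 1024")
        if not 0.0 <= self.probability <= 1.0:
            issues.append("Probability must be between 0 and 1")
        if self.accepted_error < 1.0:
            issues.append("Accepted Error must be at least 1")
        return issues


print(AngelParams().problems())            # defaults pass every check
print(AngelParams(pivots=100).problems())  # one range violation reported
```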
Once all desired specifications have been entered and you have given your Angel a name, click “Create”.
You will be taken to the “Your Angels” page where you can see the status of the Angels that you have created. Once your Angel’s status is “running” it is ready to answer queries. Depending on the file size it may take a few seconds to a few minutes for an Angel to complete initialization.
Querying Your Angel
You now have a functional Leliel Angel capable of answering queries. Let’s try it out!
When querying a Leliel Angel you have two options, Execute Query and Batch Query. Both are accessed from the “Angel Actions” dropdown on the “Your Angels” screen.
You can now submit queries to identify the nearest neighbor of any given object. You can query a new object by entering its values into the available fields, or select an existing object by selecting the folder with your data in the “Choose your folder” dropdown and clicking the “Fill Query Fields” button next to the object.
Let’s generate a query and see Leliel in action!
Populate the fields with the object whose nearest neighbors you want to find and click “Execute Query”.
The results will immediately appear in the area above the “Execute Query” button.
The first value is the classification.
The second value is the “energy” score of the classification, on a scale from 0 to 2, with higher being better. This is a measure of the distance between the target object and the identified nearest neighbors.
The third value is Leliel’s confidence in the result, again on a scale from 0 to 2. This is based on how the identified nearest neighbors are classified.
The final value is the probability that the specified class is correct amongst the returned classes. (If there are multiple likely classes Leliel will suggest each and present their respective probabilities.)
Leliel doesn’t just provide results in the form of raw numbers; it also shows you how those numbers are generated by displaying the nearest neighbors to your query.
This information is presented just below the results section. To use this function, first select which classification you want to see. In our example we only have one classification available since Leliel was so confident in the results.
Leliel then presents the columns that were most important to determining the classification.
You can then select from the objects which were identified as the nearest neighbors to the query and see how they compare.
Testing and Optimizing Leliel
So how accurate is the Angel that we made? And how can we optimize an Angel to get the most accurate results? To answer these questions we use Fold Experiments and Grids.
Fold Experiments are a method of testing the accuracy of your Angel. They test portions of your data and report the Angel’s accuracy at various confidence levels.
Grids are multiple Fold Experiments, each run with a different set of Angel Specifications. You can use a Grid to find the ideal Angel Specifications for your classifier.
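The idea behind Fold Experiments and Grids resembles ordinary k-fold cross-validation repeated over a set of parameter values. The Python sketch below illustrates that general technique on made-up one-dimensional data with a toy nearest-neighbor classifier; it is an assumption about the concept, not a description of how the platform runs its experiments, and all names and data are hypothetical.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def fold_accuracy(data, k, folds=5):
    """Hold out each fold in turn, train on the rest, and score the held-out part."""
    correct = 0
    for i in range(folds):
        test = data[i::folds]
        train = [row for j, row in enumerate(data) if j % folds != i]
        correct += sum(knn_predict(train, x, k) == label for x, label in test)
    return correct / len(data)


# Synthetic data: the label is 1 when x is 5 or more.
data = [(x, int(x >= 5)) for x in range(10)]

# A small "grid": one fold experiment per candidate value of k.
grid = {k: fold_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(grid, key=grid.get)
print(grid, best_k)
```

Every object is held out exactly once, so the reported accuracy reflects predictions on data the classifier did not see while "training".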
See the Grids And Fold Experiments Tutorial to learn how to apply these features to your Leliel Angel.
Leliel also accepts batch queries. A Leliel batch query takes a CSV file containing unclassified objects and returns a CSV file in which those objects are classified, using the same result, score, confidence, and class probability format as a standard query.
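If your unclassified objects come from records that still carry a class column, you may want to strip it before uploading. The Python sketch below does this with the standard csv module. It assumes that the batch file should simply omit the class column, which the tutorial does not state explicitly; the function name to_batch_csv and the sample rows are hypothetical.

```python
import csv
import io


def to_batch_csv(rows, class_column="returnShipment"):
    """Serialize rows to CSV text, leaving out the class column.

    Assumption: an 'unclassified object' is a row without the class field.
    """
    fieldnames = [name for name in rows[0] if name != class_column]
    buffer = io.StringIO()
    # extrasaction="ignore" silently drops keys not listed in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Two illustrative rows using a subset of the tutorial's columns.
rows = [
    {"orderItemID": 1, "itemID": 186, "price": 69.90, "returnShipment": 0},
    {"orderItemID": 2, "itemID": 71, "price": 69.95, "returnShipment": 1},
]
print(to_batch_csv(rows))
```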
To run a batch query, select “Batch Query” from the “Angel Actions” dropdown next to the Angel on the “Your Angels” page.
Select the folder containing the file you wish to run a batch query on, name your output file, and specify an output folder. Then select “Execute Query”.
The status of your batch execution is visible at the bottom of the page. When the status is marked “Completed”, click the “Download Results” button to download the results.
Leliel is simMachines’ classification Angel; it assigns tags or labels to an object based on its characteristics. Built on the Ramiel k-nearest neighbor engine, Leliel predicts the value for an unknown vector in a queried object by identifying the most similar objects with known values for that vector. In addition to its unmatched performance, Leliel provides powerful optimization and accountability tools unseen on other platforms.
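For readers new to nearest-neighbor classification, the following Python sketch shows the general idea: find the most similar known objects and let them vote. The distance function (a 0/1 mismatch for nominal fields plus a scaled difference for price) and all names are assumptions made for illustration; the Ramiel engine's actual similarity measure is not described in this tutorial. The sample objects reuse a few field values from the data table.

```python
from collections import Counter


def distance(a, b, real_fields=("price",), real_scale=100.0):
    """Illustrative mixed distance: mismatch count plus scaled numeric gap."""
    total = 0.0
    for field, value in a.items():
        if field in real_fields:
            total += abs(value - b[field]) / real_scale
        else:
            total += 0.0 if value == b[field] else 1.0
    return total


def classify(query, labelled, k=3):
    """Return each class's share of the votes among the k nearest objects."""
    nearest = sorted(labelled, key=lambda pair: distance(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in votes.items()}


known = [
    ({"size": "xxl", "color": "brown", "price": 21.90}, 1),
    ({"size": "xxl", "color": "red", "price": 21.90}, 1),
    ({"size": "m", "color": "denim", "price": 69.90}, 0),
    ({"size": "39", "color": "black", "price": 29.90}, 0),
]
query = {"size": "xxl", "color": "green", "price": 39.90}
result = classify(query, known, k=3)
print(result)
```

In this toy run, two of the three nearest neighbors are returned items, so class 1 receives two thirds of the votes.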