This tutorial demonstrates how to use simMachines’ Ramiel Angel for similarity search on the simMachines platform. If you have a simMachines account and wish to follow the tutorial step by step, the data file can be found at this link.
We are going to use Ramiel to look at how a bank’s customers responded to various marketing campaigns. Ramiel will use information about the customers and marketing campaigns to identify similar circumstances and results. This knowledge could then be used to improve future marketing efforts or guide employee / customer interactions.
Create a new folder for this tutorial and upload the data. If you are unfamiliar with this process please see the Platform Navigation Tutorial.
From the “Create Angel” window select Ramiel, then choose the folder in which the data is saved.
Once the folder is selected, you will see the first few rows of the data. Directly above the data are the column headers and a column type selection dropdown.
Next we tell Ramiel about the structure of our data. This ensures that similarities are identified in the correct way. Ramiel accepts the following specifications:
- ID: A mandatory field which uniquely identifies each object.
- REAL: Numerical values.
- NOMINAL: Values that do not bear a quantitative relationship with each other (i.e., strings and numbers which represent non-numerical information).
- MULTI_PLAIN: Multiple NOMINAL values separated by spaces. Non-language specific.
- MULTI_ENGLISH: Multiple NOMINAL values separated by spaces. The text is English language.
- MULTI_SPANISH: Multiple NOMINAL values separated by spaces. The text is Spanish language.
- MULTI_JAPANESE: Multiple NOMINAL values separated by spaces. The text is Japanese language.
- ITEM_SET: A series of values with weights. (Formatted as item1:weight1;item2:weight2;item3:weight3)
- IGNORE: The column shall be ignored by the program.
- META: This column is for metadata and shall be ignored by the program, but information will be retained in the output.
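To make the two compound formats concrete, here is a minimal Python sketch of how a MULTI_* cell and an ITEM_SET cell could be split into their parts before upload. The function names and the sample values are hypothetical; this only illustrates the formats described above and is not how Ramiel parses them internally.

```python
def split_multi(cell: str) -> list[str]:
    """Split a MULTI_* cell into its space-separated NOMINAL values."""
    return cell.split()


def parse_item_set(cell: str) -> dict[str, float]:
    """Parse 'item1:weight1;item2:weight2' into {item: weight}."""
    weights: dict[str, float] = {}
    for pair in cell.split(";"):
        pair = pair.strip()
        if not pair:  # tolerate a trailing ';' or stray spaces
            continue
        # Split on the last ':' so item names may themselves contain ':'.
        item, sep, weight = pair.rpartition(":")
        if not sep or not item:
            raise ValueError(f"expected 'item:weight', got {pair!r}")
        weights[item.strip()] = float(weight)
    return weights


# Hypothetical sample cells, invented for illustration.
tags = split_multi("loan mortgage savings")
basket = parse_item_set("savings:0.7;mortgage:0.2;loan:0.1")
```

Running a check like this over a column before upload is a cheap way to catch malformed cells that would otherwise be assigned the wrong type.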
Giving Ramiel the correct specifications to work with is critical for ensuring that you get ideal responses. Let’s look at a sample of the data and use it to assign the correct specifications.
This file contains data about bank customers and marketing campaigns. Various marketing campaigns were conducted via phone calls to the customers. This table shows the first fifteen lines of the file.
Id  age  job            marital  education  balance  contact   day  month  duration  camp  pdays  prev  poutcome
1   30   unemployed     married  primary    1787     cellular  19   oct    79        1     -1     0     unknown
2   33   services       married  secondary  4789     cellular  11   may    220       1     339    4     failure
3   35   management     single   tertiary   1350     cellular  16   apr    185       1     330    1     failure
4   30   management     married  tertiary   1476     unknown   3    jun    199       4     -1     0     unknown
5   59   blue-collar    married  secondary  0        unknown   5    may    226       1     -1     0     unknown
6   35   management     single   tertiary   747      cellular  23   feb    141       2     176    3     failure
7   36   self-employed  married  tertiary   307      cellular  14   may    341       1     330    2     other
8   39   technician     married  secondary  147      cellular  6    may    151       2     -1     0     unknown
9   41   entrepreneur   married  tertiary   221      unknown   14   may    57        2     -1     0     unknown
10  43   services       married  primary    -88      cellular  17   apr    313       1     147    2     failure
11  39   services       married  secondary  9374     unknown   20   may    273       1     -1     0     unknown
12  43   admin.         married  secondary  264      cellular  17   apr    113       2     -1     0     unknown
13  36   technician     married  tertiary   1109     cellular  13   aug    328       2     -1     0     unknown
14  20   student        single   secondary  502      cellular  30   apr    261       1     -1     0     unknown
15  31   blue-collar    married  secondary  360      cellular  29   jan    89        1     241    1     failure
Each row of this file corresponds to a customer of the bank. Each row has 14 columns:
- 1. ID The customer’s ID. It is a unique numerical code which we will use to identify each row of our data by assigning it the ID column type.
- 2. AGE The customer’s age. Ages are numerical values which can be directly compared to each other, so this column receives the REAL type.
- 3. JOB The type of job that the customer has. This is a single string (text) value which represents information of a classification, so we will label it NOMINAL.
- 4. MARITAL The customer’s marital status. Like JOB, this is NOMINAL.
- 5. EDUCATION The customer’s education level. Like JOB, this is NOMINAL.
- 6. BALANCE The customer’s average yearly balance, in euros. Amounts of money can be directly compared to each other, so this column receives the REAL type.
- 7. CONTACT The communication type used in contacting the customer for marketing purposes. Like JOB, this is NOMINAL.
- 8. DAY The day of the month on which the customer was most recently contacted. Because day numbers convey the order of the days within a month, this column is REAL.
- 9. MONTH The month in which the customer was most recently contacted. Like JOB, this is NOMINAL.
- 10. DURATION The duration of the most recent contact, in seconds. Durations of time can be directly compared to each other, so this column receives the REAL type.
- 11. CAMP Which campaign targeted the customer. Like JOB, this is NOMINAL.
- 12. PDAYS The number of days that have passed since the client was last contacted in a campaign (-1 if there has been no contact). Durations of time can be directly compared to each other, so this column receives the REAL type.
- 13. PREV The number of previous marketing contacts performed on this customer. Numbers of contacts can be directly compared to each other, so this column receives the REAL type.
- 14. POUTCOME The outcome of previous marketing efforts, if known. Like JOB, this is NOMINAL.
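The assignments above can be written down as a simple mapping and checked against a sample row before they are selected in the dropdowns. The following is a minimal Python sketch; the dictionary and the `check_row` helper are hypothetical conveniences and not part of the simMachines platform. The column names follow the header of the sample table.

```python
# Column types as assigned in the list above.
COLUMN_TYPES = {
    "id": "ID", "age": "REAL", "job": "NOMINAL", "marital": "NOMINAL",
    "education": "NOMINAL", "balance": "REAL", "contact": "NOMINAL",
    "day": "REAL", "month": "NOMINAL", "duration": "REAL",
    "camp": "NOMINAL", "pdays": "REAL", "prev": "REAL",
    "poutcome": "NOMINAL",
}


def check_row(row: dict[str, str]) -> list[str]:
    """Return the names of REAL columns whose value is not a number."""
    bad = []
    for name, column_type in COLUMN_TYPES.items():
        if column_type == "REAL":
            try:
                float(row[name])
            except ValueError:
                bad.append(name)
    return bad


# First data row of the sample table, paired with the column names.
first_row = dict(zip(
    COLUMN_TYPES,
    "1 30 unemployed married primary 1787 cellular 19 oct 79 1 -1 0 unknown".split(),
))

# ID is mandatory and must be assigned to exactly one column.
id_columns = [name for name, t in COLUMN_TYPES.items() if t == "ID"]
print(id_columns, check_row(first_row))
```

An empty list from `check_row` means every column marked REAL holds a numeric value in that row.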
Angel Specification and Creation
Angels also accept specifications, which can modify their behavior and access to server resources. For the purposes of this tutorial, and for most circumstances, the default parameters are fine, but let’s go over the options.
- Storage Units: Specifies the amount of memory the server devotes to this Angel. Larger files or Angels with stricter search parameters may require additional memory. Each unit is 2 GB. (Range 1 to 6, default 1)
- Parallelism: Specifies the number of servers redundantly running the Angel. (Default 2)
- Ramiel K: Specifies the number of results for the nearest neighbor search. (Default 10)
- Pivots: The number of primary search points in the engine. (Range 256 to 1024, default 256)
- Probability: Minimum accepted probability for the results; any result with a lower probability will be discarded. (Range 0 to 1, default 0.95)
- Accepted Error: Maximum accepted difference in distance between returned objects and the query object. (Minimum 1, default 1.2)
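If you want to keep a record of the settings you chose, the options and their documented ranges can be captured in a small Python structure. This is only a sketch: on the platform these values are entered in the web form, and the class name, field names, and the lower bounds marked as assumed are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RamielSettings:
    """Defaults and ranges as listed above (names are hypothetical)."""
    storage_units: int = 1        # 2 GB each, range 1 to 6
    parallelism: int = 2
    k: int = 10
    pivots: int = 256             # range 256 to 1024
    probability: float = 0.95     # range 0 to 1
    accepted_error: float = 1.2   # minimum 1

    @property
    def memory_gb(self) -> int:
        """Memory implied by the storage units, at 2 GB per unit."""
        return 2 * self.storage_units

    def problems(self) -> list[str]:
        """Return a message for each value outside its range."""
        found = []
        if not 1 <= self.storage_units <= 6:
            found.append("storage_units must be between 1 and 6")
        if self.parallelism < 1:  # lower bound assumed, not documented
            found.append("parallelism must be at least 1")
        if self.k < 1:            # lower bound assumed, not documented
            found.append("k must be at least 1")
        if not 256 <= self.pivots <= 1024:
            found.append("pivots must be between 256 and 1024")
        if not 0 <= self.probability <= 1:
            found.append("probability must be between 0 and 1")
        if self.accepted_error < 1:
            found.append("accepted_error must be at least 1")
        return found


defaults = RamielSettings()
print(defaults.memory_gb, defaults.problems())
```

With the defaults, the Angel is given one storage unit (2 GB) and no range problems are reported.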
Once all desired specifications have been entered and you have given your Angel a name, click “Create”.
You will be taken to the “Your Angels” page where you can see the status of the Angels that you have created. Once your Angel’s status is “running” it is ready to answer queries. Depending on the file size it may take a few seconds to a few minutes for an Angel to complete initialization.
Querying Your Angel
You now have a functional Ramiel Angel capable of answering queries. Let’s try it out!
When querying a Ramiel Angel you have two options: Execute Query and Batch Query. Both are accessed from the “Angel Actions” dropdown on the “Your Angels” screen.
You can now submit queries to identify the nearest neighbor of any given object. You can query a new object by entering its values into the available fields, or select an existing object by selecting the folder with your data in the “Choose your folder” dropdown and clicking the “Fill Query Fields” button next to the object.
Let’s generate a query and see Ramiel in action!
Populate the fields with the object whose nearest neighbors you want to find and click “Execute Query”.
The results will appear immediately. Ramiel’s results are ordered by distance, with the nearest neighbor first. The ID column contains the ID of the neighboring object, and the distance column contains the distance between the returned object and the queried object.
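To see what a result list of this shape means, here is a small brute-force k-nearest-neighbor sketch in Python over three columns of the sample data. It is not Ramiel’s algorithm: Ramiel uses its own index (see the Pivots, Probability, and Accepted Error options), whereas this sketch compares the query with every row. The distance function is an assumption chosen for illustration: REAL columns contribute their absolute difference scaled by the column’s range, and NOMINAL columns contribute 0 on a match and 1 otherwise.

```python
import heapq

TYPES = {"id": "ID", "age": "REAL", "job": "NOMINAL", "balance": "REAL"}

# Five rows from the sample table, restricted to the columns above.
ROWS = [
    {"id": "1", "age": "30", "job": "unemployed", "balance": "1787"},
    {"id": "2", "age": "33", "job": "services", "balance": "4789"},
    {"id": "3", "age": "35", "job": "management", "balance": "1350"},
    {"id": "4", "age": "30", "job": "management", "balance": "1476"},
    {"id": "6", "age": "35", "job": "management", "balance": "747"},
]


def mixed_distance(a, b, types, spans):
    """Illustrative distance: scaled |a - b| for REAL, 0/1 for NOMINAL."""
    total = 0.0
    for name, column_type in types.items():
        if column_type == "REAL":
            span = spans[name] or 1.0  # avoid dividing by zero
            total += abs(float(a[name]) - float(b[name])) / span
        elif column_type == "NOMINAL":
            total += 0.0 if a[name] == b[name] else 1.0
    return total  # ID columns are skipped


def nearest(query, rows, types, k=10):
    """Return up to k (id, distance) pairs, nearest neighbor first."""
    spans = {}
    for name, column_type in types.items():
        if column_type == "REAL":
            values = [float(r[name]) for r in rows]
            spans[name] = max(values) - min(values)
    scored = ((mixed_distance(query, r, types, spans), r["id"]) for r in rows)
    return [(rid, dist) for dist, rid in heapq.nsmallest(k, scored)]


# A new object that is not in the data set (values invented).
query = {"age": "34", "job": "management", "balance": "1400"}
for neighbor_id, distance in nearest(query, ROWS, TYPES, k=3):
    print(neighbor_id, round(distance, 3))
```

Exhaustive comparison like this grows with both the number of rows and the number of columns, which is the cost that a dedicated similarity search engine is designed to avoid.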
Ramiel also accepts batch queries. A batch query takes a file of objects and returns the k nearest neighbors for each object in the file. The output of a batch query is a CSV file with the ID of each queried object in the first column, followed by its k nearest neighbors, each written in the same ID (distance) format used in individual queries.
To run a batch query, select “Batch Query” from the “Angel Actions” dropdown next to the Angel on the “Your Angels” page.
Select the folder containing the file you wish to run a batch query on, name your output file, and specify an output folder. Then select “Execute Query”.
The status of your batch execution is visible at the bottom of the page. When the status is marked “Completed”, click the “Download Results” button to download the results.
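Once downloaded, the results file can be read with a few lines of Python. The sketch below assumes that each row holds the queried object’s ID followed by one cell per neighbor, and that each neighbor cell looks like `3 (0.21)`. That exact layout is an assumption based on the description above, and the sample text is invented, so check a real results file before relying on it.

```python
import csv
import io
import re

# Assumed cell layout: neighbor ID, then the distance in parentheses.
CELL = re.compile(r"^\s*(\S+)\s*\(\s*([-+0-9.eE]+)\s*\)\s*$")


def read_batch_results(handle):
    """Map each queried ID to its list of (neighbor_id, distance)."""
    results = {}
    for record in csv.reader(handle):
        if not record:  # skip blank lines
            continue
        neighbors = []
        for cell in record[1:]:
            match = CELL.match(cell)
            if match is None:
                raise ValueError(f"unexpected neighbor cell: {cell!r}")
            neighbors.append((match.group(1), float(match.group(2))))
        results[record[0]] = neighbors
    return results


# Invented sample in the assumed layout, not real Ramiel output.
sample = "1,3 (0.21),6 (0.36)\n2,10 (0.4),8 (0.55)\n"
parsed = read_batch_results(io.StringIO(sample))
print(parsed["1"])
```

For a downloaded file, pass an open file handle (opened with `newline=""`) instead of the `io.StringIO` object.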
Ramiel is simMachines’ lightning-fast k-Nearest Neighbor similarity search engine. Ramiel takes a specified object and finds the k most similar objects in a dataset. k-Nearest Neighbor is an incredibly powerful method of analyzing massive amounts of data, but its use is typically limited by the prohibitively high computational costs incurred as the number of dimensions to be measured increases. simMachines’ unique approach overcomes this barrier and provides a consistent response time of a fraction of a second without a decline in accuracy.