Ramiel Tutorial (Adult Income Census)

In this tutorial we build a similarity search system that finds people with closely resembling characteristics. To follow the tutorial step by step, download the example data file from this link.

Angel Description

Ramiel is a similarity search engine that provides a type of query that is not available in traditional databases: it searches for the k closest elements under an arbitrary similarity criterion. Ramiel is built for high performance and ease of use, which makes it a natural choice for anyone who needs to handle large datasets, the goldmines of our times.
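To make the idea of a "closest k elements" query concrete, here is a minimal brute-force sketch in Python. It is not how Ramiel works internally (the engine's algorithm is not described in this tutorial); the function names `k_nearest` and `absolute_difference` are hypothetical and only illustrate that the similarity criterion can be any function of two objects.

```python
import heapq
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def k_nearest(
    query: T,
    items: Iterable[T],
    distance: Callable[[T, T], float],
    k: int,
) -> List[Tuple[float, T]]:
    """Return the k items closest to `query` as (distance, item) pairs, nearest first.

    Brute force: every item is compared with the query once.
    """
    scored = ((distance(query, item), item) for item in items)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


def absolute_difference(a: float, b: float) -> float:
    """A very simple similarity criterion for numeric values."""
    return abs(a - b)


# Ages taken from the first rows of the example file.
ages = [50, 38, 53, 28, 37]
print(k_nearest(40, ages, absolute_difference, k=2))
```

Swapping `absolute_difference` for another function changes the notion of similarity without touching the search itself.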

 


File Specification:

REAL            Real values.
NOMINAL         String values, or what you would consider an ENUM.
MULTI_ENGLISH   Free-form text. The text is in English.
MULTI_SPANISH   Multiple NOMINAL values separated by spaces. The text is in Spanish.
MULTI_JAPANESE  Multiple NOMINAL values separated by spaces. The text is in Japanese.
ITEM_SET        A series of values with weights, formatted as item1:weight1;item2:weight2;item3:weight3.
IGNORE          The column is ignored by the program.
ID              Your internal object ID, useful when returning results.
META            Arbitrary information stored in a string. Currently META is a reserved column type.
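As an illustration of the ITEM_SET format, the following Python sketch parses one cell into a dictionary. The helper name `parse_item_set` is hypothetical, and the sketch assumes that weights are numeric, which the specification above does not state explicitly.

```python
from typing import Dict


def parse_item_set(cell: str) -> Dict[str, float]:
    """Parse 'item1:weight1;item2:weight2' into {item: weight}.

    Assumption: weights are numbers. The last ':' in each entry
    separates the item from its weight.
    """
    weights: Dict[str, float] = {}
    for entry in cell.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        item, _, weight = entry.rpartition(":")
        if not item:
            raise ValueError(f"missing item or weight in {entry!r}")
        weights[item] = float(weight)
    return weights


print(parse_item_set("milk:2;bread:0.5;eggs:1"))
```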

 

The file should consider these specifications:

  • The file should be in TSV format, that is, values separated by tabs (‘\t’).
  • The file should have a header row with the same column names that were specified.
  • All files in the folder are read, so all of them should follow this format.
  • TSV Quote Character = ‘"’
  • TSV Line End = ‘\n’
  • TSV Escape Character = ‘\’
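If you generate the file yourself, the settings above map directly onto Python's standard `csv` module. The sketch below is one possible way to write and read such a file; the helper names `write_tsv` and `read_tsv` are hypothetical, and `doublequote=False` is an assumption (it makes embedded quotes use the escape character instead of being doubled).

```python
import csv
import io

# Settings taken from the list above: tab delimiter, double-quote quote
# character, backslash escape character, and "\n" line endings.
TSV_OPTIONS = {
    "delimiter": "\t",
    "quotechar": '"',
    "escapechar": "\\",
    "lineterminator": "\n",
    "doublequote": False,  # assumption: quotes are escaped with "\", not doubled
}


def write_tsv(header, rows):
    """Serialize a header and data rows to TSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, **TSV_OPTIONS)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def read_tsv(text):
    """Parse TSV text into a list of dicts keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text), **TSV_OPTIONS))


header = ["IDENTIFICATION", "AGE", "WORKCLASS"]
rows = [["2", "50", "Self-emp-not-inc"], ["3", "38", "Private"]]
text = write_tsv(header, rows)
print(text)
print(read_tsv(text))
```

To write to disk instead of a string, open the file with `newline=""` so that Python does not translate the line endings.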

Example:

This file contains information about people, such as age, occupation, education and marital status, among others. The data was collected by a bank to determine whether a person earns more or less than 50k per year. The table shows a few lines of the file.

IDENTIFICATION AGE WORKCLASS EDUCATION MARITAL-STATUS OCCUPATION RELATIONSHIP RACE SEX HOURS-PER-WEEK COUNTRY INCOME
2 50 Self-emp-not-inc Bachelors Married-civ-spouse Exec-managerial Husband White Male 13 United-States <=50K
3 38 Private HS-grad Divorced Handlers-cleaners Not-in-family White Male 40 United-States <=50K
4 53 Private 11th Married-civ-spouse Handlers-cleaners Husband Black Male 40 United-States <=50K
5 28 Private Bachelors Married-civ-spouse Prof-specialty Wife Black Female 40 Cuba <=50K
6 37 Private Masters Married-civ-spouse Exec-managerial Wife White Female 40 United-States <=50K
7 49 Private 9th Married-spouse-absent Other-service Not-in-family Black Female 16 Jamaica <=50K
8 52 Self-emp-not-inc HS-grad Married-civ-spouse Exec-managerial Husband White Male 45 United-States >50K
9 31 Private Masters Never-married Prof-specialty Not-in-family White Female 50 United-States >50K
10 42 Private Bachelors Married-civ-spouse Exec-managerial Husband White Male 40 United-States >50K
11 37 Private Some-college Married-civ-spouse Exec-managerial Husband Black Male 80 United-States >50K
12 30 State-gov Bachelors Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 40 India >50K
13 23 Private Bachelors Never-married Adm-clerical Own-child White Female 30 United-States <=50K
14 32 Private Assoc-acdm Never-married Sales Not-in-family Black Male 50 United-States <=50K
15 40 Private Assoc-voc Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male 40 Jamaica >50K

This file has 12 columns. The first column is the person's ID. The second column is the person's age. The third column specifies the type of work that the person has. The fourth column specifies the person's education level. The fifth column is the person's marital status. The sixth column specifies the occupation. The seventh column shows the person's relationship within the family. The eighth column specifies the person's race. The ninth column specifies the sex. The tenth column specifies the hours worked per week. The eleventh column specifies the person's country, and the last column shows whether the person earns more or less than 50k per year.

The purpose of this tutorial is to find people with similar characteristics.
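What "similar" means for a row that mixes numbers and categories depends on the distance function. The Python sketch below shows one common choice (a Gower-style average of per-column differences). It is only an illustration and not Ramiel's actual metric, which this tutorial does not specify; the function name `record_distance` and the value ranges are assumptions.

```python
def record_distance(a, b, real_ranges, nominal_columns):
    """Average per-column difference between two records, in [0, 1].

    real_ranges: {column: (low, high)} for numeric columns; differences are
                 scaled by the range so every column contributes equally.
    nominal_columns: columns compared by equality (0 if equal, 1 otherwise).
    """
    total = 0.0
    for column, (low, high) in real_ranges.items():
        span = high - low
        if span > 0:
            total += abs(float(a[column]) - float(b[column])) / span
    for column in nominal_columns:
        if a[column] != b[column]:
            total += 1.0
    return total / (len(real_ranges) + len(nominal_columns))


# Two people from the example file (IDs 3 and 4), restricted to four columns.
person_3 = {"AGE": 38, "HOURS-PER-WEEK": 40, "WORKCLASS": "Private", "EDUCATION": "HS-grad"}
person_4 = {"AGE": 53, "HOURS-PER-WEEK": 40, "WORKCLASS": "Private", "EDUCATION": "11th"}

# Assumed value ranges, chosen only for this example.
ranges = {"AGE": (0, 100), "HOURS-PER-WEEK": (0, 100)}
nominal = ["WORKCLASS", "EDUCATION"]

print(record_distance(person_3, person_4, ranges, nominal))
```

A distance of 0 means the two records agree on every compared column; larger values mean they differ on more columns or by a wider margin.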


Column Specs:

Our cloud suggests the following columns and types. This is only a recommendation; you can choose a different type for each column depending on your interests:

IDENTIFICATION  ID
AGE             REAL
WORKCLASS       NOMINAL
EDUCATION       NOMINAL
MARITAL-STATUS  NOMINAL
OCCUPATION      NOMINAL
RELATIONSHIP    NOMINAL
RACE            NOMINAL
SEX             NOMINAL
HOURS-PER-WEEK  REAL
COUNTRY         NOMINAL
INCOME          NOMINAL
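Before uploading, it can help to check locally that a file's header matches the column types you plan to select. The Python sketch below does this for the recommended types; `COLUMN_TYPES` and `check_header` are hypothetical names used only for this example, and the check for an ID column reflects the requirement that Ramiel needs a column of type ID.

```python
# Recommended column types for the example file.
COLUMN_TYPES = {
    "IDENTIFICATION": "ID",
    "AGE": "REAL",
    "WORKCLASS": "NOMINAL",
    "EDUCATION": "NOMINAL",
    "MARITAL-STATUS": "NOMINAL",
    "OCCUPATION": "NOMINAL",
    "RELATIONSHIP": "NOMINAL",
    "RACE": "NOMINAL",
    "SEX": "NOMINAL",
    "HOURS-PER-WEEK": "REAL",
    "COUNTRY": "NOMINAL",
    "INCOME": "NOMINAL",
}


def check_header(header, column_types):
    """Return a list of human-readable problems; an empty list means the header is fine."""
    problems = []
    for name in column_types:
        if name not in header:
            problems.append(f"missing column: {name}")
    for name in header:
        if name not in column_types:
            problems.append(f"column without a type: {name}")
    # Only columns that are actually present in the file can serve as the ID.
    if not any(column_types.get(name) == "ID" for name in header):
        problems.append("no column of type ID")
    return problems


# Header line of the example file, split on tabs.
header = (
    "IDENTIFICATION\tAGE\tWORKCLASS\tEDUCATION\tMARITAL-STATUS\tOCCUPATION\t"
    "RELATIONSHIP\tRACE\tSEX\tHOURS-PER-WEEK\tCOUNTRY\tINCOME"
).split("\t")
print(check_header(header, COLUMN_TYPES))
```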

Angel Parameters Specification:

These are the parameters needed for the angel creation:

Storage Units   Specifies the angel unit size reserved for creation.
Parallelism     Specifies the number of replicas of the angel that you want to create.
Ramiel K        Specifies the number of results returned by the nearest neighbor search.
Pivots          The number of primary search points in the engine.
Probability     Minimum accepted probability for the results; any result with a lower probability is discarded.
Accepted Error  Accepted search error between the distance calculated by the engine and the real distance.

  • Create Folder

    • Click on “Create Folder” to create a container for the CSV, TSV or JSON files that our similarity engine will search.

    • Provide the folder name and click on “Create Folder”.

    • Once the folder is created you will return to a folder list view.

  • Upload File(s)

    • In the folder that you created, click on “Upload File” to open the upload modal.

    • After choosing your files click on “Upload File”.

    • You can see the progress bar while the file is being uploaded.

    • Once the files are uploaded you will return to a folder list view.

  • Create your Angel

    • Go to the “Create Angel” section to choose the angel that works for your project.

    • For this example we are going to create a Ramiel. Click “Create” on the Ramiel image.

    • The next step is choosing the folder containing the files that you want to use to train the angel. When you choose the folder you can see a preview of the files.

    • If you want to change the type of any column, choose an option from the list of types. Ramiel requires that a column of type ID exists.

      Then click on “Next” to continue the creation.

    • The next step is to fill in the Ramiel parameters (the default parameters will work fine) and to choose a name for your angel.

      Click on “Create” to start the creation of the angel.

    • Once the creation has started, you can see a table with your current angels and the progress of the creation. When the state is “running”, the angel is ready to answer queries.

  • Query Your Angel

    • There are two ways to query: Execute Query and Batch Query. Both can be accessed from the “Your Angels” screen.

    • Execute Query

      Provide the values for the object that you want to query and then click on “Execute Query”.

    • Another option is to choose a folder containing the files that you want to use to query the angel.

    • Then you can click on “Fill Query Fields” to choose a row from the file to fill the query object. In this example we fill the fields with the third row shown in the preview of the data.

    • Now you only need to click on “Execute Query” to obtain your result. The result contains a list of the most similar objects, identified by their IDs, together with their respective distances to the query object.

    • Batch Query

      First choose a folder containing the files that you want to use to query the angel, then click on “Execute Query”. This creates a batch process.

    • Once the batch query has started, you can see a table with your current batch files and the progress of the execution. When the state is “completed”, the batch is ready for download.

Manfred Calvo