ICANN commissioned a security study of those strings, in which we inferred over-represented patterns in common domain server requests. The patterns were extracted from the DITL data-set. Finding such patterns makes it possible to identify world-wide security threats. The results were presented at the OARC 2014 Spring Workshop in Warsaw, Poland.
The patterns led to the discovery of the JASBUG. More info here.
The abstract of the presentation follows:
Large scale regular expression recognition on the DITL data-set by using similarity search
The Day in the Life (DITL) data-set is collected to study and improve the integrity of the root server system. Among the different properties recorded in the data-set, we focus on second level domain (SLD) strings. In this study, we introduce a method that automatically infers regular expressions from over-represented SLD strings. First, we identify random strings and remove them from the data pipeline. Then, we find common string seeds that guide the elucidation process. Finally, we perform similarity search on strings that do not exceed a certain entropy level to generate a weight matrix, which is then converted into regular expressions and their corresponding visualizations. Similarity search is a very expensive operation, but we achieve fast results by using the simMachines R-01 similarity engine. The method may be used to preemptively discover security or performance issues in the infrastructure. During the talk, we will show a sample of the collected regular expressions so that the community may identify familiar and unfamiliar SLD patterns.
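To make the pipeline in the abstract more concrete, here is a minimal Python sketch of two of its steps: dropping high-entropy strings, and turning a group of similar strings into a per-position weight matrix and then a regular expression. It is an illustration under stated assumptions, not the method used in the study. The function names, the threshold of 3.0 bits, and the sample strings are all hypothetical. The similarity search itself (done with the simMachines R-01 engine in the study) is not reproduced; the sketch assumes the similar strings have already been grouped and have equal length.

```python
import math
import re
from collections import Counter
from typing import List


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution of s, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def drop_random_strings(strings: List[str], max_entropy: float) -> List[str]:
    """Keep only strings whose entropy does not exceed max_entropy.

    Assumption: per-string character entropy is a crude stand-in for the
    randomness test mentioned in the abstract.
    """
    return [s for s in strings if shannon_entropy(s) <= max_entropy]


def weight_matrix(strings: List[str]) -> List[Counter]:
    """Count how often each character occurs at each position."""
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("this sketch only handles equal-length strings")
    return [Counter(s[i] for s in strings) for i in range(length)]


def matrix_to_regex(matrix: List[Counter]) -> str:
    """Turn each column into a literal, a digit class, or a character class."""
    parts = []
    for column in matrix:
        symbols = sorted(column)
        if len(symbols) == 1:
            parts.append(re.escape(symbols[0]))   # constant position
        elif all(ch.isdigit() for ch in symbols):
            parts.append(r"\d")                   # generalize digits
        else:
            parts.append("[" + "".join(re.escape(ch) for ch in symbols) + "]")
    return "".join(parts)


# Hypothetical input: three similar SLD strings and one random-looking string.
candidates = ["printer01", "printer02", "printer17", "x7qk2vz9b4mh"]
kept = drop_random_strings(candidates, max_entropy=3.0)
pattern = matrix_to_regex(weight_matrix(kept))
```

With these hypothetical inputs, the random-looking string is filtered out and the remaining three strings collapse to the pattern `printer\d\d`. A real weight matrix would also keep the per-position frequencies so that rare symbols can be down-weighted or dropped; this sketch uses the counts only to decide which symbols appear at each position.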