One of the challenges with providing relevant information and realistic code rankings for developers on CodeEval is building in a comprehensive system to protect the community against code plagiarism.
The open web makes sharing code easy, and we want to make sure that we're providing information and code rankings that are actually relevant, and that the reputation of the platform and our developers is protected. One of the requirements for this is a system that addresses cheating and copied code, so that when you see someone's code rank, you can be confident that they wrote the code themselves.
After considering a number of algorithms for finding plagiarism in source code, we decided to build our own similarity detection engine based on current academic research in the area of "Winnowing". Here's one example of the research we took a look at: "Winnowing: Local Algorithms for Document Fingerprinting".
While we're not going to discuss everything we're doing or exactly how we do it, the gist is that every code submission goes through an analyzer that splits the source code into lexemes (the basic lexical units of a programming language, such as keywords, identifiers, operators and literals). We remove any dependence on the names of variables, classes, etc., then apply a hashing algorithm and the minimum-hash principle to choose a fingerprint that characterizes the source code. Finally, we compare the fingerprints with each other: if they're similar, it's a strong sign the code was duplicated.
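To make the idea more concrete, here is a minimal Python sketch of the general approach described above. It is not our production engine, just an illustration under some assumptions: it only handles Python submissions (using the standard `tokenize` module), it replaces every identifier with a placeholder `ID`, and the function names, the k-gram size `k=5`, the window size `window=4` and the Jaccard comparison are all choices made for this example. It also keeps only the minimum hash of each window, which is a simplification of the winnowing scheme in the paper (the paper additionally tracks positions).

```python
import hashlib
import io
import keyword
import tokenize

# Token types that carry no information about what the code does.
_SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
            tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalized_tokens(source):
    """Split Python source into lexemes, hiding identifier names."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")  # variable, function and class names all look alike
        else:
            tokens.append(tok.string)
    return tokens


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = "\x00".join(tokens[i:i + k])
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def winnow(hashes, window):
    """Keep the minimum hash of each sliding window as the fingerprint."""
    fingerprint = set()
    for i in range(max(len(hashes) - window + 1, 1)):
        chunk = hashes[i:i + window]
        if chunk:
            fingerprint.add(min(chunk))
    return fingerprint


def similarity(source_a, source_b, k=5, window=4):
    """Jaccard similarity of the two fingerprints, from 0.0 to 1.0."""
    fp_a = winnow(kgram_hashes(normalized_tokens(source_a), k), window)
    fp_b = winnow(kgram_hashes(normalized_tokens(source_b), k), window)
    if not fp_a or not fp_b:
        return 0.0  # too short to fingerprint
    return len(fp_a & fp_b) / len(fp_a | fp_b)


original = "def total(values):\n    result = 0\n    for v in values:\n        result += v\n    return result\n"
renamed = "def add_up(items):\n    acc = 0\n    for x in items:\n        acc += x\n    return acc\n"
print(similarity(original, renamed))
```

Because the renamed copy differs only in its identifier names, both versions reduce to the same token stream and therefore the same fingerprint. A real system would need a tokenizer per language and a carefully tuned threshold for what counts as "similar".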