Blog

New: CodeEval Code Comparison (Plagiarism Detection)

CodeEval is now even smarter with the launch of our code comparison (plagiarism detection) engine.

One of the challenges with providing relevant information and realistic code rankings for developers on CodeEval is building a comprehensive system to protect the community against code plagiarism.

The open web makes sharing code easy, and we want to make sure that we're providing information and code rankings that are actually relevant, and that the reputation of the platform and our developers is protected. One of the requirements for this is a system that addresses cheating and copied code, so that when you see someone's code rank, you can be confident that they wrote the code themselves.

After considering a number of algorithms for finding plagiarism in source code, we decided to build a custom similarity detection engine based on the most current academic research in the area of "Winnowing". Here's one example of the research we looked at: Winnowing: Local Algorithms for Document Fingerprinting

While we're not going to discuss everything we're doing or exactly how we do it... the gist of it is that every submission goes through an analyzer that splits the source code into lexemes (the basic lexical units of a language). We remove any dependence on the names of variables, classes, and so on, then apply a hashing algorithm and the principle of the minimum hash to choose a fingerprint that characterizes the source code. Finally, we compare fingerprints with each other; if they're similar, it means the code was duplicated.
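
To make the general idea more concrete, here's a minimal sketch of that kind of winnowing-based fingerprinting in Python. This is an illustration only, not our production engine: the keyword list, k-gram size, window size, and hash function are all placeholder choices, and a full implementation (like the one described in the paper above) also tracks token positions and handles ties more carefully.

    import hashlib
    import re

    # Placeholder keyword set; a real analyzer would use the full keyword
    # list of each supported language.
    KEYWORDS = {"def", "return", "if", "else", "elif", "for", "while",
                "class", "import", "in", "not", "and", "or"}

    def tokenize(source):
        # Split the source into lexemes and replace every identifier with a
        # generic token, so renaming variables/functions/classes has no effect.
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
        return [t if t in KEYWORDS or not re.match(r"[A-Za-z_]", t) else "ID"
                for t in tokens]

    def kgram_hashes(tokens, k=5):
        # Hash every run of k consecutive tokens.
        grams = (" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
        return [int(hashlib.md5(g.encode()).hexdigest(), 16) & 0xFFFFFFFF
                for g in grams]

    def winnow(hashes, w=4):
        # Keep the minimum hash from every window of w consecutive hashes;
        # the selected hashes form the submission's fingerprint.
        return {min(hashes[i:i + w]) for i in range(max(len(hashes) - w + 1, 0))}

    def similarity(fp_a, fp_b):
        # Share of fingerprints two submissions have in common (Jaccard index).
        if not fp_a or not fp_b:
            return 0.0
        return len(fp_a & fp_b) / len(fp_a | fp_b)

    def fingerprint(source):
        return winnow(kgram_hashes(tokenize(source)))

A copied solution where only the identifiers have been renamed tokenizes to the same stream and therefore produces essentially the same fingerprint, which is exactly what the features below rely on.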

A few of the features this supports:

  • You can organize a database to accelerate the one-against-everyone check (see the sketch after this list).
  • Works across all of our 18 supported programming languages.
  • Benefits of the tokenized representation - we automatically ignore the names of functions and variables (classes, objects, and so on), so small changes to the program code don't affect the duplication check.
  • Moving small pieces of code around in the source has only a slight effect on the result of the duplication search.
  • The algorithm is insensitive to permutations of chunks of code.
  • Since the system compares new code to all existing solutions, solutions found online and submitted (even if they've been manipulated) are easily identified.
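
As a rough illustration of the one-against-everyone check mentioned above, fingerprints can be kept in an inverted index so a new submission only needs a full comparison against solutions it shares at least one fingerprint with. This sketch reuses the hypothetical fingerprint and similarity helpers from the earlier example; it is not our actual storage layer.

    from collections import defaultdict

    class FingerprintIndex:
        # Toy inverted index: fingerprint hash -> ids of submissions containing it.
        def __init__(self):
            self.postings = defaultdict(set)
            self.fingerprints = {}

        def add(self, submission_id, source):
            fp = fingerprint(source)            # helper from the sketch above
            self.fingerprints[submission_id] = fp
            for h in fp:
                self.postings[h].add(submission_id)

        def most_similar(self, source):
            # Only submissions sharing at least one fingerprint are candidates,
            # so a new submission never has to be scanned against the whole corpus.
            fp = fingerprint(source)
            candidates = set().union(*(self.postings[h] for h in fp)) if fp else set()
            best_id, best_score = None, 0.0
            for sid in candidates:
                score = similarity(fp, self.fingerprints[sid])
                if score > best_score:
                    best_id, best_score = sid, score
            return best_id, best_score

In practice the index would live in a database rather than in memory, but the shape of the lookup is the same.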

Of course, displaying this information publicly on profiles might be a little sensitive for some developers. It's not intended to offend anyone... it's designed to give you some insight into your code, add some credibility to your profile, and protect the community from those who are cheating or trying to game the rankings.

The comparison engine has been running behind CodeEval for a while now while we thought through how we wanted to present this information to developers. We've tried a number of things and learned a lot. There are benefits that come with simplicity: looking at the results returned in our tests, it becomes fairly evident who has copied code (even if they've manipulated it) and who has written unique code that simply has some similarity because it solves the same challenge (as could be expected). There are some thumbs on the scale in different regards. For example, simpler solutions have a higher probability of returning similar results, since the code isn't as long and it's designed to solve a specific challenge, so some duplication is likely. Still, while we feel that we have a pretty good idea, we didn't want to be in the position of reporting false positives. So we decided against making any kind of true/false judgement and instead deliver the results as a percentage of 'uniqueness' when compared with other submissions, leaving the final judgement with the viewer.

Log in to your CodeEval account to see this in action.


How we calculate the percentage...

Whenever code is submitted for a challenge, we collect information about duplication across a user's solutions. We then calculate the ratio of challenges with a duplicate to the total number of solved challenges, and from that ratio we obtain the percentage of uniqueness.
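
For example, a user who has solved 20 qualifying challenges with duplicates detected in 3 of them would score 85% unique. A minimal sketch, assuming the percentage is simply the complement of the flagged ratio:

    def uniqueness_percentage(solved, flagged_as_duplicate):
        # Share of a user's solved challenges with no detected duplicate.
        # Assumes uniqueness is just the complement of the flagged ratio.
        if solved == 0:
            return 100.0
        return 100.0 * (solved - flagged_as_duplicate) / solved

    print(uniqueness_percentage(20, 3))   # -> 85.0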

Notes:

  • A percentage of 'not unique' code does not mean that the code was plagiarized, only that it is similar to other submissions (which is to be expected to some extent).
  • We currently don't check the source code for easy-level challenges, since that code tends to be similar.
  • We will be continually tweaking this to improve the results, so changes are to be expected.
  • While we keep all of the submissions for each coding challenge, we only use the results from your last successful submission. So, if you have a result that is less than favorable, you can write original code and improve your uniqueness score.

Feel free to email us for feedback or leave a comment below.

-CodeEval Team


Additional References:

  1. Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the web. ACM Transactions on Internet Technology (TOIT), 1(1):2–43,  2001.

  2. Brenda S. Baker. On finding duplication and near-duplication in large software systems. In L. Wills, P. Newcomb, and E. Chikofsky, editors, Second Working Conference on Reverse Engineering, pages 86–95, Los Alamitos, California, 1995. IEEE Computer Society Press.

  3. Brenda S. Baker and Udi Manber. Deducing similarities in java sources from byte codes. In Proc. of Usenix Annual Technical Conf., pages 179–190, 1998.

  4. Sergey Brin, James Davis, and Hector Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the ACM SIGMOD Conference, pages 398–409, 1995.

  5. Andrei Broder. On the resemblance and containment of documents. In SEQS: Sequences ’91, 1998.

  6. Andrei Broder, Steve Glassman, Mark Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In Proceedings of the Sixth International World Wide Web Conference, pages 391–404, April 1997.

  7. Nevin Heintze. Scalable document fingerprinting. In 1996 USENIX Workshop on Electronic Commerce, November 1996.

  8. Richard M. Karp and Michael O. Rabin. Pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.

  9. Udi Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1–10, San Francisco, CA, USA, 17–21 1994.

  10. Peter Mork, Beitao Li, Edward Chang, Junghoo Cho, Chen Li, and James Wang. Indexing tamper resistant features for image copy detection, 1999. URL: citeseer.nj.nec.com/mork99indexing.html.

  11. Narayanan Shivakumar and Hector Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.

  12. Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995.