Photo: Casey Chin
The vocabulary you select, your syntax, and your grammatical decisions leave behind a signature. Automated tools can now accurately identify the author of a forum post, for example, as long as they have adequate training data to work with. But newer research shows that stylometry, the statistical analysis of linguistic style, can also apply to artificial language samples, like code. Software developers, it turns out, leave behind a fingerprint as well.
Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt's former PhD student and now an assistant professor at George Washington University, have found that code, like other forms of stylistic expression, is not anonymous. At the DefCon hacking conference Friday, the pair will present a number of studies they've conducted using machine learning techniques to de-anonymize the authors of code samples. Their work could be useful in a plagiarism dispute, for instance, but it also has privacy implications, especially for the thousands of developers who contribute open source code to the world.
How To De-Anonymize Code
Here's a simple explanation of how the researchers used machine learning to uncover who authored a piece of code. First, the algorithm they designed identifies all the features found in a selection of code samples. That's a lot of different characteristics. Think of every aspect that exists in natural language: There are the words you choose, the way you put them together, sentence length, and so on. Greenstadt and Caliskan then narrowed the features to include only the ones that actually distinguish developers from each other, trimming the list from hundreds of thousands to around 50...
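The two steps described above, extracting many features and then keeping only the ones that separate authors, can be sketched in a few lines of Python. This is a toy illustration, not the researchers' actual pipeline: the four features, the function names (extract_features, rank_features, select_top), the sample data, and the variance-ratio score used for ranking are all assumptions chosen for brevity.

```python
import re
from collections import defaultdict
from statistics import mean, pvariance


def extract_features(source: str) -> dict[str, float]:
    """Turn one code sample into a few toy stylistic features (hypothetical set)."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    idents = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    n_lines = max(len(lines), 1)
    n_idents = max(len(idents), 1)
    return {
        "avg_line_length": mean([len(ln) for ln in lines]) if lines else 0.0,
        "tab_indent_ratio": sum(ln.startswith("\t") for ln in lines) / n_lines,
        "snake_case_ratio": sum("_" in w for w in idents) / n_idents,
        "comment_ratio": sum(ln.lstrip().startswith("#") for ln in lines) / n_lines,
    }


def rank_features(samples: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Score each feature by between-author variance over within-author variance.

    A high score means the feature differs across authors but stays stable
    for any one author. This score is an assumption made for the sketch.
    """
    by_author = defaultdict(list)
    for author, source in samples:
        by_author[author].append(extract_features(source))

    names = list(next(iter(by_author.values()))[0])
    scores = {}
    for name in names:
        per_author = [[f[name] for f in feats] for feats in by_author.values()]
        between = pvariance([mean(vals) for vals in per_author])
        within = mean([pvariance(vals) for vals in per_author])
        scores[name] = between / (within + 1e-9)  # epsilon avoids division by zero
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def select_top(samples: list[tuple[str, str]], k: int) -> list[str]:
    """Keep only the k most author-distinguishing feature names."""
    return [name for name, _ in rank_features(samples)[:k]]


# Made-up samples from two hypothetical authors with different habits.
SAMPLES = [
    ("alice", "def add_two(a, b):\n    return a + b\n"),
    ("alice", "def sub_two(x_val, y_val):\n    return x_val - y_val\n"),
    ("bob", "def addTwo(a, b):\n\treturn a + b\n"),
    ("bob", "def subTwo(xVal, yVal):\n\treturn xVal - yVal\n"),
]

if __name__ == "__main__":
    print(select_top(SAMPLES, 2))
```

In this made-up data, the tab-indentation feature ranks first because it is identical within each author and differs between them, while the comment feature ranks last because it never varies at all. A real system would start from a far larger feature set and hand the surviving features to a classifier.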
Plagiarism and Privacy Implications
Caliskan and Greenstadt say their work could be used to tell whether a programming student plagiarized, or whether a developer violated a noncompete clause in their employment contract. Security researchers could potentially use it to help determine who might have created a specific type of malware.
Read more...
Source: WIRED