Zero Entropy Language
=====================

Idea. Maybe interesting.

How to prevent stylometry? Many techniques have been suggested. The most successful technique appears to be "catting": the user imitates another user's writing style.

Here is another idea. What if we remove the concept of a "writing style"? What if we use an alternative language, one which only allows one way to say any given phrase? Take the idea "there is one right way to do it", and apply it to language.

Identifying users will still be possible, of course. But this new language removes some bits of entropy, and could be useful.

Candidates for this mythical language? Maybe Lojban, or some other unambiguous language. The idea of removing ambiguity seems related. But I do not know these conlangs well enough to decide if it is helpful in reality.

Another idea: English, with a limited vocabulary. Limit sentences to the active voice. Limit words to the n most common.

One last thought. Maybe this could be combined with machine translation. This way, people can enter their thoughts in English, and get an anonymized version. I need to think about this in more detail.

Update
======

I did some research on the above techniques. Some updates.

Most such languages, including Lojban, won't work for this purpose. These languages still allow for things like different subject-verb orders, which could help deanonymize a user. The lack of good corpora for most constructed languages would also make this difficult.

Punctuation turns out to be another stylometric factor. maybe everyone could type like this in run on sentences with no punctuation whatsoever then it would be harder to figure out who was typing what. However, word order and unique phrasings can still give one away, so we need to do more than that.

Some automated software, like Hemingway, can suggest word replacements and rewrites into the active voice. This might remove some further signals.
This software is available as a web application; however, the JavaScript is simple enough to reverse-engineer. I might make a repository for doing so.

A decent amount of research is available on the WWW here: https://www.freehaven.net/anonbib/

The focus is on general anonymity, but some of the content is relevant to stylometry.

GAN
===

Right now my goal is to set up an evaluation framework. I will add some publicly available data to serve as a corpus.

(1) Deanonymizer: attempts to "deanonymize" the users.

(2) Reanonymizer: attempts to "reanonymize" the users. (I.e., make it so that they cannot be caught by the deanonymizer.)

(3) Meaningizer: attempts to check that the reanonymized texts preserve the meaning of the original.

It seems that there is a tradeoff between these parts. If we get rid of the Meaningizer, then we can anonymize users very easily: just delete all their writing! The goal of the Reanonymizer should be to get the message out while removing common tells which may give the user away.

The interaction between the Deanonymizer and the Reanonymizer reminds me of the concept of a generative adversarial network in machine learning.

Corpus
------

The first question: where to get a corpus? Ideally, such a corpus would be representative of the sort of discussion we want to anonymize. I want to produce anonymization software for forum posts (as an example).

Possible data sources:

- PSAL Corpus [psal]
- Reddit, other news aggregators
- Email dumps or leaks (Enron, Hacking Team, and so on)
  - Enron dump is in PSAL corpus
- TODO: more?

[psal]: https://github.com/psal/anonymouth/tree/master/jsan_resources/corpora

Deanonymizer
------------

TODO: What techniques are used in the literature to deanonymize users?

Possible deanonymizers:

- JStylo [jstylo]
  - Basic-9
  - Writeprints
- Simple baseline techniques?

[jstylo]: https://github.com/psal/jstylo

Reanonymizer
------------

TODO: What techniques are used in the literature to anonymize users?
(These techniques must be automatic.)

Possible reanonymizers:

- Anonymouth [anonymouth]
  Not automatic. Small sample size (because it's not anonymous). Focus on maintaining "meaning".
- Hemingway [hemingway]
  Not automatic. Basically a glorified thesaurus. Some of it could be automated. Would lose "meaning".

[anonymouth]: https://github.com/psal/anonymouth
[hemingway]: http://www.hemingwayapp.com/

Meaningizer
-----------

TODO: What techniques in the NLP literature are used to compare meaning? Ideally, these techniques would work well at the sentence level, but also at the document or paragraph level. A reanonymizer might end up reordering or breaking apart sentences.

TODO: one possible technique would be to have humans evaluate this, for example by using a service like MTurk. This could be a good validation technique for whatever metrics we use?

Metrics
-------

How do we measure if a given deanonymizer is successful? (What should the desired output format of our deanonymizer even be?)

Combining Signals
-----------------

Both the Deanonymizer and Reanonymizer would be amenable to being "combined": multiple results could be merged together to find a "combination" of results which best de/reanonymizes the users. (A linear combination may suffice.) Such a modularization technique would be very interesting, and would allow new contributors to help find signals.
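
A minimal sketch of what a linear combination of Deanonymizer signals might look like. It assumes an output format that these notes have not settled on: each signal returns one score per candidate author, with higher meaning more likely. The names `combine_signals` and `best_guess`, the two example signals, the scores, and the weights are all hypothetical and only serve as illustration.

```python
from typing import Dict, List

# Assumed (hypothetical) output format: candidate author -> score.
Scores = Dict[str, float]


def combine_signals(signals: List[Scores], weights: List[float]) -> Scores:
    """Weighted sum of per-author scores across several signals."""
    if len(signals) != len(weights):
        raise ValueError("need exactly one weight per signal")
    combined: Scores = {}
    for scores, weight in zip(signals, weights):
        for author, score in scores.items():
            combined[author] = combined.get(author, 0.0) + weight * score
    return combined


def best_guess(combined: Scores) -> str:
    """Candidate author with the highest combined score."""
    return max(combined, key=combined.get)


# Made-up scores from two hypothetical signals for one anonymous post.
punctuation_signal = {"alice": 0.5, "bob": 0.25, "carol": 0.25}
vocabulary_signal = {"alice": 0.25, "bob": 0.5, "carol": 0.25}

# The vocabulary signal is trusted three times as much as punctuation here.
combined = combine_signals([punctuation_signal, vocabulary_signal], [1.0, 3.0])
print(best_guess(combined))
```

A new contributor would only need to supply another score dictionary and a weight. How to choose the weights is left open; fitting them on a held-out part of the corpus is one option.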