Zero Entropy Language
=====================

Idea. Maybe interesting. How to prevent stylometry?  Many techniques have been
suggested.

Most successful technique appears to be "catting".  The user imitates another
user's writing style.

Here is another idea.  What if we remove the concept of a "writing style"?
What if we use an alternative language, one which only allows one way to say
any given phrase? Take the idea: "there is one right way to do it", and apply
it to language.  Identifying users will still be possible, of course, but this
new language removes some bits of entropy, and that could be useful.

Candidates for this mythical language? Maybe Lojban, or some other unambiguous
language.  The idea of removing ambiguity seems correlated with removing style.
But I do not know these constructed languages well enough to decide whether
this helps in reality.

Another idea: English, with a limited vocabulary. Limit sentences to the active
voice. Limit words to the n most common.
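
A rough sketch of the "n most common words" restriction, in Python. The
function names and the tiny reference text are made up for illustration; a real
version would build the vocabulary from a large corpus. It only flags words
outside the allowed vocabulary and does not try to replace them.

```python
import re
from collections import Counter


def build_vocabulary(reference_text, n):
    """Return the set of the n most common words in reference_text."""
    words = re.findall(r"[a-z']+", reference_text.lower())
    return {word for word, _ in Counter(words).most_common(n)}


def out_of_vocabulary(text, vocabulary):
    """List words in text that are not allowed, in order, without repeats."""
    flagged = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in vocabulary and word not in flagged:
            flagged.append(word)
    return flagged


# Toy reference text (an assumption, standing in for a real corpus).
reference = "the cat sat on the mat and the dog sat on the rug"
vocabulary = build_vocabulary(reference, 3)
flagged = out_of_vocabulary("The dog sat on the veranda", vocabulary)
```

The flagged words are the ones the writer would have to rephrase by hand, or
that some later tool would have to substitute.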

One last thought. Maybe this could be combined with machine translation.  This
way, people can enter their thoughts in English, and get an anonymized version.

I need to think about this in more detail.


Update
======

I did some research on the above techniques. Some updates.

Most constructed languages, such as Lojban, won't work for this purpose. These
languages still allow for things like different subject-verb orders, which
could help deanonymize a user.  The lack of good corpora for most constructed
languages would also make this difficult.

Punctuation turns out to be another stylometric factor. maybe everyone could
type like this in run on sentences with no punctuation whatsoever then it would
be harder to figure out who was typing what. However word orders and unique
phrasings can still give one away, so we need to do more than that.
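
The punctuation idea is easy to try. A minimal sketch in Python, assuming we
also fold case (my addition, since capitalization is a similar signal). The
function name is hypothetical.

```python
import re


def strip_punctuation(text):
    """Lowercase text, drop punctuation, and collapse whitespace."""
    lowered = text.lower()
    # Keep word characters and whitespace; drop everything else.
    without_punctuation = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(without_punctuation.split())


normalized = strip_punctuation("Well... I think so, yes!  Don't you?")
```

Note that this also removes apostrophes, so "Don't" becomes "dont". Word order
and phrasing pass through untouched, which is exactly the remaining problem.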

Some automated software, like Hemingway, can suggest word replacements and
using the active voice. This might remove some further signals. This software
is available as a web application; however, the JavaScript is simple enough to
reverse-engineer. I might make a repository for doing so.

A decent amount of research is available on the WWW here:
https://www.freehaven.net/anonbib/
The focus is on general anonymity, but some of the content is relevant to
stylometry.


GAN
===

Right now my goal is to set up an evaluation framework.  I will add some
publicly available data to serve as a corpus.  The framework has three parts:

    (1) Deanonymizer: attempts to "deanonymize" the users.
    (2) Reanonymizer: attempts to "reanonymize" the users. (I.e., make it so
    that they cannot be caught by the deanonymizer.)
    (3) Meaningizer: attempts to check that the reanonymized texts preserve the
    meaning of the original.

It seems that there is a tradeoff between these parts. If we get rid of
Meaningizer, then we can anonymize users very easily: just delete all their
writing! The goal of the Reanonymizer should be to get the message out while
removing common tells which may give the user away.
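
To make the tradeoff concrete, here is a sketch of how the three parts might
plug together, in Python. Everything here is an assumption: each part is
modelled as a plain function, and the toy stand-ins exist only to exercise the
loop. None of this is a real deanonymizer or meaning metric.

```python
def evaluate(samples, deanonymize, reanonymize, meaning):
    """Score one reanonymizer against one deanonymizer and one meaningizer.

    samples is a list of (author, text) pairs.
    """
    caught = 0
    meaning_total = 0.0
    for author, text in samples:
        rewritten = reanonymize(text)
        if deanonymize(rewritten) == author:
            caught += 1
        meaning_total += meaning(text, rewritten)
    return {
        "attribution_rate": caught / len(samples),
        "mean_meaning": meaning_total / len(samples),
    }


# Toy stand-ins (hypothetical), only to exercise the loop.
def toy_deanonymize(text):
    if not text:
        return "unknown"
    return "alice" if "!" in text else "bob"


def toy_meaning(original, rewritten):
    original_words = set(original.split())
    if not original_words:
        return 1.0
    return len(original_words & set(rewritten.split())) / len(original_words)


samples = [("alice", "Hi there!"), ("bob", "hello everyone")]
unchanged = evaluate(samples, toy_deanonymize, lambda text: text, toy_meaning)
deleted = evaluate(samples, toy_deanonymize, lambda text: "", toy_meaning)
```

Leaving the text unchanged keeps all the meaning but every author is caught.
Deleting everything catches nobody but keeps no meaning. A useful Reanonymizer
has to land somewhere between those two corners.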

The interaction between the Deanonymizer and the Reanonymizer reminds me of the
concept of a generative adversarial network in machine learning.

Corpus
------

The first question: where to get a corpus?  Ideally, such a corpus would be
representative of the sort of discussion we want to anonymize.  I want to
produce an anonymization software for forum posts (as an example).

Possible data sources:
    - PSAL Corpus [psal]
    - Reddit, other news aggregators
    - Email dumps or leaks (Enron, Hacking Team, and so on)
        - Enron dump is in PSAL corpus
    - TODO: more?

[psal]: https://github.com/psal/anonymouth/tree/master/jsan_resources/corpora

Deanonymizer
------------

TODO: What techniques are used in the literature to deanonymize users?

Possible deanonymizers:
    - JStylo [jstylo]
        - Basic-9
        - Writeprints
    - Simple baseline techniques?

[jstylo]: https://github.com/psal/jstylo
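
For the "simple baseline" entry, one possibility is a character trigram profile
per author, compared by cosine similarity. This is my own sketch in Python, not
what JStylo does, and the function names and sample texts are invented for
illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def guess_author(known_texts, unknown_text):
    """Return the author whose known text is closest to unknown_text."""
    profile = char_ngrams(unknown_text)
    scores = {
        author: cosine(char_ngrams(text), profile)
        for author, text in known_texts.items()
    }
    return max(scores, key=scores.get)


# Invented sample texts, far too short for a real test.
known_texts = {
    "alice": "I really, really love long walks; truly, I do.",
    "bob": "lol idk man whatever works i guess lol",
}
guess = guess_author(known_texts, "idk lol whatever man")
```

A baseline like this is mostly useful as a floor: any Reanonymizer that cannot
beat it is not worth testing against the heavier tools.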

Reanonymizer
------------

TODO: What techniques are used in the literature to anonymize users?  (These
techniques must be automatic.)

Possible reanonymizers:
    - Anonymouth [anonymouth]
        Not automatic.
        Small sample size (because it's not anonymous).
        Focus on maintaining "meaning".
    - Hemingway [hemingway]
        Not automatic.
        Basically a glorified thesaurus.
        Some of it could be automated.
        Would lose "meaning".

[anonymouth]: https://github.com/psal/anonymouth
[hemingway]: http://www.hemingwayapp.com/

Meaningizer
-----------

TODO: What techniques in the NLP literature are used to compare meaning?
Ideally, these techniques would work well at the sentence level, but also at
the document or paragraph level. A reanonymizer might end up reordering or
breaking apart sentences.

TODO: one possible technique would be to have humans evaluate this, for
example by using a crowdsourcing service like MTurk. This could be a good
validation technique for whatever metrics we use?
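
Until a real metric is chosen, a crude placeholder could be word-set overlap
(Jaccard similarity) at the document level. This is an assumption on my part
and only measures shared vocabulary, not meaning. It does have one property we
want: reordering or splitting sentences does not change the score. A sketch in
Python, with hypothetical function names:

```python
import re


def word_set(text):
    """Return the set of lowercased words in text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_similarity(original, rewritten):
    """Shared words divided by all words, between 0.0 and 1.0."""
    a, b = word_set(original), word_set(rewritten)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "The server is down. We restart it tonight."
reordered = "We restart it tonight. The server is down."
reordered_score = jaccard_similarity(original, reordered)
deleted_score = jaccard_similarity(original, "")
```

The obvious weakness is that a synonym swap lowers the score even when the
meaning is intact, so this would punish exactly the rewrites we want.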

Metrics
-------

How do we measure if a given deanonymizer is successful? (What should the
desired output format of our deanonymizer even be?)
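
One option, assuming the deanonymizer outputs a ranked list of candidate authors
for each text (best guess first): measure top-k accuracy, the fraction of texts
whose true author appears in the first k guesses. A sketch in Python, with a
hypothetical function name and invented rankings:

```python
def top_k_accuracy(rankings, true_authors, k=1):
    """Fraction of texts whose true author is among the first k guesses."""
    hits = sum(
        1
        for ranked, truth in zip(rankings, true_authors)
        if truth in ranked[:k]
    )
    return hits / len(true_authors)


# Invented rankings for three texts, all written by "alice".
rankings = [
    ["alice", "bob", "carol"],
    ["bob", "alice", "carol"],
    ["carol", "bob", "alice"],
]
true_authors = ["alice", "alice", "alice"]
top_1 = top_k_accuracy(rankings, true_authors, k=1)
top_2 = top_k_accuracy(rankings, true_authors, k=2)
```

A ranked list is more informative than a single guess: an adversary who narrows
a user down to two candidates has still learned a lot.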

Combining Signals
-----------------

Both the Deanonymizer and Reanonymizer would be amenable to being
"combined"--multiple results could be merged together to find a "combination"
of results which best de/reanonymize the users. (A linear combination may
suffice.) Such a modularization technique would be very interesting, and allow
new contributors to help find signals.
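
A sketch of the linear combination in Python, assuming each signal returns a
score per candidate author (higher means more likely). The function name, the
scores, and the weights are all invented; in practice the weights would have to
be fitted on held-out data.

```python
def combine_scores(signal_scores, weights):
    """Weighted sum of per-author scores from several signals."""
    combined = {}
    for scores, weight in zip(signal_scores, weights):
        for author, score in scores.items():
            combined[author] = combined.get(author, 0.0) + weight * score
    return combined


# Two invented signals that disagree about the author.
signal_scores = [
    {"alice": 0.9, "bob": 0.1},
    {"alice": 0.2, "bob": 0.8},
]
weights = [0.25, 0.75]
combined = combine_scores(signal_scores, weights)
best_guess = max(combined, key=combined.get)
```

With this shape, a new contributor only has to write one function that returns
a score per author, and the combiner takes care of the rest.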