...
By the end of 2018, the dataset included more than 1.04 million articles. Of these, roughly 11,000 (about 1 percent) had been removed by the internet censorship system.
This approach allows us to detect content that was removed after it was posted, but it misses content that was censored before publication. Most Chinese social media platforms are equipped with a keyword filter that automatically blocks sensitive information before it is published. State censorship authorities create and constantly update a list of keywords, which is then handed down to platform operators. This explains why certain politically sensitive topics, such as the Xinjiang re-education camps, rarely appear in our dataset.
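
To make the blind spot concrete, here is a minimal sketch of a pre-publication keyword filter. It is an illustration under stated assumptions, not a description of any platform's actual system: the function names, the simple substring matching, the keyword list, and the sample posts are all hypothetical.

```python
def passes_keyword_filter(text, blocked_keywords):
    """Return True if the text contains none of the blocked keywords.

    Hypothetical helper: real filters are likely far more elaborate
    than case-insensitive substring matching.
    """
    lowered = text.casefold()
    return not any(kw.casefold() in lowered for kw in blocked_keywords)


def publish(posts, blocked_keywords):
    """Keep only posts that pass the filter; rejected posts are never published."""
    return [p for p in posts if passes_keyword_filter(p, blocked_keywords)]


# Placeholder keyword list and posts, invented for illustration only.
blocked = {"placeholder-term"}
submitted = ["an ordinary article", "an article mentioning placeholder-term"]
published = publish(submitted, blocked)

# A monitor that tracks published articles only ever sees `published`,
# so the filtered post can never be recorded as "removed after posting".
print(published)
```

Because the second post is rejected before it goes live, a collection method that watches for later removals has nothing to observe, which is consistent with the gap described above.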