Are They Toeing the Line? Diagnosing Privacy Compliance Violations among Browser Extensions
Our approach uses the state-of-the-art language model BERT to annotate policy texts, and a hybrid technique to analyze privacy-related elements (e.g., API calls and HTML objects) in both the static source code and the files generated dynamically at runtime. We collect a comprehensive dataset of 64,114 extensions within 42 hours in April 2022. To facilitate model training, we construct a corpus named PrivAud-100, which contains 100 manually annotated privacy policies. Based on this dataset and corpus, we conduct a systematic audit and identify widespread privacy compliance issues. We find that around 92% of the extensions have at least one violation in either their privacy policies or their data collection practices. We further propose an index to help filter and identify extensions with a high probability of privacy compliance violations. Our work should raise awareness among extension users, service providers, and platform operators, and encourage them to implement solutions for better privacy compliance. To facilitate future research in this area, we have released our dataset.
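To make the idea of comparing data collection practices against policy statements concrete, the following is a minimal sketch of the static half of such a check. It is an illustration under stated assumptions, not the pipeline used in this work: the pattern-to-category mapping, the category names, and the helper functions `collected_data_types` and `undisclosed_data_types` are all hypothetical, and the policy annotations are assumed to be already available as a set of declared data categories.

```python
import re
from typing import Iterable, Set

# Hypothetical mapping from API call patterns to data categories.
# This is an assumption for illustration, not the taxonomy used in the paper.
API_TO_DATA_TYPE = {
    r"chrome\.cookies\.\w+": "cookies",
    r"chrome\.history\.\w+": "browsing_history",
    r"chrome\.bookmarks\.\w+": "bookmarks",
    r"navigator\.geolocation\.\w+": "location",
}


def collected_data_types(source: str) -> Set[str]:
    """Return the data categories whose API patterns appear in the source text."""
    return {
        data_type
        for pattern, data_type in API_TO_DATA_TYPE.items()
        if re.search(pattern, source)
    }


def undisclosed_data_types(source: str, declared: Iterable[str]) -> Set[str]:
    """Return categories found in the code but absent from the declared set."""
    return collected_data_types(source) - set(declared)


# Toy extension source; a real analysis would also cover files generated at runtime.
example_source = """
chrome.history.search({text: ''}, handleResults);
navigator.geolocation.getCurrentPosition(handlePosition);
"""

# Assumed output of the policy annotation step for this toy extension.
declared_in_policy = {"location"}

print(sorted(undisclosed_data_types(example_source, declared_in_policy)))
# prints ['browsing_history']
```

A plain pattern match of this kind misses obfuscated or dynamically constructed calls, which is one reason a purely static scan would be insufficient on its own.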