[CSEE-colloq] talk: Analytics for Detecting Web and Social Media Abuse, 1pm 3/16, ITE325, UMBC

Tim Finin finin at cs.umbc.edu
Thu Mar 15 18:03:01 EDT 2012


           Analytics for Detecting Web and Social Media Abuse

                         Justin Ma, UC Berkeley

               1:00pm Friday 16 March 2012, ITE 325, UMBC

The Web and online social media provide invaluable communication
services to a global Internet user base. The tremendous success of
these services, however, has also created valuable opportunities for
criminals and other miscreants to abuse them for their own gain. As a
result, detecting, monitoring, and curtailing this abuse is an
important yet challenging problem. The large scale and
diversity of these services, combined with the tactics used by
attackers, make it difficult to discern one clear and robust signal
for detecting abuse. One approach, relying on domain expertise, is to
construct a small set of well-crafted heuristics, but such heuristics
tend to become obsolete quickly. In this talk, I will describe more
robust approaches based on machine learning, statistical modeling, and
large-scale data analytics.

First I will describe online learning approaches for detecting
malicious Web sites (those involved in criminal scams) using lexical
and host-based features of the associated URLs. This application is
particularly well suited to online algorithms because the training
data is too large to process efficiently in batch and because the
features that typify malicious URLs evolve continuously.
Motivated by this application, we built a real-time system to gather
URL features and analyze them against a source of labeled URLs from a
large Web mail provider. Our system adapts in an online fashion to the
evolving characteristics of malicious URLs, achieving daily
classification accuracies of up to 99% on a balanced data set.
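
As a rough illustration of this kind of setup (not the speaker's
actual system), the Python sketch below trains an online URL
classifier one batch at a time. The hashed lexical features and the
SGD-trained linear model are assumptions made for the example; the
system described in the talk also relies on host-based features and a
labeled URL feed.

    # Illustrative sketch only: online URL classification from lexical tokens.
    # Feature set and model choice are assumptions, not the talk's system.
    import re
    from urllib.parse import urlparse

    from sklearn.feature_extraction import FeatureHasher
    from sklearn.linear_model import SGDClassifier

    hasher = FeatureHasher(n_features=2**18, input_type="string")
    model = SGDClassifier()  # online linear classifier, updated batch by batch

    def lexical_tokens(url):
        """Split a URL into lexical tokens from its hostname and path."""
        parts = urlparse(url)
        text = parts.netloc + " " + parts.path
        return [t for t in re.split(r"[\s./?=&_-]+", text) if t]

    def update(urls, labels):
        """One online update: hash each URL's tokens, take a gradient step."""
        X = hasher.transform(lexical_tokens(u) for u in urls)
        model.partial_fit(X, labels, classes=[0, 1])  # 1 = malicious, 0 = benign

    def predict(urls):
        """Classify a batch of URLs with the current model."""
        X = hasher.transform(lexical_tokens(u) for u in urls)
        return model.predict(X)

Because the model is updated incrementally, newly labeled URLs can be
folded in daily as the characteristics of malicious URLs drift.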

Next I will describe our ongoing efforts for creating analytics for
detecting social media abuse. Deciding on a universal definition of
social media abuse is difficult, as abuse is often in the eye of the
beholder. In light of this challenge, we explore a more formal
definition based on information theory. In particular, we hypothesize
that messages with low information content are likely to be
abusive. From this, we develop a content-complexity measure for
identifying abusive users, which has shown promise in our early
evaluations.
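
To make the intuition concrete, here is a small sketch of one way an
information-theoretic complexity score could be computed, using
compressed size as a proxy for information content. The talk does not
specify the exact measure, so this particular choice is an assumption
for illustration only.

    # Illustrative proxy for "content complexity": compression ratio of a
    # user's messages. Repetitive, low-information content compresses well
    # and scores low; varied content scores higher.
    import zlib

    def content_complexity(messages):
        """Ratio of compressed to raw size of a user's concatenated messages."""
        raw = "\n".join(messages).encode("utf-8")
        if not raw:
            return 0.0
        return len(zlib.compress(raw, 9)) / len(raw)

    # Users scoring below a chosen threshold could be flagged for review.
    spammer = ["Buy now!!! http://example.com"] * 50
    normal = ["Lunch at noon?", "Did you see the game last night?",
              "Sending the draft tomorrow."]
    print(content_complexity(spammer))  # low: repetitive, low-information
    print(content_complexity(normal))   # higher: more varied content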

Beyond our own experiments in the lab, this work has found success in
practice: companies serving hundreds of millions of
users have adopted these ideas to improve abuse detection within their
own services.

Justin Ma (http://bit.ly/jtma) is a postdoc in the UC Berkeley
AMPLab. His primary research is in systems security, and his other
interests include applications of machine learning to systems
problems, systems for large-scale machine learning, and the impact of
energy availability on computing. He received B.S. degrees in Computer
Science and Mathematics from the University of Maryland in 2004, and
he received his Ph.D. in Computer Science from UC San Diego in 2010.

Host: Anupam Joshi
See http://csee.umbc.edu/talks for more information

