Linking Online Profiles
I took a seminar called Quantifying Data Privacy during my senior year at Wellesley College. The primary goal of the course was to explore the different ways data scientists and computer scientists measure data privacy. We did this by reading research papers, poring over current event articles, and running our own experiments.
A requirement of this course was a final project that combined coding with concepts we had learned in class. Over the course of the semester, I became interested in pseudo-anonymous profiles on online platforms like Reddit. In particular, I wondered if people shared enough information that they could actually be identified. For my final project, I worked with two other students to see how closely we could correlate profiles across three social media platforms: Reddit, Twitter, and Instagram.
For the main part of this project, we did the following:
We grabbed 1,000 Reddit profiles off the front page and turned them into Redditor objects using the PRAW library. These Redditor objects contained information such as username, comments, popular subreddits, most common posting times, and word frequency counts.
We then compared the Redditor objects with similarly named objects on Twitter and Instagram. So, for example, we would compare a Redditor object with a username like ‘baseballFan’ to objects with the same name created from Twitter and Instagram APIs.
We did some basic comparisons to estimate how confidently we could programmatically tie ‘baseballFan’ on Reddit to ‘baseballFan’ on Twitter or Instagram. Things we compared:
Normalized word frequencies combined with similar comment or post topics
Location and other proper noun references
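To give a sense of what the word frequency comparison can look like, here is a minimal Python sketch. It is not our original project code: the function names and sample text are hypothetical, and cosine similarity is just one reasonable way to score two normalized frequency profiles against each other.

```python
import math
import re
from collections import Counter


def word_frequencies(texts):
    """Return each word's share of all words across the given texts."""
    words = []
    for text in texts:
        # Lowercase and keep only runs of letters and apostrophes.
        words.extend(re.findall(r"[a-z']+", text.lower()))
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalizing by the total lets long and short profiles be compared.
    return {word: n / total for word, n in counts.items()}


def cosine_similarity(freq_a, freq_b):
    """Score two frequency profiles from 0.0 (no overlap) to 1.0 (identical)."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical text for the same username on two platforms.
reddit_comments = [
    "The Red Sox bullpen looked great tonight",
    "Heading to Fenway this weekend",
]
twitter_posts = [
    "Fenway this weekend, go Red Sox",
    "That bullpen was great",
]

score = cosine_similarity(
    word_frequencies(reddit_comments),
    word_frequencies(twitter_posts),
)
print(f"Similarity: {score:.2f}")
```

A score like this only suggests a link when the shared words are distinctive. Common words inflate the similarity between any two profiles, so a real comparison would likely filter stop words or weight rarer terms more heavily.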
Overall, through the coding portion of the project, we were only able to find slight indications of correlations between profiles across social media sites. However, if I were to redo this project today, I believe I would be able to come away with more conclusive results. Regardless, this experiment confirmed the following: people share a ton of personal information assuming that they are completely anonymous!
We also conducted informal research using Amazon Mechanical Turk. We created a survey to be filled out by self-described Redditors (people who regularly interact as a pseudo-anonymous profile) about their choices and behaviors around online profiles.
Overall, we received 342 responses. Out of the respondents, 65% were male and the majority were between the ages of 18 and 35. Although the survey only provided anecdotal evidence, the results were consistent with what we saw in our coding experiment and in the articles we had read in class. In particular, we found these insights to be interesting:
Most people (~70%) only had one account on each platform (Reddit, Twitter, Instagram)
Women were slightly more likely to use different usernames across platforms
Younger respondents were more likely to use different names across platforms
We used a final free-text question in the survey to prompt the respondents to write about anything they considered important while making decisions around creating new online profiles. From these responses, we created a word cloud that proved to be a great discussion piece in our class.