Assessing the Scalability of Birdwatch

An Initial Investigation
Trust and Safety
Python
Twitter
Author

Michael Mullarkey

Published

November 13, 2022

Why Should I Care?

A recent Washington Post article dove deep on Twitter’s crowd-sourced fact-checking program Birdwatch. Twitter’s owner touted the volunteer-driven initiative around the same time 15% of Trust and Safety staff were laid off. A week later a large number of content moderation contractors[1] were fired without notice and the head of Trust and Safety resigned.

Data related to trust and safety are rarely open for audit, but Birdwatch[2] has provided open source data, code, and documentation since it started as a small pilot program in January 2021. I decided to do an initial assessment of the open data with an eye toward Birdwatch’s scalability as a content moderation tool.

Load Packages

We don’t need a lot of Python packages to do this analysis,[3] and I try to keep my dependencies as light as possible without making my life a nightmare.

Code
import numpy as np
import pandas as pd

Import Birdwatch Data

We can download the data from this page, where the Birdwatch data is updated daily.

I downloaded this data on November 13, 2022. If you’re accessing this data in the future, the analyses will not exactly reproduce since new data will be included. Feel free to grab the data I’m using from the GitHub repo for this post.

First we’ll make a quick function to read the .tsv files in as pandas DataFrames.

Code
def read_tsv(path):
    """Read a tab-separated Birdwatch export into a pandas DataFrame."""
    return pd.read_csv(path, sep="\t")

Then we’ll apply that function to all of the Birdwatch data.

Code
paths = ["notes-00000.tsv", "noteStatusHistory-00000.tsv", "ratings-00000.tsv"]

# Read each file in order: notes, note status history, ratings
initial_dfs = []
for path in paths:
    initial_data = read_tsv(path)
    initial_dfs.append(initial_data)

Exploring Birdwatch Data

To make it easier to explore each data frame, we’ll assign them to separate objects outside the list. The documentation for all three datasets provided by Birdwatch is here.

Code
initial_notes, initial_history, initial_ratings = initial_dfs

Before I move forward, one huge positive of this project is that data is available for outside research. Allowing external audits of content moderation approaches is tricky, and I commend the Birdwatch folks for their transparency so far.

A Quick Primer on How Birdwatch Works

Any Twitter user can join Birdwatch with the ultimate goal of adding notes to tweets. Those notes can fact-check, provide additional context, and in theory deter disinformation.

There are checks and balances on the Birdwatch system geared toward stopping bad-faith actors. While anyone can join Birdwatch, you cannot write your own initial notes on tweets when you first join. First, you must consistently submit ratings of others’ initial notes that agree with the other Birdwatch members’ general consensus.

Those ratings from Birdwatch members also serve as a powerful bottleneck for which initial notes ultimately appear on tweets. The ranking for whether a note is “helpful” enough to apply to a tweet is more complex than a majority vote among Birdwatch members.

Instead, initial notes that receive a few positive ratings from people who normally disagree on their ratings are more likely to be rated as helpful than initial notes that receive many positive ratings from people who normally agree.

This system is known as bridge-based ranking, and it algorithmically prioritizes this form of consensus over potential alternatives. The Washington Post article notes this approach is unlikely to scale, especially in “an era when left and right often lack a shared set of facts.”
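To make the contrast with a majority vote concrete, here is a toy sketch. It is not Birdwatch’s actual ranking algorithm: the two rater groups, the ratings, and the `toy_bridging_score` function are all hypothetical, and the "minimum across groups" rule is just one simple stand-in for rewarding agreement between people who usually disagree.

```python
import pandas as pd

# Hypothetical ratings: note A gets six "helpful" ratings from one group only,
# note B gets three "helpful" ratings spread across both groups.
toy_ratings = pd.DataFrame({
    "note": ["A"] * 6 + ["B"] * 3,
    "rater_group": ["group_1"] * 6 + ["group_1", "group_2", "group_2"],
    "helpful": [1] * 9,
})

# Majority-style score: total number of helpful ratings per note
majority_score = toy_ratings.groupby("note")["helpful"].sum()

def toy_bridging_score(ratings):
    """Toy stand-in for bridging: helpful ratings from the least supportive group."""
    per_group = (
        ratings.groupby(["note", "rater_group"])["helpful"]
        .sum()
        .unstack(fill_value=0)  # groups that never rated a note count as 0
    )
    return per_group.min(axis=1)

bridging_score = toy_bridging_score(toy_ratings)

print(majority_score.to_dict())
print(bridging_score.to_dict())
```

In this made-up example note A wins a raw vote count, while note B comes out ahead once support has to come from both groups.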

To see how well this approach does or does not scale right now, let’s dive into the data.

How Many Tweets Have Initial Notes?

Code
# Grouping by tweetId and counting number of notes per tweet

tweets_initial_notes = initial_notes.groupby("tweetId").count()

# Use f-strings to get key info

print(f"Birdwatchers have put initial notes on {len(tweets_initial_notes)} tweets since January of 2021.")
Birdwatchers have put initial notes on 28723 tweets since January of 2021.

How Does This Compare to the Total Volume of Tweets?

For context, there are approximately 500,000,000 tweets sent per day.

Even if we assume that only 0.0001% of tweets (one in a million) require the scrutiny of Birdwatch, that would mean 500 tweets should be considered for notes per day.
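To put the 500-per-day benchmark in proportion, here is a minimal sketch of the arithmetic. It takes the approximate daily volume quoted above as given, and the variable names are mine, for illustration only.

```python
# Approximate figure quoted above, not a measured value
tweets_per_day = 500_000_000
benchmark_tweets_per_day = 500

# Share of daily tweet volume that the benchmark represents
share = benchmark_tweets_per_day / tweets_per_day
one_in = tweets_per_day // benchmark_tweets_per_day

print(f"{benchmark_tweets_per_day} tweets is {share:.4%} of daily volume, or 1 in {one_in:,}")
```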

We can be extra generous and say fewer tweets than that might require notes, but we’d still expect around 500 notes per day. How many days in the Birdwatch data meet that criterion?

Code
# Converting to date instead of milliseconds since epoch

initial_notes["dateCreated"] = pd.to_datetime(initial_notes["createdAtMillis"], unit = "ms").dt.date

# Counting the number of notes per day

tweet_initial_dates = initial_notes.groupby(["dateCreated"]).count()

# Finding the earliest date

min_tweet_date = tweet_initial_dates.index.min()

# Finding the days on which 500 or more notes were created

days_500_per = tweet_initial_dates[tweet_initial_dates.tweetId >= 500]

# Getting the value of the only date on which >= 500 notes were created

only_date_over = days_500_per.index.values

print(f"In all Birdwatch data going back to {min_tweet_date}, there was {len(days_500_per)} day where at least 500 notes were written - {only_date_over[0]}")
In all Birdwatch data going back to 2021-01-23, there was 1 day where at least 500 notes were written - 2021-01-28

Even with this relaxed criterion, there was only 1 day, at the very beginning of Birdwatch, on which the community reviewed even approximately 0.0001% (one in a million) of a day’s tweets.

This relatively low review volume is understandable given Birdwatch is an almost all-volunteer effort. However, this precedent of not operating at scale becomes concerning if Birdwatch is expected to play a large role in preventing disinformation on the platform.
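For a rough sense of the typical day, here is a back-of-the-envelope sketch using only figures already reported in this post: the count of tweets with at least one initial note, the earliest note date, and my download date. It averages tweets with notes (not notes written) over the whole period, so treat it as a coarse approximation.

```python
from datetime import date

# Figures reported earlier in this post
tweets_with_notes = 28723
first_note_date = date(2021, 1, 23)
download_date = date(2022, 11, 13)

# Count both the first and last day of the period
days_covered = (download_date - first_note_date).days + 1

avg_per_day = tweets_with_notes / days_covered

print(f"Roughly {avg_per_day:.1f} tweets received an initial note per day over {days_covered} days.")
```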

How Many Initial Notes Need More Ratings to Determine Their Helpfulness?

Code
# Getting status for which notes were rated as helpful or not

note_status = initial_history[["noteId","currentStatus"]]

# Seeing what percentage of notes with evaluations need more evaluation

status_counts = note_status.currentStatus.value_counts()

# Naming the column explicitly keeps this working across pandas versions
status_counts.to_frame("currentStatus")\
.assign(percent = lambda x: (x["currentStatus"] / x["currentStatus"].sum()) * 100)\
.round(2)
                             currentStatus  percent
NEEDS_MORE_RATINGS                   14517    86.90
CURRENTLY_RATED_HELPFUL               1506     9.01
CURRENTLY_RATED_NOT_HELPFUL            683     4.09

Another indication that an all-volunteer effort isn’t enough to scale this form of content moderation - nearly 87% of initial notes need more ratings to determine whether they could be helpful or not.

All initial notes start out as “Needs More Ratings” until they’ve received at least 5 ratings, and it appears a vast majority of notes never meet that threshold.

There could be multiple reasons for this, ranging from charitable[4] to less so.[5] There could also be reasons internal to the Birdwatch community I’m unaware of that drive this pattern.

No matter the reason, the current Birdwatch system is failing to identify whether a vast majority of initial notes are helpful. This is true even though the volume of initial notes is infinitesimal compared to the total volume of tweets. If more initial notes were written to better keep up with overall tweet volume, there’s a chance this lack-of-ratings problem would be exacerbated.
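One way to probe the five-rating threshold directly would be to count ratings per note. The sketch below uses small made-up tables rather than the real files, and it assumes the ratings data identifies each rated note through a `noteId` column; check the Birdwatch documentation before running something like this on `initial_ratings`.

```python
import pandas as pd

MIN_RATINGS = 5  # threshold described above

# Hypothetical stand-ins for the notes and ratings tables (made-up IDs)
toy_notes = pd.DataFrame({"noteId": [1, 2, 3]})
toy_ratings = pd.DataFrame({"noteId": [1, 1, 1, 1, 1, 2, 2]})

# Ratings per note, with notes that were never rated filled in as 0
ratings_per_note = (
    toy_ratings["noteId"]
    .value_counts()
    .reindex(toy_notes["noteId"], fill_value=0)
)

below_threshold = ratings_per_note[ratings_per_note < MIN_RATINGS]

print(f"{len(below_threshold)} of {len(toy_notes)} toy notes have fewer than {MIN_RATINGS} ratings")
```

The `reindex` step matters because a note with zero ratings never appears in the ratings table at all, so it would silently drop out of a plain count.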

Example: The Tweet With the Most Initial Notes Had No Notes with Enough Ratings

Code
print(f"The tweet with the most initial notes had {tweets_initial_notes.noteId.max()} notes.")

# tweets_initial_notes[tweets_initial_notes.noteId == 58]
# Can use this website to get tweets from tweetId without using the API https://www.bram.us/2017/11/22/accessing-a-tweet-using-only-its-id-and-without-the-twitter-api/
The tweet with the most initial notes had 58 notes.

The tweet with the most initial notes was by Rep. Alexandria Ocasio-Cortez in response to Senator Ted Cruz. The tweet touched on the trading platform Robinhood’s decision to prevent retail investors from trading certain stocks and the January 6th insurrection.

This tweet ultimately did not have a note attached to it.

A tweet can end up without a note attached to it for 2 reasons:
1. There is no note rated as helpful
2. There is at least one note rated as helpful but the Tweet is not marked as “potentially misleading”
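The two conditions above can be written as a small rule. This is only my reading of the list, sketched as a hypothetical helper (`tweet_shows_note` is not part of any Birdwatch code), using the status labels that appear in the note status data.

```python
def tweet_shows_note(note_statuses, marked_potentially_misleading):
    """Return True only if neither of the two blocking conditions applies."""
    has_helpful_note = "CURRENTLY_RATED_HELPFUL" in note_statuses
    return has_helpful_note and marked_potentially_misleading

# Reason 1: no note rated as helpful
print(tweet_shows_note(["NEEDS_MORE_RATINGS"] * 3, True))

# Reason 2: a helpful note exists, but the tweet is not marked potentially misleading
print(tweet_shows_note(["CURRENTLY_RATED_HELPFUL"], False))
```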

In this case no initial note was rated as helpful, and to boot none of the initial notes had enough ratings to even be considered.

Code
# Getting all noteIds in reference to the AOC tweet into a list

aoc_note_ids = initial_notes[initial_notes["tweetId"] == 1354848253729234944].noteId.to_list()

# Filtering the note history based on this list and counting values

initial_history[initial_history["noteId"].isin(aoc_note_ids)].currentStatus.value_counts()
NEEDS_MORE_RATINGS    54
Name: currentStatus, dtype: int64

Even if you believe this tweet should not have received a note,[6] it’s troubling that its status remained up in the air rather than seeing a definitive “not helpful” label applied to all initial notes.
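The counts above also differ slightly: 58 initial notes on the tweet, but 54 rows in the status output. I haven’t confirmed why. One way to check whether some notes simply lack a status row is a left merge with an indicator column, sketched here on made-up tables with hypothetical IDs rather than the real data.

```python
import pandas as pd

# Hypothetical stand-ins for the notes and note status history tables
toy_notes = pd.DataFrame({"noteId": [10, 11, 12, 13], "tweetId": [1, 1, 1, 2]})
toy_history = pd.DataFrame({
    "noteId": [10, 11, 13],
    "currentStatus": ["NEEDS_MORE_RATINGS"] * 3,
})

# indicator=True adds a "_merge" column showing where each row came from
merged = toy_notes.merge(toy_history, on="noteId", how="left", indicator=True)

# Notes present in the notes table but absent from the status history
missing_status = merged[merged["_merge"] == "left_only"]

print(missing_status["noteId"].to_list())
```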

Would More Birdwatch Members Solve All These Problems?

The two previous scalability issues could, at least in principle, be solved by having a lot more people joining Birdwatch. More initial notes could be written, more initial notes could receive ratings, and the system could achieve at least some scalability.

However, there are reasons to believe that, under its current standards, more members could actually make Birdwatch less scalable.

Think back to the example of the tweet with the most initial notes ever. Lots of people wrote initial notes, but nowhere near enough people rated all those initial notes.

It’s possible Birdwatch has better procedures in place now, but it seems like more Birdwatch members could exacerbate this coordination problem. Too many people writing initial notes, and - after the initial probationary period - not enough people rating initial notes.

Summary of Findings

  1. Birdwatch currently reviews an extraordinarily low volume of overall tweets
  2. A vast majority of initial notes do not receive enough ratings to assess their helpfulness
  3. Birdwatch is not likely to scale well in its current form

Suggested Action Items

  1. Pay people to perform Birdwatch’s functions

    Paying people fair wages and professionalizing the approach to which tweets receive notes would likely yield a fantastic return on investment. This approach would help increase the volume of initial notes, ensure enough ratings for each initial note, and provide crucial coordination capabilities such as consistently prioritizing high-value tweets.[7] If Twitter wants to retain a bottom-up, participatory research style branch of content moderation they can do so. However, I recommend they consult with Trust and Safety professionals along with Birdwatch members on what form that participatory research should take.

  2. Re-situate Birdwatch as one of many content moderation tools

    Birdwatch doesn’t need to massively scale to be useful. It could function well as a small part of a suite of content moderation tools. However, if Twitter relies on the current iteration of Birdwatch as its central content moderation tool I would expect massive content moderation failures. Not because Birdwatch community members are performing poorly, but because they would be community members performing the tasks of many teams’ worth of employees.

Conclusion

This analysis is only possible because Birdwatch has open data. Trust and Safety measures require some procedures be kept under lock and key, and the considered transparency baked into Birdwatch’s approach since January 2021 is admirable. The now-defunct META team at Twitter also made considered transparency a consistent practice.

The more teams follow these examples the better we’ll be able to moderate content in helpful, just ways.

There are many stones still left to turn in this data. For example, I think[8] that a vast majority[9] of the tweets with initial notes are in English. Someone could look into that and contextualize the volume of tweets Birdwatch hasn’t even attempted to moderate.

I hope I can inspire at least a couple of other people to take a closer look, and if you find anything interesting please get in touch.

Footnotes

  1. Including a contractor making critical changes to child safety workflows: https://twitter.com/CaseyNewton/status/1591608307927556096?s=20&t=4lurUg2rjlnq6mZ8xqquNQ

  2. Now referred to by some people as Community Notes, though I’ll be using Birdwatch throughout

  3. And if you don’t care about the code you can ignore it. You don’t need to know Python to read this post!

  4. The Birdwatch community actively doesn’t bother rating initial notes from obvious trolls, notes on low value tweets, or some other combination of undesirable features

  5. There just aren’t enough people who can volunteer their time to such an intensive effort so most initial notes never receive enough ratings

  6. Cards on the table, I don’t think this tweet needs a note

  7. I think the Birdwatch community has on balance elected to prioritize high-value tweets such as misleading tweets from Twitter’s owner. I’m also certain Twitter has more data that could help a system like this nip misinformation in the bud before it’s accrued millions of impressions

  8. But haven’t directly confirmed!

  9. And maybe all