Why Should I Care?
DuckDB has been all the rage on Data Twitter1 for ~983,000 Elon Musk attention spans. There are a lot of technical reasons to get excited,2 and I’ll focus on one in particular: its ability to process data without first copying it into a database.
Working with data that’s just small enough to fit in local memory can feel like the worst of all worlds. You’re using data wrangling tools designed for data that’s smaller, but creating a database seems like overkill. Enter DuckDB, which allows you to run SQL queries directly on Pandas data frames.3
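To give a taste of what that looks like before we dive in, here’s a minimal sketch (the toy data frame df and its columns below are illustrative stand-ins I made up, not part of what follows):
import duckdb
import pandas as pd

# A toy data frame standing in for whatever local data you already have
df = pd.DataFrame({"name": ["a", "b", "c"], "likes": [3, 10, 7]})

# DuckDB can reference the data frame by its Python variable name
print(duckdb.query("SELECT name, likes FROM df WHERE likes > 5").to_df())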
Now you can take advantage of a blazing-fast4 style of SQL on local data. And since I love getting a bit meta, let’s use this framework to analyze my Twitter archive.
Load Packages
First we’ll import the packages we need. If you want to skip the local timezone wrangling part you can eliminate the datetime, pytz, and tzlocal dependencies.
# Import packages, most for parsing the tweets.js file
import numpy as np
import pandas as pd
import duckdb
import json
import time
import datetime
import pytz
from dateutil.tz import tzlocal
Parsing My Tweets Data
We’ll create a function that can load the tweets.js file in as a Pandas data frame. The JavaScript file was a little tricky at first, but I realized it was just JSON with some extra text at the front.
Opening the file, removing that text, and then normalizing the JSON lets us read the tweets data in as a Pandas data frame without too much fuss.
# Adapting a parsing function from here but extended it to go into a DF
# https://github.com/dangoldin/twitter-archive-analysis/blob/master/analyze.py
def load_tweets_from_js(js_file):
    with open(js_file, "r") as f:
        data = f.read()
    data = data.replace("window.YTD.tweets.part0 = ", "")
    tweets = json.loads(data)
    for tweet in tweets:
        ts = datetime.datetime.strptime(
            tweet["tweet"]["created_at"], "%a %b %d %H:%M:%S +0000 %Y"
        )
        ts = ts.replace(tzinfo=pytz.utc)
        ts = ts.astimezone(tzlocal())
        tweet["timestamp"] = ts
    print("Loaded %d tweets" % len(tweets))
    tweets_df = pd.json_normalize(tweets)
    return tweets_df
Some Initial Tidying with Pandas
One great aspect of using local data is that we don’t have to abandon other tools we know and love just to use SQL. For example, we can use Pythonic methods to rename all our columns in a way that won’t collide with the SQL interpreter5 in a single line of code.
# Use the function to create a Pandas DataFrame
tweets = load_tweets_from_js("tweets.js")

# Replace periods with underscores so they don't confuse SQL later
tweets.columns = tweets.columns.str.replace(".", "_", regex=False)

# Convert multiple columns to numeric at once
# Only doing with two columns here but could work with many more
secret_numeric = ["tweet_favorite_count", "tweet_retweet_count"]
tweets[secret_numeric] = tweets[secret_numeric].apply(pd.to_numeric)

# Confirm the conversion worked (Commented out because it did!)
# tweets.dtypes
Loaded 5244 tweets
An Example DuckDB Query on a Pandas Data Frame
Now we can use DuckDB as a querying engine directly on this pandas data frame!
Here’s a simple example to start, where we use DuckDB to (gulp) look at how many likes my tweets have gotten on average:
print(duckdb.query("SELECT AVG(tweet_favorite_count) as avg_favorite_count FROM tweets").to_df())
   avg_favorite_count
0            5.127002
Comparing the Speed of DuckDB and Pandas on My Twitter Archive
Ok, so how long does this take on my non-enormous Twitter archive? Let’s create a timing function first.
def time_duckdb(full_query):
    start = time.time()
    duckdb_query = duckdb.query(full_query).to_df()
    end = time.time()
    print(round(end - start, 4))

time_duckdb("SELECT AVG(tweet_favorite_count) as avg_favorite_count FROM tweets")
0.0507
Pretty quick! How does this simple DuckDB query compare to an equivalent Pandas operation?
start_pd_simple = time.time()
test_pandas_avg = tweets.agg(avg_favorite_count=("tweet_favorite_count", "mean"))
end_pd_simple = time.time()
print(round(end_pd_simple - start_pd_simple, 4))
0.0039
On simple queries over non-enormous data6 Pandas still seems to be much faster than DuckDB. That lead would shrink and eventually disappear on larger data.
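If you want to poke at that claim yourself, here’s a rough sketch of one way to do it on synthetic data (the column name, distribution, and row count below are arbitrary choices of mine, not measurements from this post):
import time

import duckdb
import numpy as np
import pandas as pd

# Build a larger synthetic data frame to compare against
n = 5_000_000
big = pd.DataFrame({"favorite_count": np.random.poisson(5, n)})

# Time the same average in DuckDB...
start = time.time()
duckdb.query("SELECT AVG(favorite_count) AS avg_fav FROM big").to_df()
print("DuckDB:", round(time.time() - start, 4))

# ...and in Pandas
start = time.time()
big["favorite_count"].mean()
print("Pandas:", round(time.time() - start, 4))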
Sometimes We Might Prefer to Write SQL
And even if Pandas is faster on data of this size, sometimes we might prefer to use DuckDB anyway.
For example, the syntax for doing a filtered “group by” in SQL is more intuitive to me than the equivalent Pandas syntax. Either works, and as we can see here Pandas is still faster on data of this size!
time_duckdb("SELECT AVG(tweet_favorite_count) as avg_favorite_count FROM tweets WHERE tweet_retweet_count > 10 GROUP BY tweet_truncated")
0.0497
start = time.time()
filtered_pandas = tweets[tweets["tweet_retweet_count"] >= 1]
grouped_avg_pandas = filtered_pandas.groupby('tweet_truncated', as_index=False).agg(avg_favorite_count=("tweet_favorite_count", "mean"))
end = time.time()
print(round(end - start, 4))
0.0057
And since in this case the speed differential is negligible,7 having another option for data wrangling can be handy.
DuckDB also has the virtue of being far more scalable than Pandas. That scalability extends to much larger data, much more complex data wrangling, and both of those complexities combined.
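For instance, DuckDB can query files on disk directly, without loading them into a Pandas data frame at all. Here’s a sketch of what that can look like (tweets.parquet is a hypothetical file, not something created earlier in this post):
import duckdb

# Query a Parquet file directly; DuckDB reads only the columns it needs
# 'tweets.parquet' is a hypothetical export, not produced in this post
grouped_from_file = duckdb.query("""
    SELECT tweet_truncated, AVG(tweet_favorite_count) AS avg_favorite_count
    FROM read_parquet('tweets.parquet')
    GROUP BY tweet_truncated
""").to_df()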
Conclusion
I had a good time exploring a more compact use case for DuckDB. I was a bit surprised by how much faster the Pandas queries ran vs. DuckDB at “small-ish data” size. I figured the two approaches would be closer in speed, though this may be a quirk of running Python through Quarto.
I’m hoping this is just a warm-up act for my journey with DuckDB! I’m especially excited about DuckDB as a key catalyst for running the Modern Data Stack on a laptop, and if you’re also intrigued I recommend checking out this proof-of-concept by Jacob Matson.
If you have any questions or comments I’d love to hear from you! For at least a while I’m on Twitter, and other contact info is on my website.
Footnotes
1. Data Mastodon? Data 138-Different-Discord-Servers?
2. No external dependencies (like SQLite), but with a lot more features, thorough testing with CI, and it’s open source.
3. As of this writing I don’t believe you can work directly on data frames or tibbles in R, though you can interface with DuckDB using the dbplyr package: https://duckdb.org/docs/api/r
4. Or maybe not, as we’ll see soon!
5. Replacing all the “.” in variable names with “_” so the SQL interpreter doesn’t think we’re referencing a variable within another table.
6. In this case ~10 MB.
7. Both styles of queries are more than fast enough!