Seth Larson wrote a great blog post on querying a PyPI dataset to look for trends in the use of memory-safe languages in Python.
Check out Seth’s article for more information on the dataset (it’s a good read!). It caught our eye because it makes use of DuckDB to clean the data for analysis.
That’s right up our alley here in Ibis land, so let’s see if we can duplicate Seth’s results (and then continue on to plot them!).
Grab the data (locations)
Seth showed (and then safely decomposed) a nested curl statement, and that’s always viable, but we’re in Python land, so why not grab the filenames using urllib3?
import urllib3

url = "https://raw.githubusercontent.com/pypi-data/data/main/links/dataset.txt"

with urllib3.PoolManager() as http:
    resp = http.request("GET", url)

parquet_files = resp.data.decode().split()
parquet_files
['https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-0.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-1.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-10.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-11.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-12.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-13.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-14.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-2.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-3.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-4.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-5.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-6.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-7.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-8.parquet',
'https://github.com/pypi-data/data/releases/download/2023-11-27-03-06/index-9.parquet']
Grab the data
Now we’re ready to get started with Ibis!
DuckDB is clever enough to grab only the parquet metadata. This means we can use read_parquet
to create a lazy view of the parquet files and then build up our expression without downloading everything beforehand!
import ibis
from ibis import _

ibis.options.interactive = True

See https://ibis-project.org/how-to/analytics/chain_expressions.html for docs on the deferred operator!
Create a DuckDB connection:
con = ibis.duckdb.connect()
And load up one of the files (we can run the full query after)!
pypi = con.read_parquet(parquet_files[0], table_name="pypi")
pypi.schema()
ibis.Schema {
project_name string
project_version string
project_release string
uploaded_on timestamp
path string
archive_path string
size uint64
hash binary
skip_reason string
lines uint64
repository uint32
}
Query crafting
Let’s break down what we’re looking for. As a high-level view of the use of compiled languages, Seth is using file extensions as an indicator that a given filetype is used in a Python project.
The dataset we’re using has every file in every project – what criteria should we use?
We can follow Seth’s lead and look for:
A file extension that is one of: asm, cc, cpp, cxx, h, hpp, rs, go, and variants of F90, f90, etc. That is, C, C++, Assembly, Rust, Go, and Fortran.
We exclude matches where the file path is within the site-packages/ directory.
We exclude matches that are in directories used for testing.
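Before building the Ibis expression, we can sanity-check the regular expressions that the filter relies on using Python’s re module. This quick sketch uses made-up paths, but the patterns are the same ones used in the query below:

```python
import re

# Extension pattern: C, C++, Assembly, Rust, Go, and Fortran source files
EXT_RE = re.compile(r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$")
# "test", "tests", or "testing" path components
TEST_RE = re.compile(r"(^|/)test(|s|ing)")

paths = [
    "pkg/src/fast.c",             # C source: keep
    "pkg/src/solver.F90",         # Fortran: keep
    "pkg/tests/helper.c",         # test directory: drop
    "venv/lib/site-packages/x.c", # vendored: drop
]

keep = [
    p
    for p in paths
    if EXT_RE.search(p)
    and not TEST_RE.search(p)
    and "/site-packages/" not in p
]
print(keep)
```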
expr = pypi.filter(
    [
        _.path.re_search(r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"),
        ~_.path.re_search(r"(^|/)test(|s|ing)"),
        ~_.path.contains("/site-packages/"),
    ]
)
expr
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━┓
┃ project_name ┃ project_version ┃ project_release ┃ uploaded_on ┃ path ┃ archive_path ┃ size ┃ hash ┃ skip_reason ┃ lines ┃ repository ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━┩
│ string │ string │ string │ timestamp │ string │ string │ uint64 │ binary │ string │ uint64 │ uint32 │
├─────────────────┼─────────────────┼──────────────────────────────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┼────────┼──────────────────────────────────────────────────────────────────────┼─────────────┼────────┼────────────┤
│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/support.c │ 1607 │ b'\xca\x0c\xf2\\R\x83\xefS\x0c\xe4\x0c\x15`\x1fM\x16"\x93\x88\x08' │ ~ │ 66 │ 1 │
│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/stemmer.c │ 5054 │ b'y\xc3A\x12\x17\xd4\xeb\xbb\xcfan\xfd\x80\xbac\x18\xcf\xc0W\x9a' │ ~ │ 230 │ 1 │
│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'\x81s\xa1t\x86}\xf9\xe5\xb5Zt\xcb\xd3\xae\nHfe\x8c\x9d' │ ~ │ 16 │ 1 │
│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 80922 │ b'\xae<\xc7f\x02\xc5{\xc50\xf4\xdc\x8fa\x1a\t..k\xd5\x9d' │ ~ │ 2205 │ 1 │
│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'\x14D\xeb\xb4\x9ac\xab\x14:b\xa4\xba\xa5\x9f\x1f\x06\xce\x0bj\xf2' │ ~ │ 16 │ 1 │
│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 10684 │ b"!\xa259'\x94\xc7.\x16\x0b\x08\x95J\x0e\xef\x86{\x0e\xd6\x8f" │ ~ │ 309 │ 1 │
│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'\x10W.\xcc7\x08=WV\xde\x1bP9\x03w\x03\xa2\x8c\xe7\xec' │ ~ │ 16 │ 1 │
│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 41620 │ b'\x95P\xd6|\x85\x97\xb2H\x14\xa0d<q-iu\xc1\x98h\xbb' │ ~ │ 1097 │ 1 │
│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 313 │ b'N\xf7t\xdd\xcc\xbb8Y\x0b\xbc\xd5No_\x8d\xc7\xf2\x80\x10\xd0' │ ~ │ 16 │ 1 │
│ zopyx.txng3.ext │ 3.3.2 │ zopyx.txng3.ext-3.3.2.tar.gz │ 2010-03-06 16:09:43.735 │ packages/zopyx.txng3.ext/zopyx.txng3.ext-3.3.2.tar.gz/zopyx.txng3.ext-3.3.2/zop… │ zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/stemmer_src/libstemmer_c/src_c/stem_UTF_8… │ 25440 │ b'o\n\x96M+\xb0\xfbV\xaa6<*\xc8\xb0B\x03\x8a\xa9\xc3\x10' │ ~ │ 694 │ 1 │
│ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │ … │
└─────────────────┴─────────────────┴──────────────────────────────┴─────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────┴────────┴──────────────────────────────────────────────────────────────────────┴─────────────┴────────┴────────────┘
That could be right, but we can peek at the filename at the end of the path column to do a quick check:
expr.path.split("/")[-1]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ ArrayIndex(StringSplit(path, '/'), -1) ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string │
├────────────────────────────────────────┤
│ support.c │
│ stemmer.c │
│ stem_UTF_8_turkish.h │
│ stem_UTF_8_turkish.c │
│ stem_UTF_8_swedish.h │
│ stem_UTF_8_swedish.c │
│ stem_UTF_8_spanish.h │
│ stem_UTF_8_spanish.c │
│ stem_UTF_8_russian.h │
│ stem_UTF_8_russian.c │
│ … │
└────────────────────────────────────────┘
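The ArrayIndex(StringSplit(path, '/'), -1) header above is Ibis’s compiled name for “split the path on '/' and take the last element”, which mirrors plain Python string handling (a quick sketch using one of the paths from the table):

```python
# Last path component, equivalent to the Ibis split("/")[-1] expression
path = "zopyx.txng3.ext-3.3.2/zopyx/txng3/ext/support.c"
filename = path.split("/")[-1]
print(filename)
```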
Ok! Next up, we want to group the matches by:
The month that the package / file was published. For this, we can use the truncate method and ask for month as our truncation window.
The file extension of the file used.
expr.group_by(
    month=_.uploaded_on.truncate("M"),
    ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1),
).aggregate()
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ month ┃ ext ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ timestamp │ string │
├─────────────────────┼────────┤
│ 2015-12-01 00:00:00 │ cpp │
│ 2015-11-01 00:00:00 │ h │
│ 2015-12-01 00:00:00 │ f90 │
│ 2015-11-01 00:00:00 │ hpp │
│ 2010-11-01 00:00:00 │ c │
│ 2010-07-01 00:00:00 │ h │
│ 2010-12-01 00:00:00 │ cpp │
│ 2010-03-01 00:00:00 │ h │
│ 2011-08-01 00:00:00 │ cpp │
│ 2010-05-01 00:00:00 │ h │
│ … │ … │
└─────────────────────┴────────┘
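The two grouping keys are easy to emulate in plain Python, which helps confirm what truncate("M") and re_extract are doing. This is a sketch of the semantics, not the SQL that DuckDB actually runs:

```python
import re
from datetime import datetime

uploaded_on = datetime(2015, 12, 6, 16, 9, 43)
path = "pkg-1.0/src/core.cpp"

# truncate("M"): floor the timestamp to the first instant of its month
month = uploaded_on.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

# re_extract(r"\.([a-z0-9]+)$", 1): pull out the lowercase file extension
m = re.search(r"\.([a-z0-9]+)$", path)
ext = m.group(1) if m else ""

print(month, ext)
```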
That looks promising. Now we need to grab the package names that correspond to a given file extension in a given month and deduplicate them. To match Seth’s results, we’ll also sort by the month in descending order:
expr = (
    expr.group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1),
    )
    .aggregate(projects=_.project_name.collect().unique())
    .order_by(_.month.desc())
)
expr
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ month ┃ ext ┃ projects ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ timestamp │ string │ array<string> │
├─────────────────────┼────────┼─────────────────────────────────────────────────────┤
│ 2017-07-01 00:00:00 │ c │ [ 'newrelic' , 'nuclitrack' , ... +262 ] │
│ 2017-07-01 00:00:00 │ asm │ [ 'pwntools' , 'fibers' , ... +6 ] │
│ 2017-07-01 00:00:00 │ rs │ [ 'rust-pypi-example' , 'tokio' , ... +2 ] │
│ 2017-07-01 00:00:00 │ f │ [ 'okada-wrapper' , 'numpy' , ... +6 ] │
│ 2017-07-01 00:00:00 │ cpp │ [ 'pipcudemo' , 'pyDEM' , ... +108 ] │
│ 2017-07-01 00:00:00 │ f90 │ [ 'pySpecData' , 'numpy' , ... +8 ] │
│ 2017-07-01 00:00:00 │ cxx │ [ 'pytetgen' , 'python-libsbml-experimental' , ... +8 ] │
│ 2017-07-01 00:00:00 │ go │ [ 'pre-commit' , 'django-instant' , ... +5 ] │
│ 2017-07-01 00:00:00 │ cc │ [ 'nixio' , 'pogeo' , ... +14 ] │
│ 2017-07-01 00:00:00 │ h │ [ 'numba' , 'p4d' , ... +222 ] │
│ … │ … │ … │
└─────────────────────┴────────┴─────────────────────────────────────────────────────┘
Massage and plot
Let’s continue and see what our results look like.
We’ll do a few things:
Combine all of the C and C++ extensions into a single group by renaming them all.
Count the number of distinct entries in each group
Plot the results!
collapse_names = expr.mutate(
    ext=_.ext.re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
    .re_replace("^f.*$", "Fortran")
    .replace("rs", "Rust")
    .replace("go", "Go")
    .replace("asm", "Assembly")
    .nullif(""),
).dropna("ext")
collapse_names
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ month ┃ ext ┃ projects ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ timestamp │ string │ array<string> │
├─────────────────────┼──────────┼─────────────────────────────────────────────────────┤
│ 2017-07-01 00:00:00 │ C/C++ │ [ 'pipcudemo' , 'pyDEM' , ... +108 ] │
│ 2017-07-01 00:00:00 │ Fortran │ [ 'numpy' , 'pySpecData' , ... +8 ] │
│ 2017-07-01 00:00:00 │ Go │ [ 'pre-commit' , 'ronin' , ... +5 ] │
│ 2017-07-01 00:00:00 │ C/C++ │ [ 'pytetgen' , 'python-libsbml-experimental' , ... +8 ] │
│ 2017-07-01 00:00:00 │ C/C++ │ [ 'newrelic' , 'nuclitrack' , ... +262 ] │
│ 2017-07-01 00:00:00 │ Fortran │ [ 'okada-wrapper' , 'numpy' , ... +6 ] │
│ 2017-07-01 00:00:00 │ Assembly │ [ 'pwntools' , 'xmldirector.plonecore' , ... +6 ] │
│ 2017-07-01 00:00:00 │ Rust │ [ 'rust-pypi-example' , 'tokio' , ... +2 ] │
│ 2017-07-01 00:00:00 │ C/C++ │ [ 'numba' , 'numpythia' , ... +222 ] │
│ 2017-07-01 00:00:00 │ C/C++ │ [ 'pyemd' , 'pogeo' , ... +19 ] │
│ … │ … │ … │
└─────────────────────┴──────────┴─────────────────────────────────────────────────────┘
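The rename chain is just chained string substitution. Here’s roughly the same collapsing in plain Python with re.sub and str.replace; this is a sketch of the semantics, and note that the first pattern is unanchored, which works here only because the extracted extensions are short exact values:

```python
import re

def collapse(ext):
    # Collapse all C/C++ extensions into one label
    ext = re.sub(r"cxx|cpp|cc|c|hpp|h", "C/C++", ext)
    # Anything still starting with "f" is a Fortran variant
    ext = re.sub(r"^f.*$", "Fortran", ext)
    ext = ext.replace("rs", "Rust").replace("go", "Go").replace("asm", "Assembly")
    # nullif(""): extensions that didn't match the lowercase extract regex
    return ext or None

print([collapse(e) for e in ["cpp", "f90", "rs", "go", "asm", ""]])
```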
Note that now we need to de-duplicate again, since we might’ve had separate unique entries for both an h and a c file extension, and we don’t want to double-count!
We could rewrite our original query and include the renames in the original group_by (this would be the smart thing to do), but let’s push on and see if we can make this work.
The projects column is now a column of string arrays, so we want to collect all of the arrays in each group. This will give us a “list of lists”; then we’ll flatten that list and call unique().length() as before.
DuckDB has a flatten function, but it isn’t exposed in Ibis (yet!). We’ll use a handy bit of Ibis magic to define a builtin UDF that maps directly onto the underlying DuckDB function (what!? See here for more info):
@ibis.udf.scalar.builtin
def flatten(x: list[list[str]]) -> list[str]:
    ...
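The builtin UDF has no Python body on purpose: Ibis only uses the name and signature to emit a call to DuckDB’s own flatten. Semantically it’s one level of list flattening, equivalent to this plain-Python sketch:

```python
def flatten_py(lists):
    # One level only: a list of lists becomes a single list, order preserved,
    # duplicates kept (de-duplication happens later with unique())
    return [item for sub in lists for item in sub]

print(flatten_py([["numpy", "scipy"], ["numpy", "cython"]]))
```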
collapse_names = collapse_names.group_by(["month", "ext"]).aggregate(
    projects=flatten(_.projects.collect())
)
collapse_names
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ month ┃ ext ┃ projects ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ timestamp │ string │ array<string> │
├─────────────────────┼──────────┼─────────────────────────────────────────────┤
│ 2009-07-01 00:00:00 │ C/C++ │ [ 'gevent' , 'hashlib' , ... +52 ] │
│ 2009-07-01 00:00:00 │ Assembly │ [ 'pycryptopp' ] │
│ 2008-10-01 00:00:00 │ Fortran │ [ 'numscons' ] │
│ 2008-08-01 00:00:00 │ Fortran │ [ 'numscons' ] │
│ 2008-06-01 00:00:00 │ C/C++ │ [ 'dm.incrementalsearch' , 'Cython' , ... +45 ] │
│ 2008-05-01 00:00:00 │ Fortran │ [ 'numscons' ] │
│ 2007-03-01 00:00:00 │ Fortran │ [ 'Model-Builder' ] │
│ 2005-05-01 00:00:00 │ C/C++ │ [ 'll-xist' , 'll-xist' ] │
│ 2005-03-01 00:00:00 │ C/C++ │ [ 'pygenx' , 'pygenx' ] │
│ 2011-08-01 00:00:00 │ Fortran │ [ 'pysces' , 'ffnet' ] │
│ … │ … │ … │
└─────────────────────┴──────────┴─────────────────────────────────────────────┘
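The remaining unique().length() step counts distinct project names in each flattened list. In Python terms it’s just len(set(...)), which is why double-counting a project that matched on both an h and a c file is no longer a worry (a sketch):

```python
# A flattened per-group project list, possibly with repeats
projects = ["gevent", "hashlib", "gevent"]
project_count = len(set(projects))  # distinct names only
print(project_count)
```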
We could have included the unique().length() in the aggregate call, but sometimes it’s good to check that your slightly off-kilter idea has worked (and it has!).
collapse_names = collapse_names.select(
    _.month, _.ext, project_count=_.projects.unique().length()
)
collapse_names
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ month ┃ ext ┃ project_count ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ timestamp │ string │ int64 │
├─────────────────────┼─────────┼───────────────┤
│ 2007-03-01 00:00:00 │ C/C++ │ 6 │
│ 2006-01-01 00:00:00 │ C/C++ │ 5 │
│ 2005-10-01 00:00:00 │ C/C++ │ 2 │
│ 2011-08-01 00:00:00 │ C/C++ │ 57 │
│ 2011-03-01 00:00:00 │ C/C++ │ 63 │
│ 2011-01-01 00:00:00 │ Fortran │ 2 │
│ 2010-12-01 00:00:00 │ C/C++ │ 48 │
│ 2010-08-01 00:00:00 │ Fortran │ 1 │
│ 2010-07-01 00:00:00 │ Fortran │ 3 │
│ 2010-03-01 00:00:00 │ Fortran │ 1 │
│ … │ … │ … │
└─────────────────────┴─────────┴───────────────┘
Now that the data are tidied, we can pass our expression directly to Altair and see what it looks like!
import altair as alt

chart = (
    alt.Chart(collapse_names.to_pandas())
    .mark_line()
    .encode(x="month", y="project_count", color="ext")
    .properties(width=600, height=300)
)
chart
That looks good, but it definitely doesn’t match the plot from Seth’s post:
Our current plot is only showing the results from a subset of the available data. Now that our expression is complete, we can re-run on the full dataset and compare.
The full run
To recap: we pulled a lazy view of a single parquet file from the pypi-data repo, filtered for all the files that contain file extensions we care about, then grouped them all together to get counts of the various filetypes used across projects by month.
Here’s the entire query chained together into a single command, now running on all of the parquet files we have access to:
pypi = con.read_parquet(parquet_files, table_name="pypi")

full_query = (
    pypi.filter(
        [
            _.path.re_search(
                r"\.(asm|c|cc|cpp|cxx|h|hpp|rs|[Ff][0-9]{0,2}(?:or)?|go)$"
            ),
            ~_.path.re_search(r"(^|/)test(|s|ing)"),
            ~_.path.contains("/site-packages/"),
        ]
    )
    .group_by(
        month=_.uploaded_on.truncate("M"),
        ext=_.path.re_extract(r"\.([a-z0-9]+)$", 1),
    )
    .aggregate(projects=_.project_name.collect().unique())
    .order_by(_.month.desc())
    .mutate(
        ext=_.ext.re_replace(r"cxx|cpp|cc|c|hpp|h", "C/C++")
        .re_replace("^f.*$", "Fortran")
        .replace("rs", "Rust")
        .replace("go", "Go")
        .replace("asm", "Assembly")
        .nullif(""),
    )
    .dropna("ext")
    .group_by(["month", "ext"])
    .aggregate(project_count=flatten(_.projects.collect()).unique().length())
)
chart = (
    alt.Chart(full_query.to_pandas())
    .mark_line()
    .encode(x="month", y="project_count", color="ext")
    .properties(width=600, height=300)
)
chart