Inside brian2cuda: What I Actually Built in PR #323

June 18, 2026 · Yusuf Abdul-Mateen · ~8 min read

I submitted PR #323 to brian2cuda, a CUDA GPU backend for the Brian2 spiking neural network simulator. The PR isn't merged. The main feature has unresolved feedback from the maintainer. This is the honest version of what I built, what went wrong, and what I learned.

PR: github.com/brian-team/brian2cuda/pull/323

The Project

Brian2 is a Python library for simulating spiking neural networks. You describe neurons as differential equations — dv/dt = (v_rest - v) / tau — and Brian2 compiles them into simulation code. It has multiple backends. brian2cuda is the CUDA one: it takes Brian2's equation analysis and generates CUDA C++ kernels that run on NVIDIA GPUs.
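To make the equation concrete, here is a minimal plain-Python sketch of what integrating dv/dt = (v_rest - v) / tau means numerically. It uses forward Euler with made-up parameter values. The function name simulate_leak is hypothetical, and this is not the code Brian2 or brian2cuda generates.

```python
def simulate_leak(v0, v_rest, tau, dt, steps):
    """Integrate dv/dt = (v_rest - v) / tau with forward Euler."""
    v = v0
    trace = [v]
    for _ in range(steps):
        # One Euler step: v(t + dt) = v(t) + dt * dv/dt
        v += dt * (v_rest - v) / tau
        trace.append(v)
    return trace


# Illustrative values only: start at -50 mV and relax toward -70 mV.
trace = simulate_leak(v0=-50.0, v_rest=-70.0, tau=10.0, dt=0.1, steps=1000)
print(f"final v = {trace[-1]:.3f} mV")
```

A GPU backend runs the same kind of per-neuron update in parallel, one thread per neuron, which is why generated kernels are a natural fit.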

The architecture centers on CUDAStandaloneDevice in device.py, which orchestrates code generation, compilation, and GPU execution. CUDACodeGenerator translates Brian2's abstract code into CUDA C++. Twenty-five Jinja2 templates in templates/ provide the kernel structure. Runtime headers in brianlib/ provide CUDA utilities.

What I Set Out to Do

Two issues looked tractable for a first contribution:

Issue #255 — brian2cuda selected GPUs based on compute capability (newer architecture wins). A better metric is performance: number of multiprocessors × CUDA cores per multiprocessor × clock rate. The CUDA samples have a findCudaDevice function that does exactly this. I needed to port that logic into brian2cuda's device selection.
Issue #295 — The test suite had two entrypoints: the shared brian2cuda.tests.run and standalone scripts in tools/test_suite/ that ran independently. The fix was to make the scripts delegate to the shared runner.

What I Built

GPU Performance Selection (Issue #255)

The existing code in device.py called a function that ranked GPUs by compute capability string. I replaced it with a function that parses deviceQuery output to extract multiprocessor count and clock rate, computes performance = multiprocessors × cuda_cores_per_mp × clock_rate, and selects the highest-performing device. If parsing fails, it falls back to the original compute-capability logic.

def _get_device_performance():
    """Parse deviceQuery output for GPU performance metric."""
    # Run NVIDIA's deviceQuery sample
    # Extract: multiprocessors, cores/MP, clock rate
    # Return: performance = multiprocessors * cores_per_mp * clock_rate
    ...
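
The select-then-fall-back logic described above can be sketched as follows. This is an illustration, not the code from the PR: the dictionary keys, the function names, and the two example devices are all hypothetical. Compute capability is stored as a tuple here because comparing version strings breaks once a major version reaches two digits.

```python
def performance(dev):
    """Score = multiprocessors * CUDA cores per multiprocessor * clock rate."""
    return dev["multiprocessors"] * dev["cores_per_mp"] * dev["clock_mhz"]


def select_device(devices):
    """Return the id of the preferred device.

    If every device has all three metric fields, rank by performance.
    Otherwise (for example, parsing failed for one device) fall back to
    ranking all devices by compute capability.
    """
    if not devices:
        raise ValueError("no CUDA devices found")
    needed = ("multiprocessors", "cores_per_mp", "clock_mhz")
    if all(dev.get(key) is not None for dev in devices for key in needed):
        best = max(devices, key=performance)
    else:
        # Fallback: newer architecture wins, as in the original behaviour.
        best = max(devices, key=lambda dev: dev["compute_capability"])
    return best["id"]


# Hypothetical devices: the older architecture has more raw throughput.
devices = [
    {"id": 0, "compute_capability": (8, 6), "multiprocessors": 14,
     "cores_per_mp": 128, "clock_mhz": 1500},
    {"id": 1, "compute_capability": (7, 0), "multiprocessors": 80,
     "cores_per_mp": 64, "clock_mhz": 1380},
]
print(select_device(devices))
```

With these numbers the performance metric prefers device 1, while ranking by compute capability alone would pick device 0. That difference is the motivation behind the issue.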

Test Suite Refactoring (Issue #295)

This was straightforward. The standalone run_test_suite.py script duplicated import logic that already existed in brian2cuda/tests/__init__.py. I changed it to import and call brian2cuda.tests.run() instead.
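
The shape of that change is roughly the sketch below. Only the import of brian2cuda.tests.run comes from the description above. The main function, its runner parameter, and the assumption that the runner returns a truthy value on success are illustrative, and the real runner's arguments are not shown here.

```python
def main(runner=None):
    """Run the shared test suite and return a process exit code."""
    if runner is None:
        # Imported lazily so this script carries no test logic of its own.
        from brian2cuda.tests import run as runner
    # Assumption: the runner returns a truthy value when all tests pass.
    success = runner()
    return 0 if success else 1


# Demonstration with a stand-in runner, so the sketch runs without brian2cuda.
print(main(runner=lambda: True))
```

Passing the runner in as a parameter is only there to make the sketch testable without a GPU. A real script would call main() with no arguments and hand the result to sys.exit.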

setuptools_scm Fix

During development I found that the setuptools_scm configuration used a fallback version string that is not valid under PEP 440, which caused builds to fail in CI. I changed it to a valid PEP 440 version string. It was a one-line fix, but it unblocked the CI pipeline.
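
A quick way to sanity-check a candidate version string is the canonical-form pattern given in the appendix of PEP 440, reproduced here from memory. The example strings are hypothetical, since this post does not record the original invalid value. Note that the check is stricter than validity: it rejects local versions such as 1.0+abc and spellings that tools would accept after normalization.

```python
import re

# Canonical public version pattern, after the appendix of PEP 440.
CANONICAL = re.compile(
    r"^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?(\.post(0|[1-9][0-9]*))?"
    r"(\.dev(0|[1-9][0-9]*))?$"
)


def is_canonical(version):
    """Return True if version is a canonical PEP 440 public version."""
    return CANONICAL.match(version) is not None


# Hypothetical candidates for a fallback version.
for candidate in ("0.0.0", "1.2.dev3", "unknown"):
    print(candidate, is_canonical(candidate))
```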

Where It Went Wrong

The GPU selection logic had a bug. The maintainer pointed it out:

"Your code (and your tests) assume that the line for the multiprocessors looks like: Multiprocessors, (128) CUDA Cores/MP: 80 multiprocessors but this is not the correct format. On my laptop this looks like: (14) Multiprocessors, ( 64) CUDA Cores/MP: 896 CUDA Cores"

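I had written the parser against a single assumed line layout and never checked it against real deviceQuery output from another machine. In the maintainer's output, each count sits in parentheses before its label, with padding inside the parentheses, so my regex never matched and my tests only confirmed my own wrong assumption.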
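
A more tolerant parser for the format the maintainer quoted might look like the sketch below. The multiprocessor line is taken from the review comment. The clock-rate line is my assumption about the deviceQuery format and was not part of the review. The names are hypothetical, and the sketch handles a single device only.

```python
import re

# Tolerates variable padding inside the parentheses, e.g. "( 64)" or "(128)".
MP_LINE = re.compile(
    r"\(\s*(\d+)\s*\)\s*Multiprocessors,\s*\(\s*(\d+)\s*\)\s*CUDA Cores/MP"
)
# Assumed clock line; anchored on "GPU" so a memory clock line cannot match.
CLOCK_LINE = re.compile(r"GPU\s+(?:Max\s+)?Clock rate:\s*(\d+)\s*MHz")


def parse_device_query(text):
    """Return (multiprocessors, cores_per_mp, clock_mhz), or None if a field is missing."""
    mp = MP_LINE.search(text)
    clock = CLOCK_LINE.search(text)
    if mp is None or clock is None:
        return None  # the caller should fall back to compute capability
    return int(mp.group(1)), int(mp.group(2)), int(clock.group(1))


sample = """\
  (14) Multiprocessors, ( 64) CUDA Cores/MP:     896 CUDA Cores
  GPU Max Clock rate:                            1590 MHz (1.59 GHz)
"""
print(parse_device_query(sample))
```

Returning None instead of raising keeps the fallback path simple, but it does not remove the underlying fragility: any change to the tool's wording still breaks the parser silently.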

They also asked whether deviceQuery was the right tool, since not every user will have the CUDA samples installed. A better approach might use nvidia-smi or the CUDA runtime API directly.
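
One way to avoid parsing tool output entirely is to ask the driver. The sketch below is not what the PR does and not necessarily what the maintainer had in mind: it swaps in the CUDA driver API (rather than the runtime API) and calls it through ctypes, assuming a Linux system where the driver library is named libcuda.so.1. The function name query_devices is hypothetical. As far as I know the API does not report CUDA cores per multiprocessor, so that value would still need a lookup table keyed by compute capability, which is how the CUDA samples handle it.

```python
import ctypes

# Attribute ids from the CUDA driver API's CUdevice_attribute enum.
CU_DEVICE_ATTRIBUTE_CLOCK_RATE = 13            # reported in kHz
CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT = 16


def query_devices(libname="libcuda.so.1"):
    """Return [(ordinal, multiprocessors, clock_khz), ...], or [] without a driver."""
    try:
        cuda = ctypes.CDLL(libname)
    except OSError:
        return []  # no NVIDIA driver library on this machine
    if cuda.cuInit(0) != 0:
        return []
    count = ctypes.c_int(0)
    if cuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return []
    devices = []
    for ordinal in range(count.value):
        dev = ctypes.c_int(0)
        if cuda.cuDeviceGet(ctypes.byref(dev), ordinal) != 0:
            continue
        values = []
        for attr in (CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT,
                     CU_DEVICE_ATTRIBUTE_CLOCK_RATE):
            out = ctypes.c_int(0)
            if cuda.cuDeviceGetAttribute(ctypes.byref(out), attr, dev) != 0:
                break
            values.append(out.value)
        if len(values) == 2:
            devices.append((ordinal, values[0], values[1]))
    return devices


print(query_devices())
```

On a machine without an NVIDIA driver this prints an empty list, which is the same signal the selection code would use to fall back to compute capability.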

I acknowledged the issues and said I'd rework it. That was in March 2026. Three months later, the PR is still open with no follow-up commits.

What I Learned

Test on hardware you don't own

My GPU selection code worked perfectly on my machine. It failed on the maintainer's machine because I only tested one deviceQuery output format. Hardware-dependent code needs testing across different configurations — or a design that doesn't depend on parsing tool output at all.

A stale open PR is a liability, not an asset

An open PR with unresolved reviewer feedback signals abandoned work. It's worse than having no PR at all. If you can't follow through to merge, it's better to close the PR and frame the experience as a learning exercise rather than leave it in limbo.

Small wins are real wins

The test suite refactoring and the setuptools fix work correctly and stand on their own. If I had submitted those as separate, smaller PRs, they likely would have been merged quickly. The lesson: submit small, verifiable changes rather than bundling everything with a risky feature.

Know when to move on

I spent months on this PR, including the proposal writing. At some point, diminishing returns set in. The right call would have been to close the GPU selection work, get the test routing and CI fixes merged, and call that a win. Instead I let the whole thing sit.

What's Next

I'm looking for a new open-source project to contribute to — one with active maintainers, good first issues, and automated CI that gives fast feedback. The experience with brian2cuda taught me what to look for: responsiveness in the review process, a culture of merging small changes, and hardware-independent testing.
