Machine Learning Systems Engineer - Infrastructure & Cloud

Basis, New York Office
On-site, Full-time

Experience Level

Mid to Senior

Qualifications

Expected Qualifications:

- Proven expertise in ML systems engineering, including:
  - Managing distributed training jobs across extensive GPU clusters
  - Debugging and resolving numerical instabilities in large-scale training
  - Building and maintaining robust ML infrastructure
- Strong cloud infrastructure management skills
- Ability to optimize computational resources for cost efficiency
- Commitment to maintaining high documentation standards

About the job

About Basis

Basis is a pioneering nonprofit organization dedicated to applied AI research, driven by two key objectives.

The first objective is to comprehend and develop intelligence. This encompasses establishing the mathematical foundations of reasoning, learning, decision-making, understanding, and explanation, while also creating software that embodies these principles.

The second objective is to enhance society’s capacity to tackle complex challenges. This involves broadening the scope, scale, and complexity of problems we can address today, and crucially, accelerating our capacity to solve future problems.

To fulfill these missions, we are constructing an innovative technological infrastructure inspired by human reasoning, along with a collaborative organization that prioritizes human values.

About the Role

As an ML Systems Engineer at Basis, you will ensure that our training and evaluation infrastructure is fast, reliable, and scalable. You will manage the entire stack, from distributed training frameworks to cloud administration, enabling researchers to rapidly iterate on complex models while efficiently managing computational resources.

We are seeking engineers who possess a profound understanding of ML systems paired with operational excellence. The ideal candidate will have experience in distributed training at scale, expertise in debugging numerical instabilities, and the ability to manage cloud infrastructure that seamlessly transitions from experimentation to production. You will be the steward of training stability, an optimizer of computational costs, and a facilitator of reproducible research.

This position encompasses both traditional ML engineering and cloud/DevOps responsibilities. You will oversee GPU clusters, optimize cloud expenditures, ensure security and compliance, and build the infrastructure that allows researchers to focus on algorithms rather than operations.

We are looking for individuals who are committed to building robust ML infrastructure, documenting issues and their solutions, and treating operational excellence as a core value.
