Kernel Engineer At Magic Dev San Francisco jobs in San Francisco – Browse 11,145 openings on RoboApply Jobs

Kernel Engineer At Magic Dev San Francisco jobs in San Francisco

Open roles matching “Kernel Engineer At Magic Dev San Francisco” in and around San Francisco. 11,145 active listings on RoboApply Jobs.


1 - 20 of 11,145 Jobs
Apply
Magic
Full-time|On-site|San Francisco

At Magic, we are on a mission to develop safe AGI that propels humanity's progress in addressing the world's most significant challenges. We believe that automating research and code generation is the most promising pathway to safe AGI, enabling us to improve models and address alignment issues more reliably than humans can alone. Our approach integrates frontier-scale pre-training, domain-specific reinforcement learning, ultra-long context, and inference-time compute.

About the Role
As a Software Engineer at Magic, you will develop core systems and product surfaces that directly influence model capabilities and the user experience. This position can align with areas such as Pre-training Data, RL Research & Environments, or Product Development, depending on your background and expertise. Regardless of placement, you will be expected to take full ownership of your work: identifying problems, crafting solutions, deploying to production, and iterating on real-world results. Working with Magic's long-context models presents unique technical challenges, including large-scale data acquisition, long-horizon post-training loops, and product workflows that make complex model behaviors understandable and manageable. You will work closely within these constraints, creating systems that are both technically sound and production-ready. The role can evolve into a deeper specialization in data systems, post-training capability work, or product engineering leadership, based on your strengths and interests.

What You'll Work On
Depending on your team assignment, your tasks may include:
- Developing and scaling large distributed data pipelines for pre-training
- Designing filtering, mixture, and dataset versioning systems
- Creating post-training datasets, evaluation frameworks, and reward pipelines
- Conducting ablations that translate capability goals into quantifiable improvements
- Building product interfaces that integrate seamlessly with the model
- Designing APIs, backend services, and frontend workflows for AI-first experiences
- Enhancing the reliability, observability, and performance of production systems

What We're Looking For
- Solid foundation in software engineering principles
- High ownership and comfort navigating ambiguous problem domains
- Proven experience building scalable production systems
- Ability to reason through complex technical challenges

Feb 28, 2026
Apply
Kernel
Full-time|On-site|San Francisco

Join Our Team at Kernel
At Kernel, we are revolutionizing the way developers interact with the digital world through our platform, offering Lightning-Fast Browsers-as-a-Service for seamless browser automation and advanced web agents. Our API and MCP server empower developers to effortlessly launch browsers in the cloud, eliminating the complexities of infrastructure management.

Our serverless browser platform takes the hassle out of autoscaling, reliability, and observability, allowing developers to concentrate on their agents' functionality rather than the underlying processes. Kernel transforms AI into a practical and impactful tool, enabling developers to deploy agents that can genuinely engage with online environments.

Trusted by industry leaders such as Cash App and Rye for applications ranging from comprehensive research to QA automation and real-time web analysis, we have raised $22M from prominent investors including Accel, YCombinator, and others.

With just one line of code, any web agent can be deployed to our cloud; what happens next is up to you. If you are passionate about creating essential infrastructure for the future of AI applications, we would love to connect.

Dec 4, 2025
Apply
Kernel
Full-time|On-site|San Francisco

About Kernel
Kernel is a developer platform that delivers Lightning-Fast Browsers-as-a-Service for browser automation and web agent deployment. Our API and MCP server empower developers to effortlessly launch cloud-based browsers without the hassle of infrastructure management.

Our serverless browser solution takes care of the complexities: autoscaling, dependable browser infrastructure, observability, and intricate web interactions, allowing developers to concentrate on their agents' functionality rather than the underlying technology. Kernel brings AI to life, enabling developers to create agents that genuinely engage with the digital landscape.

Our platform is trusted by teams at Cash App, Rye, and many others for tasks including in-depth research, QA automation, and real-time web analysis. We recently secured $22M in funding from investors such as Accel, YCombinator, Vercel, Paul Graham, Solomon Hykes (Docker), David Cramer (Sentry), and Charlie Marsh (Astral).

With just a single line of code, you can deploy any web agent to our cloud infrastructure. If you are passionate about developing essential infrastructure for the future of AI applications, we would love to connect with you.

Dec 4, 2025
Apply
Kernel
Full-time|On-site|San Francisco

About Kernel
Kernel is a developer platform that offers Lightning-Fast Browsers-as-a-Service tailored for browser automation and web agent creation. Our API and MCP server enable developers to seamlessly launch browsers in the cloud without the hassle of infrastructure management.

Our serverless browser platform takes care of the complex tasks: autoscaling reliable browser infrastructure, ensuring observability, and managing the intricate details of web interactions, allowing developers to concentrate on agent functionality rather than the underlying processes. Kernel brings AI to life, making it practical and powerful, and empowering developers to deploy agents that can effectively engage with the digital landscape.

We are trusted by teams at Cash App, Rye, and numerous others for diverse applications like in-depth research, QA automation, and real-time web analysis. We have secured $22M in funding from investors including Accel, YCombinator, Vercel, Paul Graham, Solomon Hykes (Docker), David Cramer (Sentry), and Charlie Marsh (Astral), among others.

With just one line of code, you can deploy any web agent to our cloud. The rest is in your hands. If you're passionate about developing critical infrastructure for the next generation of AI applications, we would love to connect.

Dec 4, 2025
Apply
Kernel
Full-time|On-site|San Francisco

About Kernel
Kernel is a developer platform offering Lightning-Fast Browsers-as-a-Service for browser automation and web agents. Our API and MCP server empower developers to effortlessly launch browsers in the cloud without the hassle of managing infrastructure. Our serverless browser platform takes care of the complex aspects: autoscaling reliable browser infrastructure, observability, and intricate web interactions, letting developers concentrate on the functionality of their agents rather than the underlying details. Kernel makes AI tangible, practical, and powerful, allowing developers to deploy agents capable of genuine interaction with the digital landscape. We are trusted by teams at Cash App, Rye, and numerous others for deep research, QA automation, and real-time web analysis, and we have raised $22M from top investors including Accel, YCombinator, Vercel, Paul Graham, Solomon Hykes (Docker), David Cramer (Sentry), Charlie Marsh (Astral), and more. With just one line of code, you can deploy any web agent to our cloud. The rest is in your hands. If you are passionate about building essential infrastructure for the next wave of AI applications, we would love to hear from you.

About the Role
As a Product Engineer at Kernel, you will be a full-stack engineer who values product thinking as much as coding. You can translate strong product instincts into code, from pixel-perfect UI decisions to backend API architecture, and you proactively contribute to the specification process rather than waiting for one to be provided. You will collaborate closely with our co-founders to define product direction, deliver full-stack features end to end, and ensure that Kernel stays polished yet powerful.

Your Responsibilities
- Lead the full-stack implementation of user-facing product surfaces: dashboard, onboarding, website, and core product functionality.
- Influence the product roadmap by integrating customer feedback, analyzing usage patterns, and leveraging your own insight into developer needs.
- Enhance the developer experience across our SDK, documentation, CLI, and API, delivering the kind of seamless experience that makes developers say "this just works."
- Rapidly prototype and iterate, bringing features from concept to production with minimal oversight.
- Help shape the standards for building a superior developer product at Kernel.

Your Qualifications
- Comfort taking ownership of features from frontend to backend, with a holistic understanding of product development.
- A strong passion for seamless user experiences and an ability to translate product vision into working code.
- Experience in a fast-paced environment with a focus on agile methodologies.

Feb 27, 2026
Apply
Magic
Full-time|On-site|San Francisco

At Magic, our goal is to develop safe AGI that propels humanity forward by addressing some of the most pressing challenges we face. We are committed to harnessing automated research and code generation to improve models and alignment in ways that surpass what humans can achieve alone. Our methodology integrates frontier-scale pre-training, domain-specific reinforcement learning, ultra-long context, and inference-time compute.

Role Overview
As a Kernel Engineer, you will design, implement, and maintain high-performance kernels, optimizing throughput and minimizing latency in both training and inference. Magic's extended context windows present unique kernel optimization challenges, particularly around memory efficiency, data movement, and sustained throughput.

Key Responsibilities
- Design and develop kernels that deliver high-performance long-context functionality.
- Own kernel design, implementation, deployment, and production reliability.
- Emphasize robustness, thorough testing, and functional accuracy while striving for optimal performance.
- Assess the feasibility of porting Magic's compute kernels to other hardware platforms.
- Collaborate with the training, inference, and reinforcement learning teams to co-design kernels.
- Explore our work through Magic-Attention, presented at GTC 2026.

Qualifications
- Experience in low-level programming for AI accelerators such as NVIDIA Blackwell or Google TPUs.
- Proficiency developing and optimizing GPU kernels with frameworks and libraries such as NCCL, MSCCLPP, CUTLASS, CuTeDSL, Triton, Quack, and Flash Attention.
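The memory-efficiency challenge mentioned above comes from the fact that naive attention materializes a full (seq_q, seq_k) score matrix, which is prohibitive at long context lengths. Kernels in the Flash Attention family instead stream over key/value chunks with an online softmax. A minimal NumPy sketch of that idea (purely illustrative; this is not Magic's implementation, and the shapes and chunk size are arbitrary):

```python
import numpy as np

def chunked_attention(q, k, v, chunk=64):
    """Attention computed over K/V chunks with an online softmax, so the
    full (seq_q, seq_k) score matrix is never materialized at once."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = np.full(q.shape[0], -np.inf)   # running row-wise max of scores
    l = np.zeros(q.shape[0])           # running softmax denominator
    acc = np.zeros_like(q)             # running weighted sum of V rows
    for start in range(0, k.shape[0], chunk):
        s = (q @ k[start:start + chunk].T) * scale   # scores for this chunk only
        m_new = np.maximum(m, s.max(axis=-1))
        correction = np.exp(m - m_new)               # rescale previously accumulated state
        p = np.exp(s - m_new[:, None])
        l = l * correction + p.sum(axis=-1)
        acc = acc * correction[:, None] + p @ v[start:start + chunk]
        m = m_new
    return acc / l[:, None]

def naive_attention(q, k, v):
    """Reference implementation that builds the full score matrix."""
    s = (q @ k.T) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 32)) for _ in range(3))
assert np.allclose(chunked_attention(q, k, v), naive_attention(q, k, v))
```

The peak intermediate storage drops from O(seq_q · seq_k) to O(seq_q · chunk), which is the property that makes ultra-long context feasible on fixed-size fast memory.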

Jan 24, 2024
Apply
Thinking Machines Lab
Full-time|$350K/yr - $475K/yr|On-site|San Francisco

At Thinking Machines Lab, our ambition is to enhance human potential by advancing collaborative general intelligence. We envision a future where individuals have the tools and knowledge to harness AI for their distinct requirements and aspirations. Our team comprises dedicated scientists, engineers, and innovators who have contributed to some of the most renowned AI products, including ChatGPT and Character.ai, open-weight models like Mistral, and influential open-source projects such as PyTorch, OpenAI Gym, Fairseq, and Segment Anything.

About the Role
We are seeking an Infrastructure Research Engineer to architect, optimize, and sustain the computational frameworks behind large-scale language model training. You will create high-performance machine learning kernels (e.g., CUDA, CuTe, Triton), enable efficient low-precision arithmetic, and enhance the distributed computing infrastructure essential for training large models. This position is ideal for an engineer who thrives in close collaboration with hardware and research disciplines. You will partner with researchers and systems architects to merge algorithmic design with hardware efficiency. Your responsibilities will include prototyping new kernel implementations, evaluating performance across hardware generations, and helping establish the numerical and parallelism strategies crucial to scaling next-generation AI systems.

Note: This is an evergreen role that remains open continuously for expressions of interest. We receive numerous applications, and there may not always be an immediate opportunity that aligns with your qualifications. However, we encourage you to apply, as we regularly assess applications and will reach out as new positions become available. You are welcome to reapply after gaining additional experience, but please refrain from applying more than once every six months. You may also notice postings for specific roles catering to particular projects or team needs; in such cases, you are encouraged to apply directly alongside this evergreen listing.

What You'll Do
- Design and develop custom ML kernels (e.g., CUDA, CuTe, Triton) for key LLM operations such as attention, matrix multiplication, gating, and normalization, optimized for contemporary GPU and accelerator architectures.
- Conceptualize compute primitives that alleviate memory bandwidth bottlenecks and improve kernel compute efficiency.
- Collaborate with research teams to align kernel-level optimizations with model architecture and algorithmic objectives.
- Create and maintain a library of reusable kernels and performance benchmarks that underpin internal model training.
- Contribute to the stability and scalability of our infrastructure, ensuring it meets the growing demands of AI development.

Nov 27, 2025
Apply
Magic Patterns
Full-time|On-site|San Francisco

Hello! I'm Alex, co-founder of Magic Patterns. We're thrilled to announce an opportunity for a Head of Growth to join our team. Our product-led growth strategy is thriving, and we're ready to accelerate our momentum. Currently, we engage on platforms like X, LinkedIn, Reddit, and YouTube, but we envision expanding our outreach significantly.

At Magic Patterns, you will play a pivotal role in transforming the software development landscape. Our platform is already empowering thousands of teams to deploy software more rapidly. Our mission is to help product teams take their ideas from inception to production, which has attracted Fortune 500 clients and fostered a passionate community. However, we believe it's always day one, and your contribution is crucial.

If you're passionate about startups and AI and thrive in a fast-paced environment, we can't wait to collaborate with you!

Oct 29, 2025
Apply
Magical
Full-time|Hybrid|San Francisco

About Magical
Magical is at the forefront of agentic automation, revolutionizing the healthcare landscape with AI technology. Our platform empowers healthcare providers by automating labor-intensive tasks, allowing them to concentrate on what truly matters: patient care. By streamlining processes like claims management, prior authorizations, and eligibility assessments in an industry plagued by administrative hurdles, we're facilitating a transformative shift, one that is both necessary and inevitable.

Our Achievements
We are leading the charge in agentic automation, evidenced by:
- Significant revenue growth as clients expand their usage into new workflows.
- Quick proof-of-concept demonstrations within just 7 days, far exceeding industry norms.
- Reliable, self-healing automation solutions that excel where others falter.
Unlike other AI companies, we deliver dependable solutions that yield tangible outcomes. With $41 million raised from investors including Greylock, Coatue, and Lightspeed, our founder Harpaul Sambhi brings a wealth of expertise, having previously sold his startup to LinkedIn.

About the Role
As a Senior Software Engineer, Product on our Builder Experience team, you will use your full-stack expertise to build features that let teams create, configure, and deploy AI agents seamlessly. You will own the entire product interface, from user-friendly no-code tools for agent setup to dynamic dashboards for real-time monitoring and assessment. This position is crafted for engineers passionate about creating exceptional user experiences, understanding that stellar UX is crucial for making advanced technology accessible. Collaborating closely with customers and our design team, you'll deliver features that enhance agent development, all while maintaining a firm grasp of the underlying systems to create effective abstractions.

This is a hybrid role, requiring you to be in our San Francisco office three days a week.

Oct 13, 2025
Apply
Magical
Full-time|Hybrid|San Francisco

About Magical
Magical is an automation platform that brings advanced AI into the healthcare sector, providing AI agents that deliver tangible results in production environments. Our mission is to create AI-driven "employees" that streamline the tedious, time-intensive workflows that hinder team productivity. We focus on healthcare, a $4 trillion sector entangled in administrative challenges, by automating processes like claims processing, prior authorizations, and eligibility checks, allowing providers to dedicate more time to patient care.

Our Achievements
The move toward agentic automation in healthcare is on the horizon, and we are at the forefront:
- Significant revenue growth as clients expand into new workflows prior to renewal
- Rapid 7-day proof-of-concept implementations that showcase real value, unlike the typical months-long processes in the industry
- Self-healing automations that are reliable and scalable in production, a feat where many competitors struggle
Unlike many AI companies that make grand claims, we deliver dependable solutions with measurable outcomes. Our funding partners include Greylock, Coatue, and Lightspeed, with a total of $41M raised. Our founder, Harpaul Sambhi, previously sold his first company to LinkedIn.

About the Role
As the Engineering Manager for our Autonomous team, you will lead and grow a talented group of engineers shaping the future of AI agent development, continually pushing the limits of AI and backend system capabilities. Your passion for management will shine as you nurture the professional growth of your engineers. You possess the technical expertise to engage in intricate architectural discussions and to translate complex technical hurdles into clear business strategies. In this position, you will be a vital link between our product vision and technical implementation.

This role offers a hybrid work environment, requiring 2 days a week in our San Francisco office.

Mar 6, 2026
Apply
Sciforium
Full-time|On-site|San Francisco

At Sciforium, we are at the forefront of AI infrastructure, building next-generation multimodal AI models and a proprietary high-efficiency serving platform. With substantial funding and direct collaboration with AMD, supported by their engineers, our team is rapidly expanding to develop the complete stack that powers cutting-edge AI models and real-time applications.

About the Role
We are looking for a talented GPU Kernel Engineer who is eager to explore and maximize performance on modern accelerators. In this role, you will design and optimize the custom GPU kernels that drive our large-scale AI systems. You will work across the hardware-software stack, from low-level kernel development to integrating optimized operations into high-level machine learning frameworks for large-scale training and inference. This position is perfect for someone who excels at the intersection of GPU programming, systems engineering, and state-of-the-art AI workloads, and aims to contribute significantly to the efficiency and scalability of our machine learning platform.

Key Responsibilities
- Develop, implement, and enhance custom GPU kernels using C++, PTX, CUDA, ROCm, Triton, and/or JAX Pallas.
- Profile and fine-tune the end-to-end performance of machine learning operations, particularly for large-scale LLM training and inference.
- Integrate low-level GPU kernels into frameworks such as PyTorch, JAX, and our proprietary internal runtimes.
- Build performance models, pinpoint bottlenecks, and deliver kernel-level improvements that significantly boost AI workloads.
- Collaborate with machine learning researchers, distributed systems engineers, and model-serving teams to optimize computational performance across the entire stack.
- Engage closely with hardware vendors (NVIDIA/AMD) and stay current on GPU architecture and compiler/toolchain advancements.
- Contribute to tools, documentation, benchmarking suites, and testing frameworks that ensure correctness and performance reproducibility.

Must-Haves
- 5+ years of industry or research experience in GPU kernel development or high-performance computing.
- Bachelor's, Master's, or PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Mathematics, or a related discipline.
- Strong programming proficiency in C++ and Python, and familiarity with machine learning frameworks.
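The common thread in the kernel work described above is tiling: operating on blocks of the operands that fit in fast memory (shared memory and registers on a GPU) and accumulating partial products per output tile. A CPU-side NumPy sketch of that access pattern (illustrative only; an actual kernel for this role would be written in CUDA, ROCm, or Triton, and the tile size here is arbitrary):

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    """Matrix multiply with the M/N/K loops blocked the way a GPU kernel
    tiles work across thread blocks, accumulating partials per output tile."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):          # tiles of output rows
        for j in range(0, n, tile):      # tiles of output columns
            for p in range(0, k, tile):  # accumulate over the K dimension
                c[i:i + tile, j:j + tile] += (
                    a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
                )
    return c

rng = np.random.default_rng(1)
a, b = rng.standard_normal((96, 64)), rng.standard_normal((64, 80))
assert np.allclose(tiled_matmul(a, b), a @ b)
```

On real hardware the payoff is data reuse: each tile of `a` and `b` is loaded into fast memory once and reused across the whole output tile, which is what turns a bandwidth-bound loop into a compute-bound kernel.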

Dec 6, 2025
Apply
Baseten
Full-time|On-site|San Francisco

ABOUT BASETEN
At Baseten, we empower the world's leading AI firms, such as Cursor, Notion, and OpenEvidence, by delivering mission-critical inference. Our blend of applied AI research, robust infrastructure, and user-friendly developer tools enables AI pioneers to deploy groundbreaking models effectively. With our recent $300M Series E funding round backed by investors like BOND and IVP, we're on a strong growth trajectory. Join our team and contribute to the platform that drives the next generation of AI products.

THE ROLE
We are looking for an experienced Senior GPU Kernel Engineer to join our team at the forefront of AI acceleration. In this role, your programming expertise will directly improve the performance of cutting-edge machine learning models. You'll be responsible for developing highly efficient GPU kernels that optimize computational processes, enabling transformative AI applications. You'll thrive in a fast-paced, intellectually challenging environment where your technical skills are pivotal; your contributions will directly affect production systems that serve millions of users across various platforms. This position offers exceptional opportunities for career advancement for engineers enthusiastic about low-level optimization and impactful systems engineering.

EXAMPLE INITIATIVES
As part of our Model Performance team, you will engage in projects like:
- Baseten Embeddings Inference: the fastest embeddings solution available
- The Baseten Inference Stack
- Model performance optimization

RESPONSIBILITIES
Core Engineering Responsibilities
- Design and develop high-performance GPU kernels for essential machine learning operations, including matrix multiplications and attention mechanisms.
- Collaborate with cross-functional teams to drive performance improvements and implement optimizations.
- Debug and refine kernel code for maximal efficiency and reliability.
- Stay abreast of the latest advancements in GPU technology and machine learning frameworks.

Jul 17, 2025
Apply
Magical
Full-time|On-site|San Francisco

At Magical, we are transforming the way work is accomplished. Our AI platform introduces "AI employees" to the workplace, tackling the monotonous and draining tasks that hinder team efficiency. This empowers organizations to operate more swiftly and effectively, ultimately enhancing outcomes in critical areas such as patient care.

As we spearhead the shift toward agentic work, we are rapidly scaling our product from $0 to $XM ARR in just a few months, and we are seeking innovative engineers to help us reach $XXM ARR. Joining our founding team means you will not only be coding but also influencing the future of work with a small, driven team at the forefront of AI advancements.

Supported by prominent investors, including Greylock, Coatue, and Lightspeed (backers of OpenAI, Anthropic, Huggingface, and Notion), we have a robust runway and a vast market waiting to be explored.

Dec 9, 2025
Apply
Zyphra
Full-time|On-site|San Francisco

Join Zyphra as a Research Engineer specializing in AI performance and kernel optimization. In this role, you will develop and optimize kernels that improve the performance of our systems, working at the forefront of AI technology and collaborating with cross-functional teams to drive innovation and efficiency.

Mar 16, 2026
Apply
Convex
Full-time|$170K/yr|On-site|San Francisco

Join Convex and revolutionize application development!
At Convex, we are dedicated to transforming the way developers construct applications. Our mission is to fundamentally reshape software development on the Internet by enabling developers to create fast, reliable, and dynamic applications without the need for backend teams. Our full-stack app platform is designed around database, compute, and backend abstractions, allowing developers and LLMs to innovate quickly while keeping products scalable and manageable over time.

About Our Team:
Our team comprises engineers who have designed and built some of the most significant backends in the world, handling exabytes of data and millions of transactions per second, while delivering desktop and mobile software to billions of devices. We are a friendly, collaborative, and passionate group that thrives on working together in our San Francisco office.

Your Role:
Convex has a vast product surface area: our dashboard, insights, third-party integrations, project management, billing systems, email services, logging, and streaming, all of which are crucial to our customers' businesses. These products need to be robust, reliable, intuitive, and enjoyable to use. We seek engineers who are passionate about creating excellent product experiences and expanding our offerings, particularly as we move upmarket. If you are an engineer with a keen design sensibility who values quality, is skilled at prioritization, and collaborates directly with customers and business teams, you will likely be an excellent fit. This role is for those who aspire to do impactful work and use product insight to determine the "why" and "what" of our development process, not just the "how".

Your Responsibilities Include:
- Designing, building, and maintaining Convex's platforms, dashboard UI, integrations, billing, and other services.
- Collaborating directly with customers and leadership to define feature development plans.
- Developing a deep understanding of customer needs and business objectives to make informed tradeoffs and prioritize effectively.
- Establishing best practices and reliability standards as our team and systems grow.
- Contributing to a culture of excellence in product development.

Dec 18, 2025
Apply
Hayden AI
Full-time|On-site|San Francisco HQ Office

Hayden AI creates mobile perception systems that help transit agencies and city governments solve transportation challenges. The team's technology supports bus lane and stop enforcement, improves street safety, and helps make transit systems more efficient and sustainable.

This Senior Firmware Engineer role is based at Hayden AI's San Francisco headquarters. The position sits within the Device Software team and focuses on the low-level software stack powering the company's edge AI systems. Work centers on direct interaction with hardware to ensure reliable, high-performance operation in real-world settings.

Responsibilities
- Develop and maintain Linux kernel modules and device drivers for embedded platforms.
- Integrate hardware and software layers for edge AI devices to ensure stable operation.
- Work closely with hardware engineers and other software teams to deliver high-performance solutions.
- Troubleshoot and resolve firmware issues in deployed environments.

Requirements
- Significant experience with Linux kernel and device driver development.
- Strong background in embedded systems and low-level programming.
- Comfort working directly with hardware and debugging complex system interactions.
- Experience with edge AI or similar real-time systems is a plus.

Apr 27, 2026
Apply
Databricks
Full-time|$190.9K/yr - $232.8K/yr|On-site|San Francisco, California

P-1285

About This Role
Join our team at Databricks as a Staff Software Engineer specializing in GenAI Performance and Kernel. In this role, you will design, implement, and optimize the high-performance GPU kernels that drive our GenAI inference stack. You will lead the development of finely-tuned, low-level compute paths, balancing hardware efficiency with versatility, while mentoring fellow engineers in kernel-level performance engineering. Collaborating closely with machine learning researchers, systems engineers, and product teams, you will advance the state of inference performance at scale.

What You Will Do
- Lead the design, implementation, benchmarking, and maintenance of essential compute kernels (such as attention, MLP, softmax, layernorm, and memory management) for diverse hardware backends (GPUs, accelerators).
- Steer the performance roadmap for kernel-level enhancements, focusing on areas like vectorization, tensorization, tiling, fusion, mixed precision, sparsity, quantization, memory reuse, scheduling, and auto-tuning.
- Integrate kernel optimizations seamlessly with higher-level machine learning systems.
- Develop and maintain profiling, instrumentation, and verification tools to catch correctness issues, performance regressions, numerical discrepancies, and hardware utilization inefficiencies.
- Conduct performance investigations and root-cause analyses of inference bottlenecks such as memory bandwidth limits, cache contention, kernel launch overhead, and tensor fragmentation.
- Create coding patterns, abstractions, and frameworks that modularize kernels for reuse, cross-backend compatibility, and maintainability.
- Influence architectural decisions that affect kernel efficiency (including memory layout, dataflow scheduling, and kernel fusion boundaries).
- Guide and mentor engineers focused on low-level performance, conducting code reviews and establishing best practices.
- Collaborate with infrastructure, tooling, and machine learning teams to ship kernel-level optimizations to production and assess their impact.
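Of the techniques on the roadmap above, quantization is the easiest to sketch concretely: mapping float tensors onto a narrow integer range with a scale factor, trading a bounded rounding error for smaller memory footprint and cheaper arithmetic. A minimal NumPy sketch of symmetric per-tensor int8 quantization (illustrative only; production inference stacks typically use per-channel or per-group scales and calibrated ranges):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: map [-amax, amax] onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float tensor."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
x = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
# Round-to-nearest bounds the per-element error by half a quantization step.
assert np.abs(x - x_hat).max() <= scale / 2 + 1e-6
```

The int8 tensor is 4x smaller than the float32 original, and the error bound of half a step is what kernel-level verification tooling would assert against when checking numerical discrepancies.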

Jan 30, 2026
Apply
Magic
Full-time|On-site|San Francisco

At Magic, we are dedicated to creating safe artificial general intelligence (AGI) that propels humanity forward in tackling the most pressing global challenges. We believe the most effective route to safe AGI is automating research and code generation, enabling us to enhance models and resolve alignment issues more reliably than humans can alone. Our methodology combines frontier-scale pre-training, domain-specific reinforcement learning (RL), ultra-long context capabilities, and optimized inference-time computation.

Role Overview
As a Software Engineer on the Pre-training Systems team, you will design and manage the distributed infrastructure for training Magic's long-context models at scale. This position emphasizes large-scale model training on extensive GPU clusters. You will operate at the intersection of deep learning and distributed systems, ensuring that training is efficient, reliable, and reproducible under extreme conditions. Magic's long-context models present complex systems challenges, such as sustained memory pressure, communication overhead across thousands of devices, long-duration jobs requiring fault tolerance, and efficient sequence packing within hardware limits. You will own the systems that keep large-scale pre-training both stable and fast.

Your Contributions
Scale distributed training across large GPU clusters, implementing data, tensor, and pipeline parallelism.
Optimize communication strategies and gradient synchronization.
Enhance checkpointing, fault tolerance, and job recovery mechanisms.
Profile and resolve performance bottlenecks across compute, networking, and storage.
Advance experiment reproducibility and orchestration workflows.
Boost hardware utilization and overall training throughput.
Collaborate with Kernel and Research teams to align model architecture with system capabilities.

Qualifications We Seek
Solid foundation in software engineering and distributed systems.
Experience training large models in multi-node GPU environments.
In-depth understanding of parallelism techniques and performance trade-offs.
Experience debugging cross-layer issues in production ML systems.
Demonstrated ownership mentality and ability to run critical infrastructure.
Proven track record of improving the performance or reliability of large-scale systems.
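The "gradient synchronization" item above usually means an all-reduce over per-worker gradients. The standard bandwidth-optimal algorithm is the ring all-reduce (reduce-scatter followed by all-gather), which real systems delegate to NCCL. The sketch below simulates it in pure Python so the two phases are visible; the function name and indexing scheme are illustrative, not Magic's implementation.

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce: every worker ends with the elementwise sum.

    grads: list of N equal-length gradient vectors, one per worker.
    Each vector is split into N chunks. Reduce-scatter: after N-1 steps,
    worker r owns the fully reduced chunk (r+1) mod N. All-gather: N-1
    more steps circulate the reduced chunks to every worker.
    """
    n, length = len(grads), len(grads[0])
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    bufs = [list(g) for g in grads]

    # Reduce-scatter: in step s, worker r sends chunk (r - s) mod n to r+1.
    for s in range(n - 1):
        # Snapshot payloads first: all sends in a step happen concurrently.
        payloads = [(r, ((r - s) % n)) for r in range(n)]
        data = [bufs[r][bounds[c][0]:bounds[c][1]] for r, c in payloads]
        for (r, c), chunk in zip(payloads, data):
            dst, (lo, _) = (r + 1) % n, bounds[c]
            for i, v in enumerate(chunk):
                bufs[dst][lo + i] += v   # accumulate received partial sums

    # All-gather: in step s, worker r forwards completed chunk (r+1-s) mod n.
    for s in range(n - 1):
        payloads = [(r, ((r + 1 - s) % n)) for r in range(n)]
        data = [bufs[r][bounds[c][0]:bounds[c][1]] for r, c in payloads]
        for (r, c), chunk in zip(payloads, data):
            dst, (lo, hi) = (r + 1) % n, bounds[c]
            bufs[dst][lo:hi] = chunk     # overwrite with the reduced chunk
    return bufs
```

Each worker sends roughly 2·(N-1)/N of one gradient's worth of data in total, independent of N, which is why this pattern scales to large clusters.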

Feb 28, 2026
Apply
company
Full-time|$100K/yr - $550K/yr|Remote|San Francisco

At Magic.dev, we are on a mission to develop safe artificial general intelligence (AGI) that propels humanity forward in tackling the world's most pressing challenges. We believe the key to safe AGI lies in automating research and code generation, enabling us to enhance models and resolve alignment issues more effectively than human effort alone. Our strategy combines frontier-scale pre-training, domain-specific reinforcement learning, ultra-long context, and advanced inference-time computation.

If you are passionate about contributing to this mission and bring energy and creativity, we invite you to explore opportunities with us even if they are not currently listed on our careers page. We welcome exceptional talent with open arms.

Feb 14, 2024
Apply
Databricks
Full-time|$166K/yr - $225K/yr|On-site|San Francisco, California

At Databricks, we are dedicated to empowering data teams to tackle the world's most challenging problems, from detecting security threats to advancing cancer drug development. We do this by offering the premier data and AI platform, allowing our customers to concentrate on their mission-critical challenges. The Mosaic AI organization helps companies develop AI models and systems on their own data, with technologies ranging from training large language models (LLMs) from the ground up to advanced retrieval-augmented generation. We pride ourselves on pushing the boundaries of science and operationalizing our innovations. Mosaic AI believes that a company's AI models hold intrinsic value, akin to any other core intellectual property, and that superior AI models should be accessible to all.

Job Overview
As a research engineer on the Scaling team, you will stay abreast of the latest advancements in deep learning and pioneer new methods that surpass the current state of the art. You will collaborate with a diverse team of researchers and engineers, sharing insights and expertise. Most importantly, you will care deeply about our customers, translating our scientific knowledge into practical applications so they succeed in deploying cutting-edge LLMs and AI systems.

Your Impact
Enhance performance through optimization techniques including kernel fusion, mixed precision, memory layout optimization, tiling strategies, and tensorization tailored to training-specific patterns.
Design, implement, and optimize high-performance GPU kernels for training workloads, including attention mechanisms, custom layers, gradient computations, and activation functions, specifically for NVIDIA architectures.
Create and implement distributed training frameworks for large language models, incorporating parallelism strategies (data, tensor, pipeline, ZeRO-based) and optimized communication patterns for gradient synchronization and collective operations.
Profile, debug, and optimize end-to-end training workflows to pinpoint and resolve performance bottlenecks, using memory optimization techniques such as activation checkpointing, gradient sharding, and mixed precision training.
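Activation checkpointing, named in the last item above, trades compute for memory: the forward pass stores only activations at segment boundaries, and the backward pass recomputes the rest on demand. The toy model below uses a chain of scalar multiplications so the trade-off is visible in a few lines; all names are hypothetical sketch code (in PyTorch this role would use `torch.utils.checkpoint`), and weight gradients are omitted for brevity.

```python
def forward_full(x, weights):
    """Standard forward: store every intermediate activation."""
    acts = [x]
    for w in weights:
        acts.append(acts[-1] * w)   # "layer": f(x) = x * w
    return acts

def backward_full(acts, weights, grad_out):
    """Backprop through the chain; df/dx = w for each layer.

    Real layers would also consume `acts` to form weight gradients.
    """
    g = grad_out
    for _a, w in zip(reversed(acts[:-1]), reversed(weights)):
        g = g * w
    return g

def backward_checkpointed(x, weights, grad_out, segment=2):
    """Keep only one activation per segment; recompute inside backward."""
    ckpts, a = [x], x
    for i, w in enumerate(weights):           # forward: store boundaries only
        a = a * w
        if (i + 1) % segment == 0:
            ckpts.append(a)
    g = grad_out
    for s in reversed(range(0, len(weights), segment)):
        seg_w = weights[s:s + segment]
        seg_acts = forward_full(ckpts[s // segment], seg_w)  # recompute segment
        g = backward_full(seg_acts, seg_w, g)
    return g
```

Memory drops from one stored activation per layer to one per segment, at the cost of (roughly) one extra forward pass; both paths must produce identical gradients, which is exactly the kind of invariant the verification tooling mentioned earlier would check.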

Jan 30, 2026
