[Remote] Senior Cloud Engineer
Note: This is a remote position open to candidates in the USA.

Onyx Visual Effects is a company specializing in visual effects and cloud infrastructure. They are seeking a Senior Cloud Engineer to manage AWS services, optimize cloud resources for VFX workloads, and ensure compliance with security standards while collaborating with global teams.

Responsibilities

- Proficiency in AWS core services, including EC2 for compute, EFS/S3/EBS for storage, VPC networking, Security Groups, NACLs, Route 53, and Direct Connect for low-latency remote access. Includes managing instance failures during long-running renders, handling multi-AZ outages with failover, optimizing for global teams, and integrating with on-premises legacy hardware.
- Specialization in MPA compliance and security-first engineering, including AWS KMS encryption, access logging, Trusted Partner Network assessments, and zero-trust models. Includes adapting to evolving MPA guidelines, managing sensitive IP with external studios, handling data sovereignty requirements, and responding to vulnerabilities in media workflows.
- Experience with AWS VFX solutions like Thinkbox Deadline, Deadline Cloud, Nimble Studio, and EC2 Spot/GPU instances for cost-effective rendering. Includes scaling farms for 8K+ projects, recovering from spot interruptions, troubleshooting custom VFX plugins, and optimizing hybrid CPU/GPU workloads.
- Identity and Access Management with role-based controls, MFA, and integration with directory services. Includes onboarding/offboarding remote users, federated logins from third-party IdPs, managing privilege escalation risks, and auditing access logs for anomalous behavior.
- Cost optimization using AWS Cost Explorer, Savings Plans, Reserved Instances, and auto-scaling groups for variable VFX workloads. Includes forecasting burst render costs, mitigating overspending from misconfigured scaling, and tracking costs across multiple projects.
- Data transfer tools like AWS Snowball and DataSync for asset migrations, plus multi-tier storage strategies such as S3 Intelligent-Tiering. Includes large-scale transfers, partial sync recovery, encryption integrity, and cold storage retrieval planning.
- AWS certifications such as Solutions Architect or SysOps Administrator, with the ability to apply certification knowledge to custom VFX scenarios, real-time collaboration setups, renewals, and edge deployments such as AWS Outposts.
- Expertise in Rocky Linux, Red Hat-based OS, Windows, and macOS command-line and general administration, including cross-platform scripting with Bash and PowerShell. Includes troubleshooting Linux kernel issues, macOS driver conflicts, Windows updates, and mixed-OS fleets.
- Infrastructure as Code with Terraform, AWS CloudFormation, or Ansible for provisioning and automation. Includes idempotent deployments, rolling back failed IaC changes during live productions, version control collaboration, and provider quirks.
- Monitoring and logging with AWS CloudWatch, X-Ray, and integrations like ELK Stack for metrics, alarms, and proactive issue resolution. Includes custom alarms for GPU utilization, tracing distributed render jobs, filtering high-volume logs, and SIEM integration.
- Backup and disaster recovery using AWS Backup, S3 versioning, and multi-region replication. Includes testing restores for corrupted VFX assets, managing RTO/RPO in outages, automating failover drills, and handling version conflicts.
- Networking and security operations, including VPN, firewalls, AWS GuardDuty, and high-performance network-attached storage. Includes mobile artist VPN access, detecting network attacks, optimizing NAS for 4K/8K streaming, and securing third-party integrations.
- Virtual machine management and containerization with Docker, ECS, or Kubernetes for portable VFX applications. Includes bursty simulations, pod evictions during resource contention, GPU passthrough, and network policy debugging.
- Proficiency with core VFX software like Nuke, ZBrush, Maya, V-Ray, Houdini, Redshift, Arnold, RenderMan, and Octane. Includes optimizing for non-standard hardware, troubleshooting batch-mode plugin crashes, integrating emerging AI tools, and handling license server failures.
- Render farm management using AWS Deadline Cloud, PipelineFX Qube, or custom scripts for job distribution and optimization. Includes prioritizing jobs during overlapping deadlines, recovering orphaned tasks, scaling to thousands of nodes, and integrating hybrid cloud/off-cloud farms.
- Pipeline tools including asset management systems such as ShotGrid or ftrack, version control with Perforce or Git, and CI/CD for artist workflows. Includes merging conflicting asset versions, handling large binary files, automating plugin testing, and securing pipelines against IP leaks.
- Performance tuning for GPU/CPU workloads, memory management in simulations, and benchmarking to reduce render times. Includes managing OOM errors in Houdini sims, comparing instance types, and optimizing cost/performance trade-offs.
- Troubleshooting application issues and OS problems, and providing deskside, phone, and ticket support to VFX artists and production teams. Includes remote debugging, VPN-disrupted sessions, vendor escalation, and documenting repeatable fixes.
- Experience with HP Connect Anywhere, PCoIP desktop environments, NICE DCV, and AWS AppStream for low-latency streaming and multi-monitor support. Includes high-DPI displays, transcontinental latency, session security, and VR/AR review workflows.
- NVIDIA CUDA drivers, GRID/AMDGPU management in EC2 instances, and virtual workstations for color-accurate VFX work. Includes driver updates, CUDA version mismatches, color calibration over compressed streams, and experimental AMD setups.
- Secure file sharing via AWS Transfer Family and real-time collaboration tools such as Frame.io integrations. Includes enforcing upload quotas, recovering interrupted transfers, auditing shares, and custom encryption for sensitive dailies.
- WEKA Storage Solutions integration with AWS for high-I/O VFX tasks such as 4K/8K footage. Includes scaling IOPS for parallel artist access, handling filesystem issues, optimizing mixed read/write patterns, and migrating from legacy storage.
- Advanced storage strategies, including lifecycle policies for archiving and handling large media files. Includes tier transitions, retention policies, legal holds, accidental deletion recovery, snapshots, and cost optimization for growing project data.
- Scripting and programming in Python, Bash, or similar for automation, system tasks, and DevOps practices. Includes resilient scripts for flaky APIs, exception handling in long-running automations, VFX-specific libraries, and secure handling of user input.
- Configuration management, deployment tools, and CI/CD pipeline building. Includes managing config drift, zero-downtime deployments, troubleshooting branched pipeline failures, and securing secrets in CI/CD environments.
- Strong problem-solving, critical thinking, and root cause analysis for render failures and remote issues. Includes diagnosing cascading failures, intermittent bugs, post-mortems with non-technical stakeholders, and adapting solutions to evolving tech stacks.
- Excellent communication, teamwork, and ability to consult, train, and build relationships with remote artists, producers, and vendors. Includes bridging time zones, supporting high-stress deadlines, training via screen share, and negotiating SLAs during outages.
- Self-motivated, proactive, and committed to continuous learning, including AWS trends and VFX innovations like AI-assisted rendering. Includes self-teaching during rapid tech shifts, identifying bottlenecks before escalation, and testing beta features in sandboxes.
- Experience in vendor management and shift work flexibility for global remote operations. Includes managing multi-vendor ecosystems, adapting to 24/7 on-call needs, negotiating custom integrations, and handling critical vendor escalations.

Skills

- Proficiency in AWS core services, including EC2 for compute, EFS/S3/EBS for storage, VPC networking, Security Groups, NACLs, Route 53, and Direct Connect for low-latency remote access
- Specialization in MPA compliance and security-first engineering, including AWS KMS encryption, access logging, Trusted Partner Network assessments, and zero-trust models
- Experience with AWS VFX solutions like Thinkbox Deadline, Deadline Cloud, Nimble Studio, and EC2 Spot/GPU instances for cost-effective rendering
- Identity and Access Management with role-based controls, MFA, and integration with directory services
- Cost optimization using AWS Cost Explorer, Savings Plans, Reserved Instances, and auto-scaling groups for variable VFX workloads
- Data transfer tools like AWS Snowball and DataSync for asset migrations, plus multi-tier storage strategies such as S3 Intelligent-Tiering
- AWS certifications such as Solutions Architect or SysOps Administrator
- Expertise in Rocky Linux, Red Hat-based OS, Windows, and macOS command-line and general administration
- Infrastructure as Code with Terraform, AWS CloudFormation, or Ansible for provisioning and automation
- Monitoring and logging with AWS CloudWatch, X-Ray, and integrations like ELK Stack for metrics, alarms, and proactive issue resolution
- Backup and disaster recovery using AWS Backup, S3 versioning, and multi-region replication
- Networking and security operations, including VPN, firewalls, AWS GuardDuty, and high-performance network-attached storage
- Virtual machine management and containerization with Docker, ECS, or Kubernetes for portable VFX applications
- Proficiency with core VFX software like Nuke, ZBrush, Maya, V-Ray, Houdini, Redshift, Arnold, RenderMan, and Octane
- Render farm management using AWS Deadline Cloud, PipelineFX Qube, or custom scripts for job distribution and optimization
- Pipeline tools including asset management systems such as ShotGrid or ftrack, version control with Perforce or Git, and CI/CD for artist workflows
- Performance tuning for GPU/CPU workloads, memory management in simulations, and benchmarking to reduce render times
- Troubleshooting application issues and OS problems, and providing deskside, phone, and ticket support to VFX artists and production teams
- Experience with HP Connect Anywhere, PCoIP desktop environments, NICE DCV, and AWS AppStream for low-latency streaming and multi-monitor support
- NVIDIA CUDA drivers, GRID/AMDGPU management in EC2 instances, and virtual workstations for color-accurate VFX work
- Secure file sharing via AWS Transfer Family and real-time collaboration tools such as Frame.io integrations
- WEKA Storage Solutions integration with AWS for high-I/O VFX tasks such as 4K/8K footage
- Advanced storage strategies, including lifecycle policies for archiving and handling large media files
- Scripting and programming in Python, Bash, or similar for automation, system tasks, and DevOps practices
- Configuration management, deployment tools, and CI/CD pipeline building
- Strong problem-solving, critical thinking, and root cause analysis for render failures and remote issues
- Excellent communication, teamwork, and ability to consult, train, and build relationships with remote artists, producers, and vendors
- Self-motivated, proactive, and committed to continuous learning, including AWS trends and VFX innovations like AI-assisted rendering
- Experience in vendor management and shift work flexibility for global remote operations

Company Overview

Founded in 2021, OnyxVFX is one of the first fully virtual visual effects studios, providing end-to-end content creation entirely in the cloud. This allows Onyx to be creative, efficient, and agile while maintaining full control over data security. The company is headquartered in Sherman Oaks, CA, US, and has a workforce of 11-50 employees. Its website is https://www.onyxvfx.com.