[Remote] Data Engineer (Healthcare)
Note: This is a remote position open to candidates in the USA.

Prime Health Technologies is redefining healthcare with its AI-driven Precision Health Operating System, aimed at improving population health outcomes. The Data Engineer will design and operate the platform's data infrastructure, ensuring reliable data flows and compliance with regulations. This hands-on role requires building data pipelines and supporting MLOps as the platform matures.

Responsibilities
- Build & maintain reliable batch and (where appropriate) streaming pipelines for clinical, operational, product, and third-party data sources, including healthcare & consumer-health integrations: HL7/FHIR, REST APIs, Apple HealthKit, Android Health Connect, and governed adapters for external clinical or wellness sources
- Design data models, transformations, and storage patterns supporting analytics, reporting, AI workloads, and product features, with reproducibility as a first-class requirement (any curated dataset must rebuild deterministically from raw inputs & transformation code)
- Design & operate core stores in the in-country PHI data plane (operational database, time-series store, object storage, audit logs) with encryption, access control, and lifecycle management
- Build curated, de-identified-by-default analytics datasets powering operational, regulatory, and client dashboards
- Implement & maintain PHI/PII de-identification & tokenization pipelines; support tightly controlled re-identification workflows when explicitly authorized
- Establish data quality, integrity, and observability controls (validation, reconciliation, idempotency, late-arriving data handling, lineage, monitoring, alerting) and publish quality metrics
- Deliver a discoverable metadata layer so teams can self-serve and trust datasets
- Support sovereign / regional data-residency models, keeping PHI within an approved deployment boundary while enabling derived & aggregate views in out-of-country planes
- Own pipeline observability (logging, metrics, tracing, alerting, cost & performance tuning) across the stack
- Contribute to CI/CD for data components & participate in incident response and postmortems
- Partner with engineering, product, clinical, and business stakeholders, translating data needs into scalable technical solutions
- Training-job orchestration & reproducible dataset versioning
- Model registry & artifact storage
- Containerized model serving, routing, and shadow-deployment infrastructure
- Inference logging back into the warehouse for downstream evaluation
- CI/CD for model artifacts (schema validation, contract tests, automated rollouts)

Skills
- 7+ years in data engineering or backend engineering with significant data-pipeline ownership; substantial seniority is expected given the regulated, national-scale, and sovereign-deployment context. Prior work in healthcare, wellness, insurance, or other regulated domains
- Strong SQL & Python; proven track record building reliable ETL/ELT pipelines in production
- Experience with modern storage patterns: operational databases, data lakes / object storage, and analytics warehouses or lakehouses
- Hands-on experience with orchestration tools (Airflow, Dagster, Prefect, or equivalent) and transformation frameworks (dbt or equivalent)
- Demonstrated discipline around data contracts, schema evolution, and reproducible pipelines (deterministic rebuilds from raw + code)
- Experience working with sensitive data (PII/PHI), implementing least-privilege access patterns, audit logging, and consent-aware data access
- Familiarity with data classification, retention, deletion, and auditability requirements for sensitive data
- Experience with data quality & observability practices: validation/testing, lineage/metadata, monitoring/alerting, incident response
- Clear written & verbal communication; able to produce data documentation, runbooks, and pragmatic design proposals
- Experience supporting audits & control evidence in ISO 27001-aligned environments; familiarity with ISO 42001 AI governance expectations & privacy regimes such as HIPAA & GDPR
- Exposure to HL7/FHIR or common clinical code sets (ICD, SNOMED) and the realities of integrating heterogeneous health datasets
- Prior work in data-residency / sovereign-cloud environments with split-plane architectures (in-country PHI plane plus out-of-country derived/aggregate views)
- Experience with time-series databases & high-volume sensor / wearable data pipelines
- Growth direction (MLOps): experience extending data platforms with ML infrastructure (training orchestration, model registries, feature-pipeline runtimes with batch-to-online parity, containerized serving, inference logging). Candidates who have collaborated closely with data science teams and have intuition for what makes good scaffolding for that workflow are particularly valuable, since this role will grow into MLOps as the platform matures toward pilots

Company Overview
We are building the intelligence layer for proactive health at scale, turning biometric, behavioral, diagnostic, and lifestyle data into personalized daily guidance that helps people improve how they live, age, and perform. The company is headquartered in Tampa, Florida, US, with a workforce of 2-10 employees. Its website is https://primehealthtechnologies.com/.