[comment]: # (markdown: { smartypants: true })

Data into Ed-Fi with earthmover + lightbeam
Data Engineering @
### earthmover
* a Python CLI tool
* constructs Ed-Fi JSON from flat files based on a YAML configuration
* supports transformations like join, filter, distinct, group by, and more
* a JSON template is rendered for each row of transformed data
### earthmover configuration
```yaml [|3-4|6-12|14-23|25-37]
version: 2

config:
  output_dir: ./

sources:
  courses:
    file: ./sources/Courses.csv
    header_rows: 1
  schools:
    file: ./sources/Schools.csv
    header_rows: 1

transformations:
  courses:
    source: $sources.courses
    operations:
      - operation: join
        sources:
        - $sources.schools
        join_type: inner
        left_key: school_id
        right_key: school_id
  ...
destinations:
  # a destination for each Ed-Fi resource and descriptor
  schools:
    source: $sources.schools
    template: ./templates/school.jsont
    extension: jsonl
    linearize: True

  courses:
    source: $transformations.courses
    template: ./templates/course.jsont
    extension: jsonl
    linearize: True
```
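
The `join` operation in the `courses` transformation behaves like an inner join of two tables on `school_id`. A minimal standard-library sketch of that behavior, not earthmover's own implementation; the `inner_join` helper and the sample rows are hypothetical:

```python
def inner_join(left_rows, right_rows, left_key, right_key):
    """Inner-join two lists of dict rows; unmatched rows are dropped."""
    # Index the right side by its key so each left row is matched quickly.
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)

    joined = []
    for left in left_rows:
        for right in index.get(left[left_key], []):
            # Left-hand columns win on name clashes (an assumption of this sketch).
            joined.append({**right, **left})
    return joined


# Hypothetical sample data standing in for Courses.csv and Schools.csv.
courses = [
    {"course_code": "ALG1", "school_id": "100"},
    {"course_code": "BIO", "school_id": "999"},  # no matching school
]
schools = [{"school_id": "100", "district_id": "7"}]

joined_courses = inner_join(courses, schools, "school_id", "school_id")
```

Only the course whose `school_id` appears in both inputs survives, which is why the joined rows can supply `district_id` to the course template.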
### earthmover template
```json [|2|18-20]
{
  "courseCode": "{{course_code}}",
  "identificationCodes": [
    {
      "courseIdentificationSystemDescriptor": "uri://ed-fi.org/CourseIdentificationSystemDescriptor#LEA course code",
      "identificationCode": "{{course_code}}"
    }
  ],
  "educationOrganizationReference": {
    "educationOrganizationId": {{district_id}}
  },
  "academicSubjectDescriptor": "uri://ed-fi.org/AcademicSubjectDescriptor#{{academic_subject}}",
  "courseDefinedByDescriptor": "uri://ed-fi.org/CourseDefinedByDescriptor#LEA",
  "courseDescription": "{{course_name}}",
  "courseGPAApplicabilityDescriptor": "uri://ed-fi.org/CourseGPAApplicabilityDescriptor#{{gpa_weight}}",
  "courseTitle": "{{course_name}}",
  "levelCharacteristics": [
    {% if is_ap==1 %}
    { ... }
    {% endif %}
  ],
  "numberOfParts": 1,
  "offeredGradeLevels": [ ... ]
}
```
JSON + Jinja templating language
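
To illustrate "one rendered JSON document per row", here is a rough sketch in which a plain Python function stands in for the Jinja template. It is not how earthmover renders templates; `render_course`, the sample rows, and the `placeholder` entry (standing in for the elided `{ ... }`) are all hypothetical:

```python
import json


def render_course(row):
    """Stand-in for course.jsont: build one payload from one row."""
    payload = {
        "courseCode": row["course_code"],
        "educationOrganizationReference": {
            "educationOrganizationId": int(row["district_id"]),
        },
        "courseTitle": row["course_name"],
        "levelCharacteristics": [],
        "numberOfParts": 1,
    }
    # Mirrors the template's `{% if is_ap==1 %}` branch; the entry is a placeholder.
    if int(row["is_ap"]) == 1:
        payload["levelCharacteristics"].append({"placeholder": "AP"})
    return payload


def to_jsonl(rows):
    """One compact JSON document per row, one row per line (JSONL)."""
    return "\n".join(json.dumps(render_course(row)) for row in rows)


rows = [
    {"course_code": "ALG1", "district_id": "7", "course_name": "Algebra I", "is_ap": "0"},
    {"course_code": "APBIO", "district_id": "7", "course_name": "AP Biology", "is_ap": "1"},
]
jsonl = to_jsonl(rows)
```

Each line of `jsonl` is a complete JSON object, which is the shape lightbeam expects to read.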
### lightbeam
* also a Python CLI tool
* processes JSONL files
* validates JSON based on Ed-Fi API's Swagger docs
* sends JSON to an Ed-Fi API in dependency-order
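
"Dependency-order" means a resource is sent only after the resources it references. A minimal sketch of that idea using Python's `graphlib` (3.9+); the dependency map below is hand-written and simplified for illustration, and this is not lightbeam's own code:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each resource lists the resources it references.
dependencies = {
    "schools": set(),
    "courses": {"schools"},
    "courseOfferings": {"courses", "schools"},
}

# static_order() yields every resource after all of its prerequisites.
send_order = list(TopologicalSorter(dependencies).static_order())
```

With this map, `schools` is sent first and `courseOfferings` last, so references resolve when each payload arrives.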
### lightbeam configuration
```yaml [|1|2-8|7-8|9-15]
data_dir: ./
edfi_api:
  base_url: https://api.schooldistrict.org/v5.3/api
  version: 3
  mode: year_specific
  year: 2021
  client_id: ${EDFI_API_CLIENT_ID}
  client_secret: ${EDFI_API_CLIENT_SECRET}
connection:
  pool_size: 8
  timeout: 60
  num_retries: 10
  backoff_factor: 1.5
  retry_statuses: [429, 500, 502, 503, 504]
  verify_ssl: True
```
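
The `connection` settings suggest a retry loop with growing waits. The sketch below assumes the wait before each retry is `backoff_factor ** attempt` seconds; that formula is an assumption for illustration, and lightbeam's actual schedule may differ:

```python
RETRY_STATUSES = {429, 500, 502, 503, 504}


def retry_delays(num_retries, backoff_factor):
    """Seconds to wait before each retry, growing geometrically (assumed formula)."""
    return [backoff_factor ** attempt for attempt in range(num_retries)]


def should_retry(status_code, attempt, num_retries):
    """Retry only the listed statuses, and only while attempts remain."""
    return status_code in RETRY_STATUSES and attempt < num_retries


delays = retry_delays(num_retries=3, backoff_factor=1.5)
```

Statuses outside `retry_statuses` (a 404, say) are not retried at all under this reading.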
### Putting it all together
```bash
pip install earthmover lightbeam
earthmover run -c path/to/earthmover.yaml
lightbeam validate+send -c path/to/lightbeam.yaml
```
(requires external orchestration: cron, Airflow, Dagster, etc.)
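
An orchestrator task could wrap the two CLI calls in a small script. A minimal sketch, assuming both tools are installed and on `PATH`; `pipeline_commands` and `run_pipeline` are hypothetical helper names:

```python
import subprocess


def pipeline_commands(earthmover_config, lightbeam_config):
    """The two CLI calls shown above, as argument lists."""
    return [
        ["earthmover", "run", "-c", earthmover_config],
        ["lightbeam", "validate+send", "-c", lightbeam_config],
    ]


def run_pipeline(earthmover_config, lightbeam_config):
    """Run each step in order; check=True stops the pipeline on a non-zero exit."""
    for command in pipeline_commands(earthmover_config, lightbeam_config):
        subprocess.run(command, check=True)
```

Stopping on the first failure keeps lightbeam from sending output of a failed earthmover run.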
### Features
* earthmover visualizes data lineage

![media/dag.png](media/dag.png)
### Features
* selectors: process only some descriptors/resources

```bash
earthmover run -c path/to/earthmover.yaml -s courses,student*
lightbeam send -c path/to/lightbeam.yaml -s courses,student*
```
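
A selector such as `courses,student*` can be read as a comma-separated list of glob patterns. The sketch below shows that reading with Python's `fnmatch`; the destination names are hypothetical, and the tools' exact matching rules may differ:

```python
from fnmatch import fnmatchcase


def select(names, selector):
    """Keep names matching any comma-separated pattern; `*` is a wildcard."""
    patterns = [p.strip() for p in selector.split(",") if p.strip()]
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


# Hypothetical destination names.
destinations = ["schools", "courses", "students", "studentSchoolAssociations"]
selected = select(destinations, "courses,student*")
```

Here `schools` is skipped, while both `student`-prefixed names match the wildcard.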
### Features
* use environment variables or command-line parameters (which override env vars)

```bash
earthmover run -c path/to/earthmover.yaml -p '{
  "BASE_DIR":"path/to/base/dir"
}'
lightbeam send -c path/to/lightbeam.yaml -p '{
  "CLIENT_ID":"populated",
  "CLIENT_SECRET":"populatedSecret"
}'
```
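
The precedence rule (command-line parameters override environment variables) can be sketched with `string.Template`, which uses the same `${NAME}` placeholder syntax as the configs. This is an illustration only; `resolve` and the sample values are hypothetical:

```python
from string import Template


def resolve(config_text, env, cli_params):
    """Fill ${NAME} placeholders; command-line parameters override env vars."""
    merged = {**env, **cli_params}  # the later mapping wins on duplicate keys
    return Template(config_text).substitute(merged)


# Hypothetical values; in practice `env` would come from os.environ.
env = {"CLIENT_ID": "from-env", "CLIENT_SECRET": "env-secret"}
cli_params = {"CLIENT_ID": "populated"}
resolved = resolve(
    "client_id: ${CLIENT_ID}\nclient_secret: ${CLIENT_SECRET}", env, cli_params
)
```

`CLIENT_ID` takes the command-line value, while `CLIENT_SECRET` falls back to the environment.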
### Features
* earthmover source data quality expectations

```yaml
# earthmover
sources:
  schools:
    file: ./sources/Schools.csv
    header_rows: 1
    expect:
      - low_grade != ''
      - high_grade != ''
      - low_grade|int <= high_grade|int
```
(run fails if expectations not met)
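
The expectations above amount to per-row checks that abort the run on the first failure. A minimal sketch with Python predicates standing in for the YAML expressions; `check_expectations` and the sample rows are hypothetical, not earthmover internals:

```python
def check_expectations(rows, expectations):
    """Raise ValueError on the first row that fails any expectation."""
    for line_no, row in enumerate(rows, start=1):
        for label, predicate in expectations.items():
            if not predicate(row):
                raise ValueError(f"row {line_no} failed expectation: {label}")


# Python predicates standing in for the expressions in the YAML above.
expectations = {
    "low_grade != ''": lambda r: r["low_grade"] != "",
    "high_grade != ''": lambda r: r["high_grade"] != "",
    "low_grade|int <= high_grade|int": lambda r: int(r["low_grade"]) <= int(r["high_grade"]),
}

good = [{"low_grade": "9", "high_grade": "12"}]
bad = [{"low_grade": "9", "high_grade": "6"}]
check_expectations(good, expectations)  # passes silently
```

Calling `check_expectations(bad, expectations)` raises, naming the failed expectation.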
### Features
* earthmover state tracking: only re-process if source files change (based on file hash)

```yaml
# earthmover
config:
  state_file: ~/.earthmover.csv
```
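
Hash-based state tracking can be sketched as "hash the file, compare with the last recorded hash". The sketch assumes SHA-256 and keeps state in a dict rather than a CSV file; both choices are illustrative and not taken from earthmover:

```python
import hashlib
import os
import tempfile


def file_hash(path):
    """SHA-256 of a file, read in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_rerun(path, state):
    """True if the file's hash differs from the one recorded in `state`."""
    current = file_hash(path)
    changed = state.get(path) != current
    state[path] = current  # remember the hash for the next run
    return changed


# Demo with a temporary file standing in for a source CSV.
state = {}
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "Schools.csv")
    with open(source, "w") as handle:
        handle.write("school_id\n100\n")
    first = needs_rerun(source, state)   # no recorded hash yet
    second = needs_rerun(source, state)  # unchanged file
    with open(source, "a") as handle:
        handle.write("200\n")
    third = needs_rerun(source, state)   # contents changed
```

The first and third checks report a change; the second does not, so that run could be skipped.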
### Features
* lightbeam state tracking: selectively resend payloads

```yaml
# lightbeam
state_dir: ~/.lightbeam/
```

```bash
lightbeam send -c path/to/config.yaml --newer-than 2020-12-25T00:00:00
```
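
One way to read `--newer-than` is "resend payloads whose last-sent time is after the cutoff". The sketch below assumes that reading and a hypothetical state shape (payload hash mapped to last-sent time); lightbeam's real state files may look different:

```python
from datetime import datetime


def payloads_to_resend(last_sent, newer_than):
    """Select payload hashes whose last-sent time is after the cutoff."""
    cutoff = datetime.fromisoformat(newer_than)
    return sorted(h for h, sent in last_sent.items() if sent > cutoff)


# Hypothetical state: payload hash -> time it was last sent.
last_sent = {
    "a1": datetime(2020, 12, 24, 8, 0, 0),
    "b2": datetime(2020, 12, 26, 9, 30, 0),
}
resend = payloads_to_resend(last_sent, "2020-12-25T00:00:00")
```

Only the payload sent after the cutoff is selected for resending.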
### Features
* earthmover supports source files larger than memory (via dask)
  * tested with attendance data of 3GB+
earthmover + lightbeam are public, open-source GitHub repositories available under the Apache 2.0 license. Extensive docs and examples are available at the project repos.
We've also translated ~20 existing Data Import Tool mappings for assessment data into reusable earthmover Ed-Fi bundles, which will be open-sourced soon.

![media/earthmover-assessment-bundles.png](media/earthmover-assessment-bundles.png)
### Conclusion
* EA is already using earthmover + lightbeam
* works nicely in data pipelines (Airflow, Dagster)
* most of the work is building the JSON templates and YAML transformations
* we're working with a partner to do custom mapping for flat data they have from a SIS they don't own
* other use-cases include
  - pre-populating Ed-Fi with custom descriptors
  - converting data to Ed-Fi for analytics without needing an Ed-Fi API/ODS