Version Control Guide for Beginner Data Engineers

Version control with Git is essential for managing collaboration and maintaining code quality in data engineering projects. When several team members work on the same codebase, overlooked updates and conflicting changes are common. Git reduces these problems by keeping a clear record of every code change, and it makes it easy to revert to a previous state when a change turns out to be wrong. This guide is written for beginners like me who are new to Git and version control. 🙂

Git is commonly used with GitHub, one of the most widely adopted platforms for code storage and collaboration. GitHub offers various features such as Pull Requests and code reviews, making it an effective tool for sharing code changes and collaborating efficiently with team members.


Branch Naming Conventions and Management Strategies

Branch Naming Conventions

Branch names should clearly indicate the purpose and type of change. Examples of naming conventions include:

  • project/feature/name_of_feature: Used for developing new features.
  • project/fix/name_of_fix: Used for fixing bugs.
  • project/refactor/name_of_refactor: Used for making structural changes to the code.

For example, to develop a new pipeline feature in “project_a”:

git checkout -b project_a/feature/add_pipeline
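
A naming convention is easier to follow when it can be checked automatically. The sketch below is one possible check and is not part of Git itself: is_valid_branch_name is a hypothetical helper, and the allowed characters (lowercase letters, digits, underscores) are an assumption based on the examples above.

```shell
# Hypothetical helper: succeeds only if the name follows
# project/<feature|fix|refactor>/<description>.
# Assumption: each part uses lowercase letters, digits, and underscores.
is_valid_branch_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9_]+/(feature|fix|refactor)/[a-z0-9_]+$'
}

# Usage:
#   if is_valid_branch_name "project_a/feature/add_pipeline"; then
#     git checkout -b "project_a/feature/add_pipeline"
#   fi
```

A check like this could run in a local Git hook or a CI step, so that a badly named branch is caught before review.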

Branch Management Strategies

This guide assumes a project with two long-lived branches:

  1. prod: The branch for stable code used in the production environment.
  2. dev: The branch for testing and development work.

New features should always be branched from prod, tested in dev, and then merged into prod.

Steps for Creating and Managing Branches

  1. Create a Branch from prod:
    • The git checkout command switches between branches or creates a new one: git checkout prod switches to the prod branch, and the -b option creates a new branch and switches to it. Example commands to create a branch for a new feature:
git checkout prod
git pull
git checkout -b feature/topic_1

2. Commit Changes and Push to Remote Repository:

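  • After writing code, commit the changes to the local repository, then push the branch to the remote. The -a option stages only files Git already tracks, so run git add on any new files first: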

git commit -am "Add new feature"
git push -u origin feature/topic_1

3. Merge and Test in dev:

  • Create a Pull Request (PR) to merge changes into dev.
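  • If conflicts arise, resolve them in a new branch (e.g., feature/topic_1_dev). Git cannot create a branch named feature/topic_1/dev while feature/topic_1 exists, so use a suffix rather than a nested name.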

4. Deploy to prod:

  • Once testing is complete, merge the changes from dev into prod.
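
Steps 1 and 2 above can be wrapped in two small shell functions. This is only a sketch: start_branch_from_prod and commit_and_push are hypothetical names, the remote is assumed to be called origin, and git add -A assumes that a .gitignore already excludes anything that should not be committed.

```shell
# Hypothetical helpers; assumes the remote is named "origin".

# Step 1: create a new branch from an up-to-date prod.
start_branch_from_prod() {
  new_branch="$1"
  git checkout prod || return 1
  # --ff-only stops instead of creating a merge commit if local prod has diverged
  git pull --ff-only origin prod || return 1
  git checkout -b "$new_branch"
}

# Step 2: stage, commit, and push the current branch.
commit_and_push() {
  message="$1"
  branch=$(git symbolic-ref --short HEAD) || return 1
  # "git add -A" also stages new (untracked) files, which "git commit -a" skips
  git add -A || return 1
  git commit -m "$message" || return 1
  # -u sets the upstream so later "git push" and "git pull" need no arguments
  git push -u origin "$branch"
}

# Usage:
#   start_branch_from_prod feature/topic_1
#   ... edit files ...
#   commit_and_push "add new feature"
```

Each function stops at the first failing command, so a failed pull never leads to a branch created from stale code.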

    Why Always Start from prod?

    The prod branch contains the most stable and up-to-date code. When developing new features or fixing bugs, it is crucial to base your work on stable code to avoid unexpected errors. In contrast, the dev branch may include experimental or incomplete code, increasing the risk of complications. Thus, starting from prod ensures a more reliable foundation for your work.


    Guidelines for Commits and Pull Requests

    Writing Effective Commits

    Commit messages should be clear and concise, following these rules:

    • Write in the imperative mood: Describe what the commit will do.
      • Example: add google analytics extraction pipeline
      • Example: fix surplus revenue bug
    • Commit small and often: Save progress whenever the code is functional to enhance traceability.
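
The imperative-mood rule can be roughly checked before committing. The function below is a heuristic sketch under a simple assumption: a first word ending in "ed" or "ing" (added, fixing) is probably not imperative. check_commit_subject is a hypothetical name, and the rule will misjudge some verbs (for example "embed"), so treat it as a reminder rather than a strict gate.

```shell
# Hypothetical helper: rough check that a commit subject starts with an
# imperative verb. Heuristic only; it flags first words ending in "ed" or "ing".
check_commit_subject() {
  subject="$1"
  if [ -z "$subject" ]; then
    echo "commit subject is empty"
    return 1
  fi
  first_word=${subject%% *}   # everything before the first space
  case "$first_word" in
    *ed|*ing)
      echo "subject starts with '$first_word'; try the imperative form instead"
      return 1
      ;;
  esac
  return 0
}

# Usage:
#   check_commit_subject "add google analytics extraction pipeline" \
#     && git commit -m "add google analytics extraction pipeline"
```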

    Creating Effective Pull Requests

    Pull Requests (PRs) are essential for collaboration and code review. Include the following details in a PR:

    1. Purpose and Context:
      • Example: “This PR adds a Google Analytics pipeline.”
    2. Relevant Links:
      • Ticket system links, documentation (e.g., dbt docs), or related PRs.
    3. Merge and Deployment Instructions:
      • Indicate if a full-refresh or any special actions are required.
    4. Review and Merge Criteria:
      • At least one team member approval.
      • All tests must pass.
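
One way to make sure every PR carries these details is a pull request template. GitHub pre-fills new PRs from a file named pull_request_template.md, which can live in the .github folder of the repository. The sketch below writes such a file; write_pr_template is a hypothetical helper, and the section wording is only a suggestion based on the list above.

```shell
# Hypothetical helper: write a PR template into <repo_root>/.github/.
# The headings mirror the four items listed above.
write_pr_template() {
  repo_root="${1:-.}"
  mkdir -p "$repo_root/.github" || return 1
  cat > "$repo_root/.github/pull_request_template.md" <<'EOF'
## Purpose and context
<!-- e.g. "This PR adds a Google Analytics pipeline." -->

## Relevant links
<!-- Ticket, documentation (e.g. dbt docs), related PRs -->

## Merge and deployment instructions
<!-- Is a full-refresh or any other special action required? -->

## Review and merge criteria
- [ ] At least one team member has approved
- [ ] All tests pass
EOF
}

# Usage (from the repository root):
#   write_pr_template .
#   git add .github/pull_request_template.md
```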

    Conflict Resolution and Branch Cleanup

    Resolving Conflicts

    Conflicts can occur when merging a PR into dev. Follow these steps to resolve conflicts:

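    1. Create a new branch for conflict resolution from feature/topic_1. Git cannot create a branch named feature/topic_1/dev while feature/topic_1 exists, so the new branch uses an _dev suffix:
    git checkout feature/topic_1
    git checkout -b feature/topic_1_dev
    2. Merge dev into the new branch and resolve the conflicts:
    git merge dev
    3. Commit the resolution, push the new branch, and merge it into dev through a PR.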
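
The conflict-resolution flow can be sketched as a small function. Two assumptions are worth stating loudly: the resolution branch is created before merging dev, so that the merge happens on the new branch, and it is named with an _dev suffix because Git cannot create feature/topic_1/dev while feature/topic_1 exists. start_conflict_resolution is a hypothetical name.

```shell
# Hypothetical helper: create a resolution branch from the feature branch,
# merge dev into it, and list the files that still need manual resolution.
start_conflict_resolution() {
  feature_branch="$1"
  resolution_branch="${feature_branch}_dev"   # assumed naming, e.g. feature/topic_1_dev
  git checkout "$feature_branch" || return 1
  git checkout -b "$resolution_branch" || return 1
  # "git merge" exits non-zero when it stops on conflicts
  if ! git merge dev; then
    echo "Conflicted files:"
    git diff --name-only --diff-filter=U
    return 1
  fi
}

# Usage:
#   start_conflict_resolution feature/topic_1
#   ... edit the listed files and remove the conflict markers ...
#   git add <resolved files>
#   git commit
#   git push -u origin feature/topic_1_dev
```

After the push, the resolution branch can be merged into dev through a PR in the usual way.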

    Cleaning Up Branches

    Remove completed branches to maintain a clean codebase:

    • Delete local branches:
    git branch -d feature/topic_1
    • Delete remote branches:
    git push origin --delete feature/topic_1
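
Before deleting anything, it helps to see which local branches are already merged. The sketch below lists branches whose commits are all contained in prod; list_merged_branches is a hypothetical helper. It assumes branches are merged with regular merge commits or fast-forwards, since squash-merged branches will not show up as merged.

```shell
# Hypothetical helper: print local branches fully merged into prod,
# leaving out the long-lived prod and dev branches themselves.
list_merged_branches() {
  git branch --merged prod --format='%(refname:short)' | grep -Ev '^(prod|dev)$'
}

# Usage:
#   list_merged_branches
#   git branch -d feature/topic_1    # -d refuses to delete unmerged work
```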

    Version Control Guide Summary

    1. Create a New Branch:
      • Always start from prod.
    2. Commit Changes:
      • Save changes to the local repository.
    3. Share Your Work:
      • Push changes to the remote repository.
    4. Test in dev:
      • Create a PR and merge into dev for testing.
    5. Deploy to prod:
      • Merge verified code into prod.
    6. Clean Up Branches:
      • Remove local and remote branches after completing work.

    Using Git and version control effectively enhances collaboration and code quality in data engineering projects. Follow this guide to learn the basics of Git and apply it to your projects with confidence.