topic: Python and Spark

Hard Prerequisites
IMPORTANT: Please review these prerequisites; they include important information that will help you with this content.
  • PROJECTS: Data Wrangling
  • PROJECTS: Understanding map reduce
  • SQL: Shop Database using sql

    As a Data Engineer, you will often need to process large data sets. Apache Spark is an open-source, general-purpose distributed processing system built for big data workloads.

    Apache Spark is written in Scala, but you can drive it from Python using the PySpark package.
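
To make that concrete, here is a minimal sketch of a word count driven from Python. It assumes PySpark and a Java runtime are installed locally. The function name `count_words`, the app name, and the single-threaded `local[1]` master are illustrative choices, not part of this course's material. The map and reduce steps mirror the ideas from the map reduce prerequisite.

```python
def count_words(lines):
    """Count word occurrences in `lines` using a local Spark session.

    Hypothetical helper for illustration; requires PySpark and Java.
    """
    # Imported inside the function so this file can still be loaded
    # on a machine where PySpark is not installed yet.
    from pyspark.sql import SparkSession

    # "local[1]" runs Spark inside this process with one worker thread,
    # so no cluster is needed to try the example.
    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("word-count-sketch")  # assumed name, any string works
        .getOrCreate()
    )
    try:
        # Distribute the Python list as an RDD (Spark's low-level collection).
        rdd = spark.sparkContext.parallelize(lines)
        counts = (
            rdd.flatMap(lambda line: line.split())   # one record per word
               .map(lambda word: (word, 1))          # map step: (word, 1) pairs
               .reduceByKey(lambda a, b: a + b)      # reduce step: sum per word
        )
        # Bring the (small) result back to the driver as a plain dict.
        return counts.collectAsMap()
    finally:
        # Always release the session, even if the job fails.
        spark.stop()


if __name__ == "__main__":
    print(count_words(["spark is fast", "spark is distributed"]))
```

The Python code only describes the transformations; the Spark engine running on the JVM executes them, which is why the same script can later be pointed at a real cluster by changing the master setting.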

    Resources

    Real Python has a great introduction to PySpark.

    This is a good tutorial to get you started with PySpark. It’ll take you from zero to hero.

