AWS Certified Data Engineer Associate (DEA-C01) Review Material – Fundamentals

Types of Data

  • Structured
    • It has a standardized format for efficient access by software and humans alike.
    • It is typically tabular, with rows and columns that clearly define data attributes, such as in databases and CSV files.
    • It can be queried easily.
    • It has the same attributes for all data values.
  • Unstructured
    • It has no predefined schema or attributes.
    • It comes in different formats, such as audio and video files, emails, and large text documents.
  • Semi-structured
    • It sits between structured and unstructured data: it has no rigid schema, but it carries tags or keys that describe its elements.
    • Examples are XML, JSON, and log files.
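
As a rough illustration of the semi-structured case, the sketch below parses two JSON records whose attributes differ and flattens them into a table. The records and field names are hypothetical, chosen only for this example.

```python
import json

# Hypothetical records: each one is self-describing, but the attributes differ.
raw = """
[
  {"id": 1, "name": "Ana", "email": "ana@example.com"},
  {"id": 2, "name": "Ben", "tags": ["vip", "beta"]}
]
"""

records = json.loads(raw)

# Collect every attribute that appears in at least one record.
all_keys = sorted({key for record in records for key in record})

# Flatten into a table; attributes a record lacks become None.
rows = [[record.get(key) for key in all_keys] for record in records]

print(all_keys)
for row in rows:
    print(row)
```

Structured data would not need the `None` padding, because every row already has the same attributes.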

Properties of Data

  • The Three (3) V’s: (1) Volume, (2) Velocity, and (3) Variety
  • Volume:
    • Refers to the amount of data.
  • Velocity:
    • Refers to the speed at which the data is received, collected and processed.
  • Variety:
    • Refers to the different types and formats of data, e.g.
      • ORC
        • Columnar storage.
        • Provides an efficient way to store Hive data.
        • ORC files are often smaller than comparable Parquet files.
        • ORC indexes can make querying faster.
        • Supports complex types such as structs, maps, and lists.
      • Parquet
        • Columnar storage
        • Efficient data compression and encoding schemes.
        • It is ideal for running complex queries and processing large amounts of data. 
      • Avro
        • An open-source object container file format.
        • Row-based storage.
        • Stores the data definition (schema) in JSON, so the data can be easily read and interpreted.
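
The difference between row-based storage (Avro) and columnar storage (ORC, Parquet) can be sketched in plain Python. This is only a conceptual model of the two layouts, not the actual file formats, and the records are hypothetical.

```python
# Hypothetical records used only for illustration.
records = [
    {"order_id": 1, "region": "us-east-1", "amount": 20.0},
    {"order_id": 2, "region": "eu-west-1", "amount": 35.5},
    {"order_id": 3, "region": "us-east-1", "amount": 12.5},
]

# Row-based layout: all values of one record are stored together.
row_layout = [tuple(record.values()) for record in records]

# Columnar layout: all values of one column are stored together.
column_layout = {name: [record[name] for record in records] for name in records[0]}

# An aggregate over a single column reads only that column's values.
total_amount = sum(column_layout["amount"])

print(row_layout)
print(column_layout)
print(total_amount)
```

Row layouts suit workloads that write or read whole records, while column layouts suit analytical queries that scan a few columns across many rows.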

Data Warehousing

  • Data Warehouse:
    • Centralized storage of structured data.
    • Data comes from multiple sources.
    • Used for complex queries, analysis, and BI.
    • Usually uses a star or snowflake schema.
    • Schema-on-write
  • Data Lake:
    • Centralized storage of structured, semi-structured, or unstructured data at scale.
    • Data is usually stored in its original (raw) form.
    • It can accommodate all types of data.
    • Examples are Amazon S3 and HDFS.
    • Used for analytics and machine learning.
    • Schema-on-read
  • Data Lakehouse:
    • Combines the features of a data warehouse and a data lake.
    • Stores structured, semi-structured, or unstructured data at scale.
    • Supports both schema-on-write and schema-on-read.
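
Schema-on-write and schema-on-read can be contrasted with a minimal sketch. The schema, function names, and in-memory lists below are assumptions made for illustration; real warehouses and lakes enforce schemas in their own ways.

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"user_id": int, "amount": float}


def conforms(record, schema=SCHEMA):
    """Return True if the record has exactly the schema's fields and types."""
    return (
        isinstance(record, dict)
        and set(record) == set(schema)
        and all(isinstance(record[name], kind) for name, kind in schema.items())
    )


def warehouse_write(table, record):
    """Schema-on-write: reject non-conforming records before storing them."""
    if not conforms(record):
        raise ValueError(f"record does not match schema: {record!r}")
    table.append(record)


def lake_write(objects, raw_text):
    """Data lake ingestion: store the raw text as-is, with no checks."""
    objects.append(raw_text)


def lake_read(objects, schema=SCHEMA):
    """Schema-on-read: parse and apply the schema only when querying."""
    for raw_text in objects:
        try:
            record = json.loads(raw_text)
        except json.JSONDecodeError:
            continue  # unreadable under this schema; the raw object stays stored
        if conforms(record, schema):
            yield record


warehouse, lake = [], []
warehouse_write(warehouse, {"user_id": 1, "amount": 9.5})
lake_write(lake, '{"user_id": 1, "amount": 9.5}')
lake_write(lake, "free-form log line")  # accepted at write time
valid_rows = list(lake_read(lake))
print(valid_rows)
```

The warehouse pays the validation cost once at load time; the lake defers it, so the same raw objects can later be read under a different schema.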

Data Sampling

  • Random Sampling:
    • A random sample from your dataset is chosen so that each element has an equal probability of being selected.
    • Use random sampling if you want to do quick approximate calculations to understand your dataset.
    • The random samples may not include all outliers and edge cases.
  • Stratified:
    • Data are divided into strata based on particular characteristics or criteria.
    • The size of each stratum in the sample is proportional to the size of that stratum in the population.
    • Useful for understanding how different groups in your data compare.
    • Ensures appropriate representation of each group.
    • Appropriate for heterogeneous populations.
  • Systematic:
    • Starts from a random point, then selects data points at regular intervals.
  • Cluster:
    • Data is divided into smaller groups called clusters.
    • Randomly select clusters to form a sample, then take all the data points from the selected clusters.
    • Appropriate for populations that are geographically distributed.
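
The four sampling techniques above can be sketched with Python's standard `random` module. The helper names, the `region` field, and the dataset are hypothetical; this is one straightforward reading of each technique, not a reference implementation.

```python
import random
from collections import defaultdict


def random_sample(data, k, rng):
    """Simple random sampling: every element is equally likely to be chosen."""
    return rng.sample(data, k)


def stratified_sample(data, key, fraction, rng):
    """Sample the same fraction from each stratum (at least one element each)."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def systematic_sample(data, step, rng):
    """Pick a random starting point, then take every step-th element."""
    start = rng.randrange(step)
    return data[start::step]


def cluster_sample(data, key, n_clusters, rng):
    """Pick whole clusters at random and keep every element inside them."""
    clusters = defaultdict(list)
    for item in data:
        clusters[key(item)].append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


# Hypothetical dataset: 100 records spread unevenly over three regions.
regions = ["north"] * 60 + ["south"] * 30 + ["east"] * 10
data = [{"id": i, "region": region} for i, region in enumerate(regions)]


def by_region(item):
    return item["region"]


rng = random.Random(42)  # seeded so the run is repeatable
print(len(random_sample(data, 10, rng)))
print(len(stratified_sample(data, by_region, 0.1, rng)))
print([item["id"] for item in systematic_sample(data, 10, rng)])
print(len(cluster_sample(data, by_region, 1, rng)))
```

Note how the stratified sample keeps the 60/30/10 proportions of the regions, while the cluster sample contains a single region in full.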

Data Skewness

Data Validation and Profiling

  • Data Validation:
    • Ensures that data is accurate, complete, consistent, and adheres to the predefined schema or structure.
  • Data Profiling:
    • The process of examining, analyzing and understanding the characteristics, quality and structure of the data.
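
A minimal sketch of both ideas follows: `validate` checks one record against an expected schema, and `profile` summarizes one field across all records. The schema, field names, and records are hypothetical.

```python
from collections import Counter

# Hypothetical schema: field name -> expected Python type.
EXPECTED_SCHEMA = {"id": int, "country": str, "amount": float}


def validate(record, schema=EXPECTED_SCHEMA):
    """Validation: list the ways a record violates the expected schema."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record or record[field] is None:
            errors.append(f"{field}: missing")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors


def profile(records, field):
    """Profiling: summarize completeness and distribution of one field."""
    values = [record.get(field) for record in records]
    present = [value for value in values if value is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1),
    }


records = [
    {"id": 1, "country": "PH", "amount": 10.0},
    {"id": 2, "country": "PH", "amount": None},   # incomplete record
    {"id": "3", "country": "US", "amount": 7.5},  # wrong type for id
]

for record in records:
    print(record["id"], validate(record))
print(profile(records, "country"))
print(profile(records, "amount"))
```

Validation answers "is this record acceptable?", while profiling describes what the dataset looks like, which often informs the validation rules in the first place.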
