AWS Certified Data Engineer Associate (DEA-C01) Review Material – Fundamentals

Structured:
- It has a standardized format for efficient access by software and humans alike.
- It is typically tabular, with rows and columns that clearly define data attributes, such as in databases and CSV files.
- It can be queried easily.
- It has the same attributes for all data values.
Unstructured:
- It has no predefined schema or attributes.
- It comes in different formats, such as audio and video files, emails, and large text documents.
Semi-structured:
- It sits between structured and unstructured data.
- It has no fixed tabular schema, but it carries tags or keys that give the data some organization.
- Examples are XML, JSON, and log files.
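
The difference between structured and semi-structured data can be seen in a minimal sketch like the one below. The sample CSV and JSON strings are made up for illustration. It checks whether every record carries the same set of attributes.

```python
import csv
import io
import json

# Structured: every row has the same attributes (columns).
# The sample values below are hypothetical.
csv_text = "id,name,city\n1,Ana,Manila\n2,Ben,Cebu\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
csv_columns = [set(row.keys()) for row in rows]

# Semi-structured: records describe themselves with keys, but fields may vary.
json_text = '[{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben", "tags": ["vip"]}]'
records = json.loads(json_text)
json_fields = [set(record.keys()) for record in records]

same_csv_attributes = all(cols == csv_columns[0] for cols in csv_columns)
same_json_attributes = all(fields == json_fields[0] for fields in json_fields)

print(same_csv_attributes)   # True: all CSV rows share the same columns
print(same_json_attributes)  # False: the second JSON record has an extra key
```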

The Three (3) V’s: (1) Volume, (2) Velocity, and (3) Variety
Volume:
- Refers to the amount of data
Velocity:
- Refers to the speed at which the data is received, collected and processed.
Variety:
- Refers to the different types of data, e.g.:
  - ORC
    - Columnar storage
    - Provides an efficient way to store Hive data.
    - ORC files are often smaller than Parquet files.
    - ORC indexes can make querying faster.
    - Supports complex types such as structs, maps, and lists.
  - Parquet
    - Columnar storage
    - Efficient data compression and encoding schemes.
    - It is ideal for running complex queries and processing large amounts of data.
  - Avro
    - An open-source object container file format.
    - Row-based storage.
    - Stores data definition in JSON so data can be easily read and interpreted.
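
The row-based versus columnar distinction above can be pictured with plain Python structures. This is only a conceptual sketch: it does not use any real ORC, Parquet, or Avro library, and the sample records are hypothetical.

```python
# Hypothetical sample records for illustration.
records = [
    {"order_id": 1, "region": "east", "amount": 10.0},
    {"order_id": 2, "region": "west", "amount": 25.5},
    {"order_id": 3, "region": "east", "amount": 4.5},
]

# Row-based layout (the idea behind Avro): the values of one record sit together.
row_layout = [tuple(r.values()) for r in records]

# Columnar layout (the idea behind ORC and Parquet): the values of one column sit together.
column_layout = {key: [r[key] for r in records] for key in records[0]}

# An aggregate over a single column only needs to touch that column's values.
total_amount = sum(column_layout["amount"])

print(row_layout[0])            # (1, 'east', 10.0)
print(column_layout["region"])  # ['east', 'west', 'east']
print(total_amount)             # 40.0
```

Keeping similar values next to each other is also part of why columnar formats tend to compress well.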

Data Warehouse:
- Centralized storage of structured data.
- Data comes from multiple sources.
- Used for complex queries, analysis, and BI.
- Usually uses a star or snowflake schema.
- Schema-on-write
Data Lake:
- Centralized storage of structured, semi-structured, and unstructured data at scale.
- Data is usually stored in its original (raw) form.
- It can accommodate all types of data.
- Examples are S3 and HDFS.
- Used for analytics and machine learning.
- Schema-on-read
Data Lakehouses:
- Combines the features of a Data Warehouse and a Data Lake.
- Stores structured, semi-structured, and unstructured data at scale.
- Supports both schema-on-write and schema-on-read.
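
Schema-on-write and schema-on-read can be contrasted with a small sketch. The `SCHEMA` dictionary and the helper names `write_with_schema` and `read_with_schema` are hypothetical, chosen only for illustration; real warehouses and lakes enforce schemas in their own ways.

```python
import json

SCHEMA = {"id": int, "amount": float}  # hypothetical schema for illustration


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields and types."""
    return set(record) == set(schema) and all(
        isinstance(record[field], expected) for field, expected in schema.items()
    )


def write_with_schema(table, record):
    """Schema-on-write: validate the record before it is stored."""
    if not conforms(record, SCHEMA):
        raise ValueError(f"record does not match schema: {record}")
    table.append(record)


def read_with_schema(raw_lines):
    """Schema-on-read: raw text is stored as-is; the schema is applied when reading."""
    for line in raw_lines:
        raw = json.loads(line)
        try:
            yield {"id": int(raw["id"]), "amount": float(raw["amount"])}
        except (KeyError, TypeError, ValueError):
            continue  # skip records that cannot be interpreted with this schema


warehouse = []
write_with_schema(warehouse, {"id": 1, "amount": 9.5})

lake = ['{"id": "2", "amount": "3.25", "note": "raw"}', '{"id": 3}']
parsed = list(read_with_schema(lake))
print(parsed)  # [{'id': 2, 'amount': 3.25}]
```

In the write path a bad record is rejected up front, while in the read path everything is stored and the interpretation happens later.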

Random Sampling:
- A random sample from your dataset is chosen so that each element has an equal probability of being selected.
- Use random sampling if you want to do quick approximate calculations to understand your dataset.
- The random samples may not include all outliers and edge cases.
Stratified:
- Data are divided into strata based on particular characteristics or criteria.
- The size of each stratum in the sample is proportional to the size of that stratum in the population.
- Useful for understanding how different groups in your data compare.
- Ensures appropriate representation of each group.
- Appropriate for heterogeneous populations.
Systematic:
- Picks a random starting point, then selects data points at regular intervals (every k-th element).
Cluster:
- Data is divided into smaller groups called clusters.
- Clusters are selected at random, and all data points in the selected clusters form the sample.
- Appropriate for populations that are geographically distributed.
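
The four sampling techniques above can be sketched with Python's standard `random` module. The function names, the `region` field, and the sample dataset are assumptions made for illustration, not part of any AWS service.

```python
import random
from collections import defaultdict


def random_sample(data, k, rng):
    """Simple random sampling: every element is equally likely to be chosen."""
    return rng.sample(data, k)


def stratified_sample(data, key, fraction, rng):
    """Sample each stratum in proportion to its size in the population."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def systematic_sample(data, step, rng):
    """Random starting point, then every step-th element."""
    start = rng.randrange(step)
    return data[start::step]


def cluster_sample(data, key, n_clusters, rng):
    """Pick whole clusters at random and keep all of their members."""
    clusters = defaultdict(list)
    for item in data:
        clusters[key(item)].append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Hypothetical dataset: 80 records in "north", 20 in "south".
data = [{"id": i, "region": "north" if i < 80 else "south"} for i in range(100)]

by_region = lambda d: d["region"]
simple = random_sample(data, 10, rng)
stratified = stratified_sample(data, by_region, 0.1, rng)
systematic = systematic_sample(data, 10, rng)
clustered = cluster_sample(data, by_region, 1, rng)
```

With a 10% fraction, the stratified sample keeps the 80/20 split as 8 and 2 records, while the cluster sample contains every record from a single region.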

Data Skew:
- Refers to the asymmetry in the distribution of values around the mean.
- Data is skewed when its values are not distributed symmetrically around the mean.
- Negative skew: long left tail.
- Positive skew: long right tail.
https://en.wikipedia.org/wiki/File:Negative_and_positive_skew_diagrams_(English).svg
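
One common way to put a number on skew is the moment coefficient of skewness, the third central moment divided by the 1.5 power of the second. The sketch below uses that definition as an assumption (other definitions exist), with made-up sample lists.

```python
from statistics import fmean


def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population form)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples for illustration.
right_tailed = [1, 2, 2, 3, 3, 3, 20]   # one large value stretches the right tail
left_tailed = [-20, 1, 1, 1, 2, 2, 3]   # one small value stretches the left tail
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_tailed) > 0)  # True: positive skew
print(skewness(left_tailed) < 0)   # True: negative skew
print(skewness(symmetric))         # 0.0
```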

Data Validation:
- Ensures that data is accurate, complete, consistent, and adheres to the predefined schema or structure.
Data Profiling:
- The process of examining, analyzing and understanding the characteristics, quality and structure of the data.
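
As a rough illustration of the two ideas, validation checks each record against an expected schema, while profiling summarizes what the data actually looks like. The `SCHEMA`, the `validate` and `profile` helpers, and the sample records are hypothetical.

```python
SCHEMA = {"id": int, "email": str}  # hypothetical schema for illustration


def validate(record, schema):
    """Validation: list the ways a record violates the expected schema."""
    errors = []
    for field, expected in schema.items():
        if field not in record or record[field] is None:
            errors.append(f"{field}: missing")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors


def profile(records, field):
    """Profiling: summarize one field (row count, nulls, distinct values)."""
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
    }


records = [
    {"id": 1, "email": "a@example.com"},
    {"id": "2", "email": None},
    {"id": 3, "email": "a@example.com"},
]

print(validate(records[1], SCHEMA))  # ['id: expected int', 'email: missing']
print(profile(records, "email"))     # {'count': 3, 'nulls': 1, 'distinct': 1}
```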