Types of Data
- Structured
- It has a standardized format for efficient access by software and humans alike.
- It is typically tabular, with rows and columns that clearly define data attributes, such as in databases and CSV files.
- It can be queried easily.
- It has the same attributes for all data values.
- Unstructured
- It has no predefined schema or attributes.
- It comes in different formats, such as audio and video files, emails, and large text documents.
- Semi-structured
- Sits between structured and unstructured data.
- It has some organizing markers, such as tags or keys, but no rigid tabular schema.
- Examples are XML, JSON, and log files.
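To make the difference concrete, here is a minimal sketch of semi-structured data being flattened into a structured, tabular shape. The log lines and field names (`user`, `action`, `file`) are made up for illustration.

```python
import json

# Hypothetical log records: each line is valid JSON, but the fields vary.
raw_lines = [
    '{"user": "ana", "action": "login"}',
    '{"user": "ben", "action": "upload", "file": {"name": "a.csv", "size_kb": 12}}',
]

records = [json.loads(line) for line in raw_lines]

# Collect the union of top-level attributes. Structured data would have
# the same attributes in every record; here the "file" key is optional.
all_keys = sorted({key for record in records for key in record})

# Flatten into a fixed set of columns, filling the gaps with None.
rows = [[record.get(key) for key in all_keys] for record in records]

print(all_keys)
for row in rows:
    print(row)
```

The first record has no `file` attribute, so its row carries `None` in that column. That gap-filling step is what a fixed tabular schema forces on data that does not share the same attributes.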
Properties of Data
- The three V’s: (1) Volume, (2) Velocity, and (3) Variety
- Volume:
- Refers to the amount of data.
- Velocity:
- Refers to the speed at which the data is received, collected and processed.
- Variety:
- Refers to the different types and formats of data, e.g.:
- ORC
- Provides an efficient way to store Hive data.
- ORC files are often smaller than Parquet files.
- ORC indexes can make querying faster.
- Supports complex types such as structs, maps, and lists.
- Parquet
- Columnar storage
- Efficient data compression and encoding schemes.
- It is ideal for running complex queries and processing large amounts of data.
- Avro
- An open-source object container file format.
- Row-based storage.
- Stores data definition in JSON so data can be easily read and interpreted.
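The difference between row-based and columnar storage can be sketched in plain Python. This is only an in-memory illustration of the two layouts, not the actual on-disk encoding of Avro, Parquet, or ORC, and the sample records are made up.

```python
# Hypothetical records used only for illustration.
records = [
    {"id": 1, "city": "Manila", "amount": 10.0},
    {"id": 2, "city": "Cebu", "amount": 25.5},
    {"id": 3, "city": "Manila", "amount": 7.25},
]

# Row-based layout (the idea behind Avro): each record's values sit together.
row_layout = [tuple(r.values()) for r in records]

# Columnar layout (the idea behind Parquet and ORC): each column's values sit together.
column_layout = {name: [r[name] for r in records] for name in records[0]}

# An aggregate over one column only needs to touch that column's values.
total_amount = sum(column_layout["amount"])

# Values within a column are the same type and often repeat, so they are
# easy to encode compactly, for example with a simple dictionary encoding.
dictionary = sorted(set(column_layout["city"]))
encoded_city = [dictionary.index(c) for c in column_layout["city"]]

print(row_layout)
print(column_layout)
print(total_amount, dictionary, encoded_city)
```

Row layouts suit workloads that read or write whole records at a time, while columnar layouts suit analytical queries that scan a few columns across many rows.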
Data Warehousing
- Data Warehouse:
- Centralized storage of structured data.
- Data comes from multiple sources.
- Used for complex queries, analysis, and BI.
- Usually uses a star or snowflake schema.
- Schema-on-write
- Data Lake:
- Centralized storage of structured, semi-structured, and unstructured data at scale.
- Data is usually stored in its original (raw) form.
- It can accommodate all types of data.
- Examples are S3 and HDFS.
- Used for analytics and machine learning.
- Schema-on-read
- Data Lakehouses:
- Combines the features of Data Warehouse and Data Lake.
- Stores structured, semi-structured, and unstructured data at scale.
- Supports both schema-on-write and schema-on-read.
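The difference between schema-on-write and schema-on-read can be sketched with two toy stores. The `Warehouse` and `Lake` classes, the `conforms` helper, and the `SCHEMA` fields are all hypothetical names for illustration. Real systems enforce schemas in far richer ways.

```python
import json

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "name": str}


def conforms(record, schema):
    """True if the record has exactly the schema's fields with the right types."""
    return set(record) == set(schema) and all(
        isinstance(record[field], expected) for field, expected in schema.items()
    )


class Warehouse:
    """Schema-on-write: records are validated before they are stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record):
        if not conforms(record, self.schema):
            raise ValueError(f"record rejected: {record!r}")
        self.rows.append(record)


class Lake:
    """Schema-on-read: raw data is stored as-is and interpreted when read."""

    def __init__(self):
        self.objects = []

    def write(self, raw):
        self.objects.append(raw)  # no checks at write time

    def read(self, schema):
        for raw in self.objects:
            try:
                record = json.loads(raw)
            except json.JSONDecodeError:
                continue  # not parseable under this schema, skip it
            if isinstance(record, dict) and conforms(record, schema):
                yield record


warehouse = Warehouse(SCHEMA)
warehouse.write({"id": 1, "name": "ana"})

lake = Lake()
lake.write('{"id": 2, "name": "ben"}')
lake.write("free-form text that fits no schema")
valid = list(lake.read(SCHEMA))
print(warehouse.rows, valid, len(lake.objects))
```

The warehouse rejects bad records up front. The lake accepts everything and defers the cost of interpretation to each reader.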
Data Sampling
- Random Sampling:
- A random sample from your dataset is chosen so that each element has an equal probability of being selected.
- Use random sampling if you want to do quick approximate calculations to understand your dataset.
- The random samples may not include all outliers and edge cases.
- Stratified:
- Data are divided into strata based on particular characteristics or criteria.
- The size of each stratum in the sample is proportional to the size of that stratum in the population.
- Useful for understanding how different groups in your data compare.
- Ensures appropriate representation of each group.
- Appropriate for heterogeneous populations.
- Systematic:
- Starts from a random point, then selects data points at regular intervals.
- Cluster:
- Data is divided into smaller groups called clusters.
- Randomly select clusters to form a sample, then take all the data points from the selected clusters.
- Appropriate for populations that are geographically distributed.
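The four sampling methods above can be sketched with the standard library. The function names, the toy `population`, and the choice of `region` and `id // 10` as grouping keys are assumptions made for illustration.

```python
import random
from collections import defaultdict


def random_sample(data, n, rng):
    """Simple random sampling: every element is equally likely to be chosen."""
    return rng.sample(data, n)


def stratified_sample(data, key, fraction, rng):
    """Sample the same fraction from every stratum (at least one item each)."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def systematic_sample(data, step, rng):
    """Random starting point, then every step-th element."""
    start = rng.randrange(step)
    return data[start::step]


def cluster_sample(data, key, n_clusters, rng):
    """Pick whole clusters at random and keep every item in them."""
    clusters = defaultdict(list)
    for item in data:
        clusters[key(item)].append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for c in chosen for item in clusters[c]]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Toy population: 80 "north" rows and 20 "south" rows.
population = [
    {"id": i, "region": "north" if i < 80 else "south"} for i in range(100)
]

simple = random_sample(population, 10, rng)
stratified = stratified_sample(population, lambda r: r["region"], 0.1, rng)
systematic = systematic_sample(population, 10, rng)
clustered = cluster_sample(population, lambda r: r["id"] // 10, 2, rng)
print(len(simple), len(stratified), len(systematic), len(clustered))
```

With a 10% fraction, the stratified sample always holds 8 north rows and 2 south rows, mirroring the 80/20 split of the population. The simple random sample carries no such guarantee.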
Data Skewness
Data Validation and Profiling
- Data Validation:
- The process of checking that data is accurate, complete, consistent, and adheres to the predefined schema or structure.
- Data Profiling:
- The process of examining, analyzing and understanding the characteristics, quality and structure of the data.
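The contrast between the two can be sketched as follows: validation checks each record against expectations, while profiling summarizes what the data actually contains. The `validate` and `profile` helpers, the schema, and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "age": int, "country": str}


def validate(record, schema):
    """Return a list of problems found in one record (empty if it is valid)."""
    problems = []
    for field, expected in schema.items():
        if record.get(field) is None:
            problems.append(f"{field}: missing")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems


def profile(records, field):
    """Summarize one field: row count, nulls, distinct values, top value."""
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(1),
    }


records = [
    {"id": 1, "age": 34, "country": "PH"},
    {"id": 2, "age": None, "country": "PH"},
    {"id": 3, "age": "41", "country": "SG"},
]

for record in records:
    print(record["id"], validate(record, SCHEMA))
print(profile(records, "country"))
print(profile(records, "age"))
```

Validation flags the second record for a missing `age` and the third for an `age` stored as a string. Profiling reports the null count and value distribution without judging any single record.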