{"id":822,"date":"2024-11-19T01:55:46","date_gmt":"2024-11-19T01:55:46","guid":{"rendered":"https:\/\/192.168.1.3\/wordpress\/?p=822"},"modified":"2024-12-16T01:33:18","modified_gmt":"2024-12-16T01:33:18","slug":"aws-certified-data-engineer-associate-dea-c01-review-material-fundamentals","status":"publish","type":"post","link":"https:\/\/mylinuxsite.com\/wordpress\/?p=822","title":{"rendered":"AWS Certified Data Engineer Associate (DEA-C01) Review Material &#8211; Fundamentals"},"content":{"rendered":"\n<!--more-->\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Types of Data<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Structured<\/strong><ul><li>It has a standardized format for efficient access by software and humans alike.<\/li><li>It is typically tabular, with rows and columns that clearly define data attributes, such as in databases and CSV files.<\/li><li>It can be queried easily.<\/li><li>Every record has the same attributes.<\/li><\/ul><\/li><li><strong>Unstructured<\/strong><ul><li>It has no predefined schema or attributes.<\/li><li>It comes in different formats, such as audio and video files, emails, and large text documents.<\/li><\/ul><\/li><li><strong>Semi-structured<\/strong><ul><li>It sits between structured and unstructured data.<\/li><li>It has no rigid schema, but it carries tags, keys, or other markers that give it some organization.<\/li><li>Examples are XML, JSON, and log files.<\/li><\/ul><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Properties of Data<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>The Three (3) V&#8217;s:<\/strong> (1) Volume, (2) Velocity, and (3) Variety<\/li><li><strong>Volume:<\/strong><ul><li>Refers to the amount of data.<\/li><\/ul><\/li><li><strong>Velocity:<\/strong><ul><li>Refers to the speed at which the data is received, collected, and processed.<\/li><\/ul><\/li><li><strong>Variety:<\/strong><ul><li>Refers to the different 
types and formats of data, e.g.<ul><li>ORC<ul><li>Columnar storage.<\/li><li>Provides an efficient way to store Hive data.<\/li><li>ORC files are often smaller than Parquet files.<\/li><li>ORC indexes can make querying faster.<\/li><li>Supports complex types such as structs, maps, and lists.<\/li><\/ul><\/li><li>Parquet<ul><li>Columnar storage.<\/li><li>Efficient data compression and encoding schemes.<\/li><li>It is ideal for running complex queries and processing large amounts of data.<\/li><\/ul><\/li><li>Avro<ul><li>An open-source object container file format.<\/li><li>Row-based storage.<\/li><li>Stores the schema (data definition) in JSON, so the data can be easily read and interpreted.<\/li><\/ul><\/li><\/ul><\/li><\/ul><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Data Warehousing<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Data Warehouse<\/strong>:<ul><li>Centralized storage of structured data.<\/li><li>Data comes from multiple sources.<\/li><li>Used for complex queries, analysis, and business intelligence (BI).<\/li><li>Usually uses a star or snowflake schema.<\/li><li><strong>Schema-on-write<\/strong><\/li><\/ul><\/li><li><strong>Data Lake:<\/strong><ul><li>Centralized storage of structured, semi-structured, and unstructured data at scale.<\/li><li>Data is usually stored in its original (raw) form.<\/li><li>It can accommodate all types of data.<\/li><li>Examples are S3 and HDFS.<\/li><li>Used for analytics and machine learning.<\/li><li><strong>Schema-on-read<\/strong><\/li><\/ul><\/li><li><strong>Data Lakehouse:<\/strong><ul><li>Combines the features of a data warehouse and a data lake.<\/li><li>Stores structured, semi-structured, and unstructured data at scale.<\/li><li>Supports both <strong>schema-on-write<\/strong> and <strong>schema-on-read<\/strong>.<\/li><\/ul><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Data Sampling<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Random Sampling:<\/strong><ul><li>A random sample from your dataset is 
chosen so that each element has an equal probability of being selected.<\/li><li>Use random sampling if you want to do quick approximate calculations to understand your dataset.<\/li><li>A random sample may miss outliers and edge cases.<\/li><\/ul><\/li><li><strong>Stratified<\/strong>:<ul><li>The data is divided into strata (subgroups) based on particular characteristics or criteria.<\/li><li>The size of each stratum in the sample is proportional to the size of that stratum in the population.<\/li><li>Useful for understanding how different groups in your data compare.<\/li><li>Ensures appropriate representation of each group.<\/li><li>Appropriate for heterogeneous populations.<\/li><\/ul><\/li><li><strong>Systematic<\/strong>:<ul><li>Starts at a random point, then selects data points at regular intervals.<\/li><\/ul><\/li><li><strong>Cluster:<\/strong><ul><li>The data is divided into smaller groups called clusters.<\/li><li>Randomly select clusters to form a sample, <em><strong>then take all the data points from the selected clusters.<\/strong><\/em><\/li><li>Appropriate for populations that are geographically distributed.<\/li><\/ul><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Data Skewness<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li>Refers to the asymmetry in the distribution of values around the mean.<\/li><li>Data is skewed when its values are not distributed symmetrically around the mean.<\/li><li>Negative skew &#8211; long left tail<\/li><li>Positive skew &#8211; long right tail<\/li><li><a href=\"https:\/\/en.wikipedia.org\/wiki\/File:Negative_and_positive_skew_diagrams_(English).svg\">https:\/\/en.wikipedia.org\/wiki\/File:Negative_and_positive_skew_diagrams_(English).svg<\/a><\/li><\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Data Validation and Profiling<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Data Validation:<\/strong><ul><li>The process of ensuring that data is accurate, complete, consistent, and adheres to the 
predefined schema or structure.<\/li><\/ul><\/li><li><strong>Data Profiling:<\/strong><ul><li>The process of examining, analyzing and understanding the characteristics, quality and structure of the data.<\/li><\/ul><\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[11],"tags":[],"class_list":["post-822","post","type-post","status-publish","format-standard","hentry","category-aws-review-notes"],"_links":{"self":[{"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/822","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=822"}],"version-history":[{"count":24,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/822\/revisions"}],"predecessor-version":[{"id":1368,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/822\/revisions\/1368"}],"wp:attachment":[{"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=822"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=822"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mylinuxsite.com\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=822"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}