DAMA DMBOK2.0 Knowledge-Point Summary (Chapters 13-17: Data Quality; Big Data and Data Science; Data Management Maturity Assessment; Data Management Organization and Role Expectations; Data Management and Organizational Change Management)
CDMP (Certified Data Management Professional) is a certification for data management professionals established by DAMA International; it is a comprehensive credential covering educational background, work experience, and a professional knowledge exam. These notes summarize all knowledge points, exam points, and past exam questions for the English-language CDMP exam, and are suited to senior professional certification in data management, data governance, and digital transformation. Because there are many chapters and knowledge points, the notes are released in installments: Chapters 1-3 (Data Management, Data Handling Ethics, Data Governance); Chapters 4-6 (Data Architecture, Data Modeling and Design, Data Storage and Operations); Chapters 7-9 (Data Security, Data Integration and Interoperability, Document and Content Management); Chapters 10-12 (Reference and Master Data, Data Warehousing and Business Intelligence, Metadata Management); Chapters 13-17 (Data Quality, Big Data and Data Science, Data Management Maturity Assessment, Data Management Organization and Role Expectations, Data Management and Organizational Change Management). Tags: certification, CDMP, data management, DMBOK, digital transformation, DAMA, digitalization, data management professional
DAMA Knowledge Points: Chapters 13-17
Chapter 13: Data Quality

1. Introduction
1.1. Define
1.1.1. An assumption underlying assertions about the value of data is that the data itself is reliable and trustworthy. In other words, that it is of high quality.
1.1.2. As is the case with Data Governance and with data management as a whole, Data Quality Management is a program, not a project.
It will include both project and maintenance work, along with a commitment to communications and training.
1.1.3. Most importantly, the long-term success of a data quality improvement program depends on getting an organization to change its culture and adopt a quality mindset.
24. Which of these statements is true? A: Data Quality Management is a synonym for Data Governance. B: Data Quality Management only addresses structured data. C: Data Quality Management is the application of technology to data problems. D: Data Quality Management is usually a one-off project. E: Data Quality Management is a continuous process. Correct answer: E. Your answer: E. Explanation: 13.1: As with Data Governance and data management as a whole, Data Quality Management is ongoing work, not a one-off project. It includes both project and maintenance work, along with a commitment to communication and training.
1.2. Business Drivers
1.2.1. The business drivers for establishing a formal Data Quality Management program include:
1. Increasing the value of organizational data and the opportunities to use it
2. Reducing risks and costs associated with poor quality data
3. Improving organizational efficiency and productivity
4. Protecting and enhancing the organization’s reputation
6. Which of the following is NOT an objective of information quality improvement? A: Agile design. B: Improve the reliability of business decisions. C: Customer satisfaction. D: Everyone takes responsibility for data stewardship. E: All. Correct answer: A. Your answer: A. Explanation: A is unrelated. 13.1.1 Business drivers: the drivers for establishing formal Data Quality Management include: 1) increasing the value of organizational data and the opportunities to use it; 2) reducing the risks and costs associated with poor quality data; 3) improving organizational efficiency and productivity; 4) protecting and enhancing the organization's reputation.
1.2.2. In addition, many direct costs are associated with poor quality data. For example,
1. Inability to invoice correctly
2. Increased customer service calls and decreased ability to resolve them
3. Revenue loss due to missed business opportunities
4. Delay of integration during mergers and acquisitions
5. Increased exposure to fraud
6. Loss due to bad business decisions driven by bad data
7. Loss of business due to lack of good credit standing
1.2.3. Still, high quality data is not an end in itself. It is a means to organizational success.
1.3. Goals and Principles
1.3.1. Data Quality programs focus on these general goals:
1. Developing a governed approach to make data fit for purpose based on data consumers’ requirements
2. Defining standards and specifications for data quality controls as part of the data lifecycle
3. Defining and implementing processes to measure, monitor, and report on data quality levels
4. Identifying and advocating for opportunities to improve the quality of data, through changes to processes and systems and engaging in activities that measurably improve the quality of data based on data consumer requirements
1.3.2. Data Quality programs should be guided by the following principles:
1. Criticality:
A Data Quality program should focus on the data most critical to the enterprise and its customers. Priorities for improvement should be based on the criticality of the data and on the level of risk if data is not correct.
2. Lifecycle management:
The quality of data should be managed across the data lifecycle, from creation or procurement through disposal. This includes managing data as it moves within and between systems (i.e., each link in the data chain should ensure data output is of high quality).
3. Prevention:
The focus of a Data Quality program should be on preventing data errors and conditions that reduce the usability of data; it should not be focused on simply correcting records.
4. Root cause remediation:
Improving the quality of data goes beyond correcting errors. Problems with the quality of data should be understood and addressed at their root causes, rather than just their symptoms. Because these causes are often related to process or system design, improving data quality often requires changes to processes and the systems that support them.
5. Governance:
Data Governance activities must support the development of high quality data, and Data Quality program activities must support and sustain a governed data environment.
6. Standards-driven:
All stakeholders in the data lifecycle have data quality requirements. To the degree possible, these requirements should be defined in the form of measurable standards and expectations against which the quality of data can be measured.
7. Objective measurement and transparency:
Data quality levels need to be measured objectively and consistently. Measurements and measurement methodology should be shared with stakeholders, since they are the arbiters of quality.
8. Embedded in business processes:
Business process owners are responsible for the quality of data produced through their processes. They must enforce data quality standards in their processes.
9. Systematically enforced:
System owners must systematically enforce data quality requirements.
10. Connected to service levels:
Data quality reporting and issues management should be incorporated into Service Level Agreements (SLAs).
1.4. Essential Concepts
1.4.1. Data Quality
The term data quality refers both to the characteristics associated with high quality data and to the processes used to measure or improve the quality of data.
Data is of high quality to the degree that it meets the expectations and needs of data consumers. That is, if the data is fit for the purposes to which they want to apply it. It is of low quality if it is not fit for those purposes.
Data quality is thus dependent on context and on the needs of the data consumer.
One of the challenges in managing the quality of data is that expectations related to quality are not always known. Customers may not articulate them. Often, the people managing data do not even ask about these requirements.
This needs to be an ongoing discussion, as requirements change over time as business needs and external forces evolve.
1.4.2. Critical Data
Most organizations have a lot of data, not all of which is of equal importance.
One principle of Data Quality Management is to focus improvement efforts on data that is most important to the organization and its customers. Doing so gives the program scope and focus and enables it to make a direct, measurable impact on business needs.
While specific drivers for criticality will differ by industry, there are common characteristics across organizations. Data can be assessed based on whether it is required by:
1. Regulatory reporting
2. Financial reporting
3. Business policy
4. Ongoing operations
5. Business strategy, especially efforts at competitive differentiation
Examples: banking data, ESG data
1.4.3. Data Quality Dimensions
A Data Quality dimension is a measurable feature or characteristic of data.
35. A Data Quality dimension is: A: a core concept in dimensional modelling. B: a valid value in a list. C: a measurable feature or characteristic of data. D: one aspect of data quality used extensively in data governance. E: the value of a particular piece of data. Correct answer: C. Your answer: C. Explanation: 13.1.3: A Data Quality dimension is a measurable feature or characteristic of data.
Many leading thinkers in data quality have published sets of dimensions:
1. The Strong-Wang framework (1996) focuses on data consumers' perceptions of data. It describes 15 dimensions across four general categories of data quality:
Intrinsic DQ
Accuracy
Objectivity
Believability
Reputation
Contextual DQ
Value-added
Relevancy
Timeliness
Completeness
Appropriate amount of data
Representational DQ
Interpretability
Ease of understanding
Representational consistency
Concise representation
Accessibility DQ
Accessibility
Access security
2. In Data Quality for the Information Age (1996), Thomas Redman formulated a set of data quality dimensions rooted in data structure. Within these three general categories (data model, data values, representation), he describes more than two dozen dimensions. They include the following:
Data Model:
Content:
Relevance of data
The ability to obtain the values
Clarity of definitions
Level of detail:
Attribute granularity
Precision of attribute domains
Composition:
Naturalness: The idea that each attribute should have a simple counterpart in the real world and that each attribute should bear on a single fact about the entity
Identifiability: Each entity should be distinguishable from every other entity
Homogeneity
Minimum necessary redundancy
Consistency:
Semantic consistency of the components of the model
Structure consistency of attributes across entity types
Reaction to change:
Robustness
Flexibility
Data Values:
Accuracy
Completeness
Currency
Consistency
Representation:
Appropriateness
Interpretability
Portability
Format precision
Format flexibility
Ability to represent null values
Efficient use of storage
Physical instances of data being in accord with their formats
3. In Improving Data Warehouse and Business Information Quality (1999), Larry English presents a comprehensive set of dimensions divided into two broad categories: inherent and pragmatic. Inherent characteristics are independent of data use. Pragmatic characteristics are associated with data presentation and are dynamic; their value (quality) can change depending on the uses of data.
Inherent quality characteristics
1. Definitional conformance
2. Completeness of values
3. Validity or business rule conformance
4. Accuracy to a surrogate source
5. Accuracy to reality
6. Precision
10. A characteristic of information quality that measures the degree of data granularity is known as: A: aggregation. B: precision. C: scale. D: data set. Correct answer: B. Your answer: D. Explanation: Larry English divides quality characteristics into two broad categories, inherent and pragmatic. Inherent characteristics are independent of data use; pragmatic characteristics are dynamic. 5) Accuracy to reality. 6) Precision.
7. Non-duplication
8. Equivalence of redundant or distributed data
9. Concurrency of redundant or distributed data
Pragmatic quality characteristics
Accessibility
Timeliness
Contextual clarity
Usability
Derivation integrity
Rightness or fact completeness
4. In 2013, DAMA UK produced a white paper describing six core dimensions of data quality:
1. Completeness: The proportion of data stored against the potential for 100%.
2. Uniqueness: No entity instance (thing) will be recorded more than once based upon how that thing is identified.
3. Timeliness: The degree to which data represent reality from the required point in time.
4. Validity: Data is valid if it conforms to the syntax (format, type, range) of its definition.
5. Accuracy: The degree to which data correctly describes the 'real world' object or event being described.
6. Consistency: The absence of difference, when comparing two or more representations of a thing against a definition.
The white paper also describes other characteristics that have an impact on quality, beyond the six core dimensions:
1. Usability: Is the data understandable, simple, relevant, accessible, maintainable and at the right level of precision?
2. Timing issues (beyond timeliness itself): Is it stable yet responsive to legitimate change requests?
3. Flexibility: Is the data comparable and compatible with other data? Does it have useful groupings and classifications? Can it be repurposed? Is it easy to manipulate?
4. Confidence: Are Data Governance, Data Protection, and Data Security processes in place? What is the reputation of the data, and is it verified or verifiable?
5. Value: Is there a good cost / benefit case for the data? Is it being optimally used? Does it endanger people's safety or privacy, or the legal responsibilities of the enterprise? Does it support or contradict the corporate image or the corporate message?
Common Dimensions of Data Quality
1. Accuracy
Accuracy refers to the degree that data correctly represents 'real-life' entities.
2. Completeness
Completeness refers to whether all required data is present. Completeness can be measured at the data set, record, or column level.
8. A measure of information quality completeness is: A: an assessment of the percent of records having a non-null value for a specific field. B: an assessment of having the right level of granularity in the data values. C: that a data value is from the correct domain of values for a field. D: the degree of conformance of data values to its domain. E: All. Correct answer: A. Your answer: D. Explanation: 13.1.3: In 2013, DAMA UK published a white paper describing six core dimensions of data quality. 1) Completeness: the proportion of data stored against the potential for 100%.
3. Consistency
Consistency can refer to ensuring that data values are consistently represented within a data set and between data sets, and consistently associated across data sets.
18. Which of the following is the best example of the data quality dimension of 'consistency'? A: The revenue data in the dataset is always $100 out. B: The customer file has 50% duplicated entries. C: All the records in the CRM have been accounted for in the data warehouse. D: The phone numbers in the customer file do not adhere to the standard format. E: The source data for the end of month report arrived 1 week late. Correct answer: C. Your answer: E. Explanation: 13.1.3: 6) Consistency: the absence of difference when comparing two or more representations of a thing against a definition.
4. Integrity
Data Integrity (or Coherence) includes ideas associated with completeness, accuracy, and consistency.
5. Reasonability
Reasonability asks whether a data pattern meets expectations.
6. Timeliness
Data currency is the measure of whether data values are the most up-to-date version of the information.
7. Uniqueness / Deduplication
Uniqueness states that no entity exists more than once within the data set.
2. For the product ID in the Product table, what information quality measure would be MOST appropriate? A: Official definition. B: Uniqueness. C: Validity. D: Duplicate occurrences. E: Accuracy. Correct answer: B. Your answer: B. Explanation: 13.1.3: 2) Uniqueness: an entity instance should not be recorded more than once based upon how that thing is identified.
23. Which of these is a key process in defining data quality business rules? A: Producing data quality reports / dashboards. B: Matching data from different data sources. C: Producing data management policies. D: De-duplicating data records. E: Separating data that does not meet business needs from data that does. Correct answer: D. Your answer: E. Explanation: uniqueness.
8. Validity
Validity refers to whether data values are consistent with a defined domain of values.
Dimensions include some characteristics that can be measured objectively (completeness, validity, format conformity) and others that depend heavily on context or on subjective interpretation (usability, reliability, reputation).
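To make the objectively measurable dimensions concrete, here is a minimal sketch (assuming Python with pandas; the customer table, column names, and email pattern are invented for illustration, not taken from DMBOK) of scoring completeness, uniqueness, and validity:

```python
# Minimal sketch: scoring three objectively measurable dimensions on an
# invented customer table. Column names and the email rule are illustrative.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "email": ["a@x.com", None, "b@y.com", "not-an-email", "c@z.com"],
})

# Completeness: proportion of non-null values in a required column.
completeness = df["email"].notna().mean()

# Uniqueness: proportion of rows whose identifier occurs exactly once.
uniqueness = (~df["customer_id"].duplicated(keep=False)).mean()

# Validity: proportion of values conforming to a defined syntax (a crude
# email pattern stands in for the domain definition here).
validity = df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean()

print(f"completeness={completeness:.0%} uniqueness={uniqueness:.0%} validity={validity:.0%}")
# completeness=80% uniqueness=60% validity=60%
```

Context-dependent dimensions such as usability or reputation cannot be scored this way; they require stakeholder judgment.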
1.4.4. Data Quality and Metadata
Metadata is critical to managing the quality of data. The quality of data is based on how well it meets the requirements of data consumers. Metadata defines what the data represents.
Having a robust process by which data is defined supports the ability of an organization to formalize and document the standards and requirements by which the quality of data can be measured. Data quality is about meeting expectations. Metadata is a primary means of clarifying expectations.
1.4.5. Data Quality ISO Standard
1. ISO 8000-100
Introduction
2. ISO 8000-110
focused on the syntax, semantic encoding, and conformance to the data specification of Master Data
3. ISO 8000-120
Provenance
4. ISO 8000-130
Accuracy
5. ISO 8000-140
Completeness
6. ISO 8000-150
Data Quality Management Architecture
1.4.6. Data Quality Improvement Lifecycle
This work is often done in conjunction with Data Stewards and other stakeholders.
The Shewhart / Deming cycle (Plan-Do-Check-Act)
Plan
The Data Quality team assesses the scope, impact, and priority of known issues, and evaluates alternatives to address them.
Do
The DQ team leads efforts to address the root causes of issues and plans for ongoing monitoring of data. For root causes that are based on non-technical processes, the DQ team can work with process owners to implement changes. For root causes that require technical changes, the DQ team should work with technical teams to ensure that requirements are implemented correctly and that technical changes do not introduce errors.
Check
Involves actively monitoring the quality of data as measured against requirements.
Act
Activities to address and resolve emerging data quality issues.
Continuous improvement is achieved by starting a new cycle. New cycles begin as:
1. Existing measurements fall below thresholds
2. New data sets come under investigation
3. New data quality requirements emerge for existing data sets
4. Business rules, standards, or expectations change
28. When a data quality team has more issues than they can manage, they should look to: A: hire more people. B: implement data validation rules on data entry systems. C: initiate data quality improvement cycles, focusing on achieving incremental improvements. D: delete any issue that is greater than 6 months old. E: establish a program of quick wins targeting easy fixes over a short time period. Correct answer: C. Your answer: C. Explanation: 13.2.6: The knowledge obtained through the initial assessment forms the basis for specific data quality improvement targets. Improvement can take different forms, from simple remediation (e.g., correcting errors in records) to root-cause improvement. Remediation and improvement plans should account for quick wins (issues that can be addressed immediately at low cost) as well as longer-term strategic changes.
5. The acronym PDCA stands for what quality improvement process? A: Precision-Data-Control-Accuracy. B: Process-Definition-Clarity-Accessibility. C: Project-Derivation-Completeness-Action. D: Plan-Do-Check-Act. E: Plan-Deploy-Check-Act. Correct answer: D. Your answer: D. Explanation: 13.1.3.
20. Which of the following is NOT a stage in the Shewhart/Deming cycle that drives the data quality improvement lifecycle? A: Plan. B: Check. C: Do. D: Investigate. E: Act. Correct answer: D. Your answer: D. Explanation: none provided.
The cost of getting data right the first time is cheaper than the costs from getting data wrong and fixing it later.
Building quality into the data management processes from the beginning costs less than retrofitting it.
Maintaining high quality data throughout the data lifecycle is less risky than trying to improve quality in an existing process. It also creates a far lower impact on the organization.
Establishing criteria for data quality at the beginning of a process or system build is one sign of a mature Data Management Organization.
1.4.7. Data Quality Business Rule Types
Business rules describe how business should operate internally, in order to be successful and compliant with the outside world.
Data Quality Business Rules describe how data should exist in order to be useful and usable within an organization.
Some common simple business rule types are (a sketch implementing a few of them appears after this list):
1. Definitional conformance:
Confirm that the same understanding of data definitions is implemented and used properly in processes across the organization.
7. To continually improve the quality of data and information, an organization must consider which one of the following? A: The clarity and shared acceptance of data definitions. B: Maximize the effective use and value of data and information assets. C: Control the cost of data management. D: Understand the information needs of the enterprise and its stakeholders. E: All. Correct answer: A. Your answer: D. Explanation: 13.1.3: Business rules are typically implemented in software or via document templates for data entry. Some simple common rule types are: 1) Definitional conformance: confirm that the same understanding of data definitions is implemented and used properly across the organization, including agreement on the algorithms for calculated fields, on any temporal or local constraints, and on rollup and status interdependency rules.
2. Value presence and record completeness:
Rules defining the conditions under which missing values are acceptable or unacceptable.
3. Format compliance:
One or more patterns specify values assigned to a data element, such as standards for formatting telephone numbers.
4. Value domain membership:
Specify that a data element's assigned value is included in those enumerated in a defined data value domain.
5. Range conformance:
A data element's assigned value must be within a defined numeric, lexicographic, or time range, such as greater than 0 and less than 100 for a numeric range.
6. Mapping conformance:
Indicating that the value assigned to a data element must correspond to one selected from a value domain that maps to other equivalent corresponding value domain(s).
7. Consistency rules:
Conditional assertions that refer to maintaining a relationship between two (or more) attributes based on the actual values of those attributes.
8. Accuracy verification:
Compare a data value against a corresponding value in a system of record or other verified source.
9. Uniqueness verification:
Rules that specify which entities must have a unique representation and whether one and only one record exists for each represented real world object.
10. Timeliness validation:
Rules that indicate the characteristics associated with expectations for accessibility and availability of data.
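A sketch of how several of these rule types might be expressed as executable checks (assuming Python; the field names, value domain, phone pattern, and numeric range are all hypothetical):

```python
# Illustrative checks for a few simple rule types over a single record.
# Field names, the country domain, phone pattern, and range are hypothetical.
import re

VALID_COUNTRIES = {"US", "GB", "CN"}                  # value domain membership
PHONE_PATTERN = re.compile(r"^\+\d{1,3}-\d{4,14}$")   # format compliance

def check_record(rec: dict) -> list[str]:
    """Return the names of the rules this record violates."""
    violations = []
    # Value presence and record completeness
    if rec.get("customer_name") in (None, ""):
        violations.append("value_presence:customer_name")
    # Format compliance
    if rec.get("phone") and not PHONE_PATTERN.match(rec["phone"]):
        violations.append("format:phone")
    # Value domain membership
    if rec.get("country") not in VALID_COUNTRIES:
        violations.append("domain:country")
    # Range conformance: greater than 0 and less than 100
    pct = rec.get("discount_pct")
    if pct is not None and not (0 < pct < 100):
        violations.append("range:discount_pct")
    return violations

print(check_record({"customer_name": "Acme", "phone": "+1-5550100",
                    "country": "FR", "discount_pct": 120}))
# ['domain:country', 'range:discount_pct']
```

Definitional conformance and accuracy verification are harder to automate because they depend on shared definitions and on access to a verified system of record.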
1.4.8. Common Causes of Data Quality Issues
1. Issues Caused by Lack of Leadership
Many governance and information asset programs are driven solely by compliance, rather than by the potential value to be derived from data as an asset. A lack of recognition on the part of leadership means a lack of commitment within an organization to managing data as an asset, including managing its quality (Evans and Price, 2012).
Barriers to effective management of data quality include:
Lack of awareness on the part of leadership and staff
Lack of business governance
Lack of leadership and management
Difficulty in justification of improvements
Inappropriate or ineffective instruments to measure value
These barriers have negative effects on customer experience, productivity, morale, organizational effectiveness, revenue, and competitive advantage. They increase costs of running the organization and introduce risks as well.
2. Issues Caused by Data Entry Processes
1. Data entry interface issues:
Poorly designed data entry interfaces can contribute to data quality issues. If a data entry interface does not have edits or controls to prevent incorrect data from being put in the system, data processors are likely to take shortcuts, such as skipping non-mandatory fields and failing to update defaulted fields.
2. List entry placement:
Even simple features of data entry interfaces, such as the order of values within a drop-down list, can contribute to data entry errors.
3. Field overloading:
Some organizations re-use fields over time for different business purposes rather than making changes to the data model and user interface. This practice results in inconsistent and confusing population of the fields.
4. Training issues:
Lack of process knowledge can lead to incorrect data entry, even if controls and edits are in place. If data processors are not aware of the impact of incorrect data, or if they are incented for speed rather than accuracy, they are likely to make choices based on drivers other than the quality of the data.
5. Changes to business processes:
Business processes change over time, and with these changes new business rules and data quality requirements are introduced. However, business rule changes are not always incorporated into systems in a timely manner or comprehensively. Data errors will result if an interface is not upgraded to accommodate new or changed requirements. In addition, data is likely to be impacted unless changes to business rules are propagated throughout the entire system.
6. Inconsistent business process execution:
Data created through processes that are executed inconsistently is likely to be inconsistent. Inconsistent execution may be due to training or documentation issues as well as to changing requirements.
3. Issues Caused by Data Processing Functions
1. Incorrect assumptions about data sources:
Production issues can occur due to errors or changes, inadequate or obsolete system documentation, or inadequate knowledge transfer (for example, when SMEs leave without documenting their knowledge). System consolidation activities, such as those associated with mergers and acquisitions, are often based on limited knowledge about the relationship between systems.
2. Stale business rules:
Over time, business rules change. They should be periodically reviewed and updated. If there is automated measurement of rules, the technical process for measuring rules should also be updated.
3. Changed data structures:
Source systems may change structures without informing downstream consumers (both human and system) or without providing sufficient time to account for the changes. This can result in invalid values or other conditions that prevent data movement and loading, or in more subtle changes that may not be detected immediately.
4. Issues Caused by System Design
1. Failure to enforce referential integrity: Referential integrity is necessary to ensure high quality data at an application or system level. If referential integrity is not enforced or if validation is switched off (for example, to improve response times), various data quality issues can arise:
1. Duplicate data that breaks uniqueness rules
2. Orphan rows, which can be included in some reports and excluded from others, leading to multiple values for the same calculation
3. Inability to upgrade due to restored or changed referential integrity requirements
4. Inaccurate data due to missing data being assigned default values
2. Failure to enforce uniqueness constraints: Multiple copies of data instances within a table or file expected to contain unique instances. If there are insufficient checks for uniqueness of instances, or if the unique constraints are turned off in the database to improve performance, data aggregation results can be overstated.
3. Coding inaccuracies and gaps: If the data mapping or layout is incorrect, or the rules for processing the data are not accurate, the data processed will have data quality issues, ranging from incorrect calculations to data being assigned to or linked to improper fields, keys, or relationships.
4. Data model inaccuracies: If assumptions within the data model are not supported by the actual data, there will be data quality issues ranging from data loss due to field lengths being exceeded by the actual data, to data being assigned to improper IDs or keys.
5. Field overloading: Re-use of fields over time for different purposes, rather than changing the data model or code, can result in confusing sets of values, unclear meaning, and potentially structural problems, like incorrectly assigned keys.
6. Temporal data mismatches: In the absence of a consolidated data dictionary, multiple systems could implement disparate date formats or timings, which in turn lead to data mismatch and data loss when data synchronization takes place between different source systems.
7. Weak Master Data Management: Immature Master Data Management can lead to choosing unreliable sources for data, which can cause data quality issues that are very difficult to find until the assumption that the data source is accurate is disproved.
8. Data duplication: Unnecessary data duplication is often a result of poor data management. There are two main types of undesirable duplication issues:
Single Source - Multiple Local Instances: For example, instances of the same customer in multiple (similar or identical) tables in the same database. Knowing which instance is the most accurate for use can be difficult without system-specific knowledge.
Multiple Sources - Single Instance: Data instances with multiple authoritative sources or systems of record. For example, single customer instances coming from multiple point-of-sale systems. When processing this data for use, there can be duplicate temporary storage areas. Merge rules determine which source has priority over others when processing into permanent production data areas.
34. One of the difficulties when integrating multiple source systems is: A: maintaining documentation describing the data warehouse operation. B: modifying the source systems to align to the enterprise data model. C: determining valid links or equivalences between data elements. D: completing the data architecture on time for the first release. E: having a data quality rule applicable to all source systems. Correct answer: C. Your answer: E. Explanation: 13.1.3: ② Multiple sources, single instance: data instances with multiple authoritative sources or systems of record, for example a single customer instance coming from multiple point-of-sale systems. When processing such data there can be duplicate temporary storage areas; merge rules determine which source has priority when processing into permanent production data areas.
5. Issues Caused by Fixing Issues
Manual data patches are changes made directly to data in the database, not through the business rules in the application interfaces or processing. These are scripts or manual commands generally created in a hurry and used to 'fix' data in an emergency, such as an intentional injection of bad data, a lapse in security, internal fraud, or an external source of business disruption.
Like any untested code, they carry a high risk of causing further errors through unintended consequences.
These shortcuts are strongly discouraged: they create opportunities for security breaches and can disrupt the business for longer than a proper correction would. All changes should go through a governed change management process.
1.4.9. Data Profiling
Data Profiling is a form of data analysis used to inspect data and assess quality. Data profiling uses statistical techniques to discover the true structure, content, and quality of a collection of data (Olson, 2003). A profiling engine produces statistics that analysts can use to identify patterns in data content and structure. For example (a small profiling sketch appears after this list):
1. Counts of nulls: Identifies whether nulls exist and allows for inspection of whether they are allowable or not
2. Max/Min value: Identifies outliers, like negatives
3. Max/Min length: Identifies outliers or invalids for fields with specific length requirements
4. Frequency distribution of values for individual columns: Enables assessment of reasonability (e.g., distribution of country codes for transactions, inspection of frequently or infrequently occurring values, as well as the percentage of the records populated with defaulted values)
5. Data type and format: Identifies the level of non-conformance to format requirements, as well as identification of unexpected formats (e.g., number of decimals, embedded spaces, sample values)
38. In data integration, the goal of data discovery is to: A: identify key users and perform high level assessment of data quality. B: assign data glossary terms and data formats. C: identify potential sources and assure data recovery processes are compliant. D: assign data glossary terms and canonical models. E: identify potential sources and perform high-level assessment of data quality. Correct answer: E. Your answer: E. Explanation: 13.1.3: Analysts must assess the results of the profiling engine to determine whether data conforms to rules and other requirements. A good analyst uses profiling results to confirm known relationships and uncover hidden characteristics and patterns within and between data sets, including business rules and validity constraints. Profiling is usually performed as part of data discovery in projects (especially data integration projects) or to assess the current state of data targeted for improvement. Profiling results can be used to identify opportunities to improve the quality of both data and Metadata.
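A minimal illustration of the kinds of statistics a profiling engine produces (assuming Python with pandas; the column and its values are invented):

```python
# Minimal profiling sketch (pandas assumed). The statistics mirror the list
# above; the column and its values are invented.
import pandas as pd

col = pd.Series(["US", "GB", None, "US", "USA", "US", None], name="country_code")

profile = {
    "null_count": int(col.isna().sum()),                  # counts of nulls
    "min_length": int(col.dropna().str.len().min()),      # max/min length
    "max_length": int(col.dropna().str.len().max()),
    "frequency": col.value_counts().to_dict(),            # value distribution
}
print(profile)
# {'null_count': 2, 'min_length': 2, 'max_length': 3, 'frequency': {'US': 3, ...}}
```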
1.4.10. Data Quality and Data Processing
1. While the focus of data quality improvement efforts is often on the prevention of errors, data quality can also be improved through some forms of data processing.
2. Data Cleansing
Data Cleansing or Scrubbing transforms data to make it conform to data standards and domain rules. Cleansing includes detecting and correcting data errors to bring the quality of data to an acceptable level.
40. The term for transforming data to make it conform to data standards and domain rules is called: A: Data Profiling. B: Data Parsing. C: Data Modeling. D: Data Analysis. E: Data Cleansing. Correct answer: E. Your answer: E. Explanation: 13.1.3: (1) Data cleansing or scrubbing transforms data to make it conform to data standards and domain rules; it includes detecting and correcting data errors to bring the quality of data to an acceptable level.
It costs money and introduces risk to continuously remediate data through cleansing.
In some situations, correcting on an ongoing basis may be necessary, as re-processing the data in a midstream system is cheaper than any other alternative.
3. Data Enhancement
Data enhancement or enrichment is the process of adding attributes to a data set to increase its quality and usability.
39. The term for the process of adding attributes to a data set to increase its quality and usability is called: A: Data Profiling. B: Data Parsing. C: Data Modeling. D: Data Analysis. E: Data Enhancement. Correct answer: E. Your answer: E. Explanation: 13.1.3: (2) Data enhancement or enrichment is the process of adding attributes to a data set to increase its quality and usability.
Some enhancements are gained by integrating data sets internal to an organization. External data can also be purchased to enhance organizational data. Common kinds of enhancement include the following (an enrichment sketch appears after this list):
Time/Date stamps:
One way to improve data is to document the time and date that data items are created, modified, or retired, which can help to track historical data events. If issues are detected with the data, timestamps can be very valuable in root cause analysis, because they enable analysts to isolate the timeframe of the issue.
Audit data:
Auditing can document data lineage, which is important for historical tracking as well as validation.
Reference vocabularies:
Business-specific terminology, ontologies, and glossaries enhance understanding and control while bringing customized business context.
Contextual information:
Adding context such as location, environment, or access methods, and tagging data for review and analysis.
Geographic information:
Geographic information can be enhanced through address standardization and geocoding, which includes regional coding, municipality, neighborhood mapping, latitude / longitude pairs, or other kinds of location-based data.
Demographic information:
Customer data can be enhanced through demographic information, such as age, marital status, gender, income, or ethnic coding. Business entity data can be associated with annual revenue, number of employees, size of occupied space, etc.
Psychographic information:
Data used to segment the target populations by specific behaviors, habits, or preferences, such as product and brand preferences, organization memberships, leisure activities, commuting transportation style, shopping time preferences, etc.
Valuation information:
Use this kind of enhancement for asset valuation, inventory, and sale.
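A small enrichment sketch combining two of the kinds above, demographic attributes and a time/date stamp (assuming Python with pandas; the tables and columns are hypothetical):

```python
# Enrichment sketch: add purchased demographic attributes and an audit
# timestamp to internal customer data. All names and values are hypothetical.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Leeds", "York"]})
purchased_demographics = pd.DataFrame({
    "customer_id": [1, 2],
    "age_band": ["25-34", "45-54"],
    "income_band": ["B", "C"],
})

# Demographic enhancement: attach the external attributes by key.
enriched = customers.merge(purchased_demographics, on="customer_id", how="left")

# Time/date stamp enhancement: record when the row was last modified.
enriched["modified_at"] = pd.Timestamp.now(tz="UTC")
print(enriched)
```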
4. Data Parsing and Formatting
Data Parsing is the process of analyzing data using pre-determined rules to define its content or value.
5. Data Transformation and Standardization
During normal processing, data rules trigger and transform the data into a format that is readable by the target architecture.
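A sketch showing parsing and standardization together (assuming Python; the name pattern and the "LAST, First" target format are invented conventions):

```python
# Parsing and standardization sketch: split a raw name field into components
# with a pre-determined rule, then rewrite it in a standard format. The
# pattern and the "LAST, First" target convention are assumptions.
import re

RAW = "smith,  john"

match = re.match(r"^\s*(?P<last>[A-Za-z'-]+)\s*,\s*(?P<first>[A-Za-z'-]+)\s*$", RAW)
if match:
    first = match.group("first").capitalize()
    last = match.group("last").upper()
    print(f"{last}, {first}")                  # SMITH, John
else:
    print("unparseable; route to manual review")
```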
2. Activities
2.1. Define High Quality Data
2.1.1. Ask a set of questions to understand current state and assess organizational readiness for data quality improvement:
1. What do stakeholders mean by ‘high quality data’?
2. What is the impact of low quality data on business operations and strategy?
3. How will higher quality data enable business strategy?
4. What priorities drive the need for data quality improvement?
5. What is the tolerance for poor quality data?
6. What governance is in place to support data quality improvement?
7. What additional governance structures will be needed?
2.1.2. Getting a comprehensive picture of the current state of data quality in an organization requires approaching the question from different perspectives:
1. An understanding of business strategy and goals
2. Interviews with stakeholders to identify pain points, risks, and business drivers
3. Direct assessment of data, through profiling and other forms of analysis
4. Documentation of data dependencies in business processes
5. Documentation of technical architecture and systems support for business processes
2.2. Define a Data Quality Strategy
2.2.1. Data quality priorities must align with business strategy. Adopting or developing a framework and methodology will help guide both strategy and tactics while providing a means to measure progress and impacts.
2.2.2. A framework should include methods to
1. Understand and prioritize business needs
2. Identify the data critical to meeting business needs
3. Define business rules and data quality standards based on business requirements
4. Assess data against expectations
5. Share findings and get feedback from stakeholders
6. Prioritize and manage issues
7. Identify and prioritize opportunities for improvement
8. Measure, monitor, and report on data quality
9. Manage Metadata produced through data quality processes
10. Integrate data quality controls into business and technical processes
2.3. Identify Critical Data and Business Rules
2.3.1. Not all data is of equal importance. Data Quality Management efforts should focus first on the most important data in the organization:
2.3.2. Data can be prioritized based on factors such as regulatory requirements, financial value, and direct impact on customers.
2.3.3. Often, data quality improvement efforts start with Master Data, which is, by definition, among the most important data in any organization.
1. What quality process should be done first in order to measure information quality? A: Re-engineer data. B: Cleanse data. C: Assess data definitions. D: Measure information costs. E: Database security. Correct answer: C. Your answer: C. Explanation: 13.2 Activities, 13.2.1 Define high quality data: before launching a data quality program, it is useful to understand business needs, define terms, identify organizational pain points, and build consensus about the drivers and priorities for data quality improvement.
2.3.4. Having identified the critical data, Data Quality analysts need to identify business rules that describe or imply expectations about the quality characteristics of data.
2.3.5. Defining data quality rules is challenging because most people are not used to thinking about data in terms of rules.
It may be necessary to get at the rules indirectly, by asking stakeholders about the input and output requirements of a business process
Keep in mind that it is not necessary to know all the rules in order to assess data. Discovery and refinement of rules is an ongoing process
One of the best ways to get at rules is to share results of assessments.
2.4. Perform an Initial Data Quality Assessment
2.4.1. The goal of an initial data quality assessment is to learn about the data in order to define an actionable plan for improvement. It is usually best to start with a small, focused effort – a basic proof of concept – to demonstrate how the improvement process works. Steps include:
1. Define the goals of the assessment; these will drive the work
2. Identify the data to be assessed; focus should be on a small data set, even a single data element, or a specific data quality problem
3. Identify uses of the data and the consumers of the data
4. Identify known risks with the data to be assessed, including the potential impact of data issues on organizational processes
5. Inspect the data based on known and proposed rules
6. Document levels of non-conformance and types of issues
7. Perform additional, in-depth analysis based on initial findings in order to
Quantify findings
Prioritize issues based on business impact
Develop hypotheses about root causes of data issues
8. Meet with Data Stewards, SMEs, and data consumers to confirm issues and priorities
9. Use findings as a foundation for planning
Remediation of issues, ideally at their root causes
Controls and process improvements to prevent issues from recurring
Ongoing controls and reporting
2.5. Identify and Prioritize Potential Improvements
2.5.1. Having proven that the improvement process can work, the next goal is to apply it strategically. Doing so requires identifying and prioritizing potential improvements.
2.5.2. Identification may be accomplished by full-scale data profiling of larger data sets to understand the breadth of existing issues.
2.6. Define Goals for Data Quality Improvement
2.6.1. When issues are found, determine ROI of fixes based on:
1. The criticality (importance ranking) of the data affected
2. Amount of data affected
3. The age of the data
4. Number and type of business processes impacted by the issue
5. Number of customers, clients, vendors, or employees impacted by the issue
6. Risks associated with the issue
7. Costs of remediating root causes
8. Costs of potential work-arounds
2.6.2. In assessing issues, especially those where root causes are identified and technical changes are required, always seek out opportunities to prevent issues from recurring.
2.6.3. Preventing issues generally costs less than correcting them – sometimes orders of magnitude less.
2.7. Develop and Deploy Data Quality Operations
2.7.1. Manage Data Quality Rules
Data quality rules and standards are a critical form of Metadata. To be effective, they need to be managed as Metadata. Rules should be:
1. Documented consistently:
Establish standards and templates for documenting rules so that they have a consistent format and meaning.
2. Defined in terms of Data Quality dimensions:
Consistent application of dimensions will help with the measurement and issue management processes.
3. Tied to business impact:
While data quality dimensions enable understanding of common problems, they are not a goal in and of themselves. Measurements that are not tied to business processes should not be taken.
4. Backed by data analysis:
Data Quality Analysts should not guess at rules. Rules should be tested against actual data.
5. Confirmed by SMEs:
This knowledge comes when subject matter experts confirm or explain the results of data analysis.
6. Accessible to all data consumers:
All data consumers should have access to documented rules. Ensure that consumers have a means to ask questions about and provide feedback on rules.
2.7.2. Measure and Monitor Data Quality
There are two equally important reasons to implement operational data quality measurements:
To inform data consumers about levels of quality
To manage the risk that changes to business or technical processes may introduce data quality problems
4. A metric for data quality accuracy is: A: percent correct. B: percent complete. C: percent unique. D: percent defined. E: all. Correct answer: A. Your answer: A. Explanation: 13.2.7.
Provide continuous monitoring by incorporating control and measurement processes into the information processing flow. Automated monitoring of conformance to data quality rules can be done in-stream or through a batch process. Measurements can be taken at three levels of granularity: the data element value, the data instance or record, or the data set.

19. Data quality measurements can be taken at three levels of granularity. They are: A: Fine data, coarse data, and rough data. B: Departmental data, regional data, and enterprise data. C: Data element value, data instance or record, and data set. D: Person data, location data, and product data. E: Historical data, current data, and future dated data. Correct answer: C. Your answer: C. Explanation: 13.2.7: Provide continuous monitoring by incorporating control and measurement processes into the information processing flow; conformance to data quality rules can be monitored automatically in-stream or in batch, with measurements taken at three levels of granularity: the data element value, the data instance or record, and the data set.
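A sketch of one non-null rule measured at each of the three levels of granularity (assuming Python with pandas; the table and rule are illustrative):

```python
# Sketch: the same non-null rule measured at the three levels of granularity.
# Table and rule are illustrative; pandas assumed.
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "email": ["a@x.com", None, "c@z.com"]})

# Data element value: does this single value conform?
element_ok = pd.notna(df.loc[1, "email"])

# Data instance or record: do all checked fields in one record conform?
record_ok = df.loc[1, ["id", "email"]].notna().all()

# Data set: what fraction of records conforms overall?
dataset_score = df["email"].notna().mean()

print(element_ok, record_ok, f"{dataset_score:.0%}")   # False False 67%
```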
2.7.3. Develop Operational Procedures for Managing Data Issues
1. Diagnosing issues
The objective is to review the symptoms of the data quality incident, trace the lineage of the data in question, identify the problem and where it originated, and pinpoint potential root causes of the problem.
The work of root cause analysis requires input from technical and business SMEs; success requires cross-functional collaboration.
14. The objectives of a good issue resolution process include all of the following EXCEPT: A: address problems, not symptoms. B: identify who is causing the problem. C: recognize and respond when issues are identified. D: provide a forum for issues to be raised. E: all. Correct answer: B. Your answer: D. Explanation: 13.1.3: The common approach to data quality improvement shown in Figure 13-3 is a version of the Deming cycle, a problem-solving model known as 'plan-do-check-act'. Improvement comes through a defined set of steps: the condition of data must be measured against standards, and if it does not conform, the root causes of the discrepancy must be identified and remediated.
2. Formulating options for remediation:
1. Addressing non-technical root causes such as lack of training, lack of leadership support, unclear accountability and ownership, etc.
2. Modification of the systems to eliminate technical root causes
3. Developing controls to prevent the issue
4. Introducing additional inspection and monitoring
5. Directly correcting flawed data
6. Taking no action, based on the cost and impact of correction versus the value of the data correction
3. Resolving issues:
Having identified options for resolving the issue, the Data Quality team must confer with the business data owners to determine the best way to resolve the issue. These procedures should detail how the analysts:
1. Assess the relative costs and merits of the alternatives
2. Recommend one of the planned alternatives
3. Provide a plan for developing and implementing the resolution
4. Implement the resolution
4. To support effective tracking:
1. Standardize data quality issues and activities:
Since the terms used to describe data issues may vary across lines of business, it is valuable to define a standard vocabulary for the concepts used.
2. Provide an assignment process for data issues:
Drive the assignment process within the incident tracking system by suggesting those individuals with specific areas of expertise.
3. Manage issue escalation procedures:
Data quality issue handling requires a well-defined system of escalation based on the impact, duration, or urgency of an issue.
4. Manage data quality resolution workflow:
The incident tracking system can support workflow management to track progress with issue diagnosis and resolution.
31. Which of the following is NOT required to effectively track data quality incidents? A: A standard vocabulary for classifying data quality issues. B: A well defined system of escalation based on the impact, duration, or urgency of an issue. C: An effective service level agreement with defined rewards and penalties. D: An operational workflow that ensures effective resolution. E: An assignment process to appropriate individuals and teams. Correct answer: C. Your answer: A. Explanation: 13.2.7: Decisions made during the issue management process should be recorded in an incident tracking system. Well managed, such a system provides valuable insight into the causes and costs of data issues, including descriptions of problems and root causes, remediation options, and the decisions on how each issue was resolved. The system also collects performance data on issue resolution, work assignment, issue volumes and frequency, and the time needed to respond, diagnose, plan a solution, and resolve issues. These metrics offer insight into the effectiveness of the current workflow and into system and resource utilization, and they drive continuous, actionable improvement of data quality control.
32. The steps followed in managing data issues include: A: Read, guess, code release. B: Standardization, Allocation, Assignment, and Correction. C: Escalation, Review, Allocation, and Completion. D: Standardization, Assignment, Escalation, and Completion. E: Standardization, Explanation, Ownership, and Completion. Correct answer: D. Your answer: D. Explanation: 13.2.7 (same passage as question 31).
2.7.4. Establish Data Quality Service Level Agreements
Operational data quality control defined in a data quality SLA includes the following (a threshold-check sketch appears after this list):
1. Data elements covered by the agreement
2. Business impacts associated with data flaws
3. Data quality dimensions associated with each data element
4. Expectations for quality for each data element for each of the identified dimensions in each application or system in the data value chain
5. Methods for measuring against those expectations
6. Acceptability threshold for each measurement
7. Steward(s) to be notified in case the acceptability threshold is not met
8. Timelines and deadlines for expected resolution or remediation of the issue
9. Escalation strategy, and possible rewards and penalties
The data quality SLA also defines the roles and responsibilities associated with performance of operational data quality procedures.
37. A Data Quality Service Level Agreement (SLA) would normally include which of these? A: A breakdown of the costs of data quality improvement. B: An enterprise data model. C: Detailed technical specifications for data transfer. D: A business case for data improvement. E: Respective roles and responsibilities for data quality. Correct answer: E. Your answer: E. Explanation: 13.2.7: 4. Establish data quality service level agreements. A data quality SLA sets the organization's expectations for response and remediation of data quality issues in each system; 9) escalation strategy, and possible rewards and penalties. The data quality SLA also defines the roles and responsibilities associated with performance of operational data quality procedures.
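A sketch of how one SLA entry might be checked operationally (assuming Python; the SLA structure, threshold, steward address, and deadline are all hypothetical):

```python
# Sketch of an operational check against one data quality SLA entry. The SLA
# structure, threshold, steward contact, and deadline are hypothetical.
SLA = {
    "data_element": "customer.email",
    "dimension": "completeness",
    "acceptability_threshold": 0.95,        # minimum acceptable score
    "steward": "dq-stewards@example.com",   # notified on breach
    "resolution_deadline_hours": 48,
}

def evaluate_sla(measured_score: float, sla: dict) -> None:
    """Raise an alert when a measurement falls below the SLA threshold."""
    if measured_score < sla["acceptability_threshold"]:
        # A real system would open an incident and notify the steward.
        print(f"ALERT: {sla['data_element']} {sla['dimension']} at "
              f"{measured_score:.1%}, below {sla['acceptability_threshold']:.0%}; "
              f"notify {sla['steward']}, resolve within "
              f"{sla['resolution_deadline_hours']}h")

evaluate_sla(0.91, SLA)
```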
2.7.5. Develop Data Quality Reporting
1. Data quality scorecard, which provides a high-level view of the scores associated with various metrics, reported to different levels of the organization within established thresholds
13. The following are considerations for a good data quality scorecard for a data governance program EXCEPT: A: data profiling aggregated metrics. B: technical and non-technical metrics. C: different views of the scorecard for different audiences. D: data profiling non-aggregated metrics. E: all. Correct answer: D. Your answer: C. Explanation: 13.2.7: A and D conflict; a high-level view requires aggregated values. 1) Data quality scorecard: provides a high-level view of the scores associated with various metrics, reported to different levels of the organization within established thresholds.
2. Data quality trends, which show over time how the quality of data is measured, and whether trending is up or down
3. SLA metrics, such as whether operational data quality staff diagnose and respond to data quality incidents in a timely manner
4. Data quality issue management, which monitors the status of issues and resolutions
5. Conformance of the Data Quality team to governance policies
6. Conformance of IT and business teams to Data Quality policies
7. Positive effects of improvement projects
2.8. 22. Which of these is NOT a typical activity in Data Quality Management? A: Enterprise data modelling. B: Identifying data problems / issues. C: Creating inspection / monitoring processes. D: Analyzing data quality. E: Defining business requirements / business rules. Correct answer: A. Your answer: A. Explanation: 13.2 Activities: 13.2.1 Define high quality data; 13.2.2 Define a data quality strategy; 13.2.3 Identify critical data and business rules; 13.2.4 Perform an initial data quality assessment; 13.2.5 Identify and prioritize improvements; 13.2.6 Define goals for data quality improvement; 13.2.7 Develop and deploy data quality operations.
3. Tools
3.1. Data Profiling Tools
3.1.1. Data profiling tools produce high-level statistics that enable analysts to identify patterns in data and perform initial assessment of quality characteristics.
3.2. Data Querying Tools
3.2.1. Data Quality team members also need to query data more deeply to answer questions raised by profiling results and find patterns that provide insight into root causes of data issues.
3.3. Modeling and ETL Tools
3.3.1. The tools used to model data and create ETL processes have a direct impact on the quality of data.
3.4. Data Quality Rule Templates
3.4.1. Rule templates allow analysts to capture expectations for data. Templates also help bridge the communications gap between business and technical teams.
3.5. Metadata Repositories
3.5.1. Defining data quality requires Metadata, and definitions of high quality data are a valuable kind of Metadata.
4. Techniques
4.1. Preventive Actions
4.1.1. The best way to create high quality data is to prevent poor quality data from entering an organization. Preventive actions stop known errors from occurring. Inspecting data after it is in production will not improve its quality.
4.1.2. Approaches include the following (an entry-control sketch appears after this list):
Establish data entry controls:
Create data entry rules that prevent invalid or inaccurate data from entering a system.
15. The following are deliverables in a data governance program EXCEPT: A: the metrics for a data quality scorecard. B: data input controls. C: indexing of data management techniques. D: metadata standards. E: data strategy. Correct answer: B. Your answer: C. Explanation: 13.4.1: B is a data quality concern. 1) Establish data entry controls: create data entry rules that prevent invalid or inaccurate data from entering a system.
Train data producers:
Ensure staff in upstream systems understand the impact of their data on downstream users.
Define and enforce rules:
Create a 'data firewall,' which has a table with all the business data quality rules used to check whether the quality of data is good before it is used in an application such as a data warehouse.
Demand high quality data from data suppliers:
Examine an external data provider's processes to check their structures, definitions, data source(s), and data provenance.
Implement Data Governance and Stewardship:
Ensure roles and responsibilities are defined that describe and enforce rules of engagement, decision rights, and accountabilities for effective management of data and information assets (McGilvray, 2008).
Institute formal change control:
Ensure all changes to stored data are defined and tested before being implemented.
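A sketch of a data entry control that rejects invalid input at the point of capture rather than cleansing it downstream (assuming Python; the order-date rule is illustrative):

```python
# Entry-control sketch: reject invalid input when it is captured instead of
# cleansing it downstream. The order-date rule and field are illustrative.
from datetime import date

def enter_order(order_date_iso: str) -> dict:
    """Accept an order only if its date parses and is not in the future."""
    try:
        order_date = date.fromisoformat(order_date_iso)
    except ValueError:
        raise ValueError("rejected at entry: order_date is not a valid ISO date")
    if order_date > date.today():
        raise ValueError("rejected at entry: order_date cannot be in the future")
    return {"order_date": order_date}

try:
    enter_order("2024-02-30")          # February 30th does not exist
except ValueError as err:
    print(err)                         # rejected at entry: order_date is not ...
```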
4.2. Corrective Actions
4.2.1. Data quality issues should be addressed systemically and at their root causes to minimize the costs and risks of corrective actions.
4.2.2. 'Solve the problem where it happens' is the best practice in Data Quality Management.
4.2.3. Perform data correction in three general ways:
Automated correction:
Automated correction techniques include rule-based standardization, normalization, and correction. The modified values are obtained or generated and committed without manual intervention.
Manually-directed correction:
Use automated tools to remediate and correct data but require manual review before committing the corrections to persistent storage.
21. What is manually-directed data quality correction? A: Teams of data correctors supervised by data subject matter experts. B: The automation of all data cleanse and correction routines. C: The use of spreadsheets to manually inspect and correct data. D: Using a data quality improvement manual to guide data cleanse and correction activities. E: The use of automated cleanse / correction tools with results manually checked before committing outputs. Correct answer: E. Your answer: D. Explanation: 13.4.2: Manually-directed correction.
Manual correction:
Sometimes manual correction is the only option in the absence of tools or automation, or if it is determined that the change is better handled through human oversight.
4.3. Quality Check and Audit Code Modules
4.3.1. Create shareable, linkable, and re-usable code modules that execute repeated data quality checks and audit processes that developers can get from a library.
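One way such a shared check library might look (assuming Python with pandas; the registry design and check names are an invented convention, not a DMBOK prescription):

```python
# Sketch of a shareable check library: checks are registered once and reused
# by any pipeline. The registry design and check names are invented.
from typing import Callable
import pandas as pd

CHECKS: dict[str, Callable[[pd.DataFrame], float]] = {}

def register(name: str):
    """Decorator that adds a check function to the shared library."""
    def wrap(fn: Callable[[pd.DataFrame], float]):
        CHECKS[name] = fn
        return fn
    return wrap

@register("email_completeness")
def email_completeness(df: pd.DataFrame) -> float:
    return df["email"].notna().mean()

@register("id_uniqueness")
def id_uniqueness(df: pd.DataFrame) -> float:
    return (~df["id"].duplicated(keep=False)).mean()

df = pd.DataFrame({"id": [1, 1, 3], "email": ["a@x.com", None, "c@z.com"]})
print({name: round(fn(df), 2) for name, fn in CHECKS.items()})
# {'email_completeness': 0.67, 'id_uniqueness': 0.33}
```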
4.4. Effective Data Quality Metrics
4.4.1. A critical component of managing data quality is developing metrics that inform data consumers about quality characteristics that are important to their uses of data.
4.4.2. DQ analysts should account for these characteristics:
1. Measurability:
A data quality metric must be measurable; it needs to be something that can be counted.
2. Business relevance:
While many things are measurable, not all translate into useful metrics. Measurements need to be relevant to data consumers.
3. Acceptability:
Determine whether data meets business expectations based on specified acceptability thresholds.
4. Accountability / Stewardship:
Metrics should be understood and approved by key stakeholders. The business data owner is accountable, while a data steward takes appropriate corrective action.
5. Controllability:
A metric should reflect a controllable aspect of the business. In other words, if the metric is out of range, it should trigger action to improve the data. If there is no way to respond, then the metric is probably not useful.
6. Trending:
Metrics enable an organization to measure data quality improvement over time.
4.5. Statistical Process Control
4.5.1. Statistical Process Control (SPC) is a method to manage processes by analyzing measurements of variation in process inputs, outputs, or steps.
4.5.2. SPC is based on the assumption that when a process with consistent inputs is executed consistently, it will produce consistent outputs.
4.5.3. The primary tool used for SPC is the control chart (Figure 95), which is a time series graph that includes a central line for the average (the measure of central tendency) and depicts calculated upper and lower control limits (variability around a central value). In a stable process, measurement results outside the control limits indicate a special cause.

4.5.4. SPC measures the predictability of process outcomes by identifying variation within a process. Processes have variation of two types:
Common Causes that are inherent in the process 流程内部固有的常见原因
Special Causes that are unpredictable or intermittent. 不可预测或间歇性的特殊原因
4.5.5. SPC is used for control, detection, and improvement. The first step is to measure the process in order to identify and eliminate special causes. This activity establishes the control state of the process. Next is to put in place measurements to detect unexpected variation as soon as it is detectable.
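A minimal sketch of this control-chart logic, with made-up daily error-rate measurements: control limits are first established from an in-control baseline period (the control state described above), then new measurements are checked against them to detect special causes.

```python
# Control-chart sketch: central line = mean of the baseline period,
# control limits = mean +/- 3 standard deviations; points outside the
# limits signal a special cause. All numbers are invented.
import statistics

baseline = [2.1, 1.9, 2.3, 2.0, 2.2, 1.8, 2.1, 2.0]   # in-control history
center = statistics.mean(baseline)
sigma = statistics.stdev(baseline)
ucl, lcl = center + 3 * sigma, center - 3 * sigma       # upper/lower control limits

new_points = [2.2, 2.0, 5.6, 2.1]
for day, value in enumerate(new_points, start=1):
    if not (lcl <= value <= ucl):
        print(f"day {day}: {value} outside [{lcl:.2f}, {ucl:.2f}] -> special cause")
```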
4.6. Root Cause Analysis
4.6.1. A root cause of a problem is a factor that, if eliminated, would remove the problem itself. Root cause analysis is a process of understanding factors that contribute to problems and the ways they contribute. Its purpose is to identify underlying conditions that, if eliminated, would mean problems would disappear.
4.6.2. Common techniques for root cause analysis include Pareto analysis (the 80/20 rule), fishbone diagram analysis, track and trace, process analysis, and the Five Whys (McGilvray, 2008).
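A minimal Pareto-analysis sketch, with invented issue counts: rank causes by frequency and find the few that account for roughly 80% of data quality issues, which then become the priority for root cause elimination.

```python
# Hypothetical Pareto (80/20) analysis of data quality issue causes.
issue_counts = {
    "missing source validation": 420,
    "manual re-keying errors": 310,
    "undocumented code values": 95,
    "late file delivery": 60,
    "duplicate loads": 40,
}

total = sum(issue_counts.values())
cumulative = 0
for cause, count in sorted(issue_counts.items(), key=lambda kv: -kv[1]):
    cumulative += count
    print(f"{cause:28s} {count:4d}  cumulative {100 * cumulative / total:5.1f}%")
    if cumulative / total >= 0.8:
        print("-- the causes above account for ~80% of issues --")
        break
```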
5. Implementation Guidelines
5.1. Most Data Quality program implementations need to plan for:
5.1.1. Metrics on the value of data and the cost of poor quality data:
One way to raise organizational awareness of the need for Data Quality Management is through metrics that describe the value of data and the return on investment from improvements.
5.1.2. Operating model for IT/Business interactions:
Data Custodians from IT understand where and how the data is stored, and so they are well placed to translate definitions of data quality into queries or code that identify specific records that do not comply.
12. Data quality stewardship begins by A:stating that data is poorly defined misinterpreted and inconsistently used B:proving that needed data can't be found in a timely manner C:creating awareness that data quality is a business issue that can't simply be dismissed as an IT problem D:showing the existence of and demonstrating inconsistent data sources on multiple databases with conflicting data E:none Correct answer: C. Your answer: C. Explanation: Effective data management involves a complex set of interrelated processes that enable an organization to use its data to achieve its strategic goals. Data management capabilities include designing data models for all kinds of applications, and storing and accessing data securely.
5.1.3. Changes in how projects are executed:
Project oversight must ensure project funding includes steps related to data quality. It is prudent to make sure issues are identified early and to build data quality expectations upfront in projects.
5.1.4. Changes to business processes:
The Data Quality team needs to be able to assess and recommend changes to non-technical (as well as technical) processes that impact the quality of data.
5.1.5. Funding for remediation and improvement projects:
Data will not fix itself. The costs and benefits of remediation and improvement projects should be documented so that work on improving data can be prioritized.
5.1.6. Funding for Data Quality Operations:
Sustaining data quality requires ongoing operations to monitor data quality, report on findings, and continue to manage issues as they are discovered.
5.2. Readiness Assessment / Risk Assessment
5.2.1. Management commitment to managing data as a strategic asset
5.2.2. The organization’s current understanding of the quality of its data
5.2.3. The actual state of the data: Finding an objective way to describe the condition of data that is causing pain points is the first step to improving the data.
5.2.4. Risks associated with data creation, processing, or use
5.2.5. Cultural and technical readiness for scalable data quality monitoring
5.3. Organization and Cultural Change
5.3.1. The quality of data will not be improved through a collection of tools and concepts, but through a mindset that helps employees and stakeholders to act while always thinking of the quality of data and what the business and their customers need.
5.3.2. All employees must act responsibly and raise data quality issues. Data quality is not just the responsibility of a DQ team or IT group.
5.3.3. Just as the employees need to understand the cost to acquire a new customer or retain an existing customer, they also need to know the organizational costs of poor quality data, as well as the conditions that cause data to be of poor quality.
5.3.4. Employees need to think and act differently if they are to produce better quality data and manage data in ways that ensure quality. This requires training and reinforcement. Training should focus on:
1. Common causes of data problems
2. Relationships within the organization’s data ecosystem and why improving data quality requires anenterprise approach
3. Consequences of poor quality data
4. Necessity for ongoing improvement (why improvement is not a one-time thing)
5. Becoming ‘data-lingual 数据语言化’: able to articulate the impact of data on organizational strategy and success, regulatory reporting, and customer satisfaction
6. 9. Information quality training topics for business information/data stewards should include all of the following EXCEPT A:systems development principles. B:roles and responsibilities. C:data security and privacy D:information value chain E:All Correct answer: A. Your answer: A. Explanation: 13.5.2: Ultimately, if employees are to produce better-quality data and manage data in ways that ensure quality, they need to think and act differently, which requires training and reinforcement. Training should focus on: 1) common causes of data problems; 2) relationships within the organization's data ecosystem and why improving data quality requires an enterprise approach; 3) consequences of poor-quality data; 4) the necessity of ongoing improvement (why improvement is not a one-time thing); 5) becoming 'data-lingual', able to articulate the impact of data on organizational strategy and success, regulatory reporting, and customer satisfaction. Training should also include an introduction to any process changes, and statements about how the changes will improve data quality.
7. 11. Information quality training topics for business management and process owners should include all of the following EXCEPT A:policies and principles B:value chain C:relevance decay. D:responsibilities. E:Leadership Correct answer: C. Your answer: C. Explanation: C (relevance decay) is not a relevant training topic.
6. Data Quality and Data Governance
6.1. Often data quality issues are the reason for establishing enterprise-wide data governance. Governance can connect the Data Quality program with:
6.1.1. Risk and security personnel who can help identify data-related organizational vulnerabilities
6.1.2. Business process engineering and training staff who can help teams implement process improvements
Business and operational data stewards, and data owners who can identify critical data, define standards and quality expectations, and prioritize remediation of data issues
6.2. A Governance Organization can accelerate the work of a Data Quality program by:
1. Setting priorities
2. Identifying and coordinating access to those who should be involved in various data quality-related decisions and activities
3. Developing and maintaining standards for data quality
4. Reporting relevant measurements of enterprise-wide data quality
5. Providing guidance that facilitates staff involvement
6. Establishing communications mechanisms for knowledge-sharing
7. Developing and applying data quality and compliance policies
8. Monitoring and reporting on performance
9. Sharing data quality inspection results to build awareness, identify opportunities for improvements, and build consensus for improvements
10. Resolving variations and conflicts; providing direction
6.3. Data Quality Policy 数据质量制度
6.3.1. Each policy should include:
1. Purpose, scope and applicability of the policy
2. Definitions of terms
3. Responsibilities of the Data Quality program
4. Responsibilities of other stakeholders
5. Reporting
6. Implementation of the policy, including links to risk, preventative measures, compliance, data protection, and data security
6.4. Metrics
1. High-level categories of data quality metrics include:
Return on Investment: Statements on cost of improvement efforts vs. the benefits of improved data quality
Levels of quality: Measurements of the number and percentage of errors or requirement violations within a data set or across data sets
Data Quality trends: Quality improvement over time (i.e., a trend) against thresholds and targets, or quality incidents per period
Data issue management metrics:
Counts of issues by dimensions of data quality
Issues per business function and their statuses (resolved, outstanding, escalated)
Issue by priority and severity
Time to resolve issues
Conformance to service levels 服务一致性水平: Organizational units involved and responsible staff, project interventions for data quality assessments, overall process conformance
Data Quality plan rollout 数据质量计划示意图: As-is and roadmap for expansion
7. Works Cited / Recommended
7.1. 3. Which of the following is NOT considered a possible data content defect type? A:Domain chaos B:Combining accurate data with inaccurate data C:Data defined with a lot of embedded meaning D:Duplicate occurrences E:all Correct answer: C. Your answer: D. Explanation: 13.1.3: In 2013, DAMA UK published a white paper describing six core dimensions of data quality: 1) Completeness: the amount of data stored as a percentage of the data potentially available. 2) Uniqueness: an entity instance (thing) should not be recorded more than once based upon how that thing is identified. 3) Timeliness: the degree to which data represents reality from the required point in time. 4) Validity: data is valid if it conforms to the syntax (format, type, range) of its definition. 5) Accuracy: the degree to which data correctly describes the 'real world' object or event being described. 6) Consistency: the absence of difference when comparing multiple representations of a thing against its definition.
7.2. 16. Which of the following best defines the term disparate data? A:Data that is stored in a data warehouse B:Data that is mapped between two or more applications C:Data that is scattered among multiple databases D:Data that differs in kind,quality or character. E:none Correct answer: D. Your answer: D. Explanation: A, B, and C describe differences in location, not in kind, quality, or character.
7.3. 17. Which of the following is NOT usually a feature of data quality improvement tools? A:Standardization B:Parsing C:Data modelling D:Data profiling E:Transformation Correct answer: C. Your answer: E. Explanation: see 13.3, Data Quality tools.
7.4. 25. The Data Quality Management cycle has four stages. Three are Plan, Monitor, Act. What is the fourth stage? A:improve B:Deploy 部署 C:Prepare D:Reiterate 重申 E:Manage Correct answer: B. Your answer: A.
7.5. 26. A data quality report assesses the coding of deposit transactions. The following variations in the coding are apparent: DEP, Dep, dep, dEp. Which DMBOK knowledge area has been ignored? A:Data Governance B:Metadata Management C:Data Storage and Operation D:Data Quality E:Reference and master data Correct answer: D. Your answer: B.
7.6. 27. Which DMBOK knowledge area is most likely responsible for a high percentage of returned mail? A:Data integration and interoperability B:Reference and master data C:Data Quality D:Data Warehousing and Business Intelligence E:Metadata Management Correct answer: C. Your answer: A.
7.7. 29. The first time an attribute is approaching an established data quality threshold, the following should take place: A:Establish a set of activities to address the emerging data quality issue B:consider reviewing the threshold value C:Press the emergency button situated under your desk D:Check for erroneous data in the affected reports E:Notify the data owner and advise them to establish a team of experts to investigate Correct answer: E. Your answer: A. Explanation: none available.
7.8. 30. Which of the following is not a preventative action for creating high quality data A:Train data producers B:Establish data entry controls C:Institute formal data change control D:Implement data governance and stewardship E:Automated correction algorithms capable of detecting and correcting errors Correct answer: E. Your answer: E. Explanation: E is corrective, not preventive.
7.9. 33. A report displaying birth date contains possible but incorrect values. What is a possible explanation? A:Birth date is populated from a single source system; where the date field is an offset value of 1601 B:Birth date is populated from two source systems; one of which stores marriage date in the birth date field C:Birth date is populated from two source systems; both of which record the birth date in the birth date field D:Birth date is populated from a single source system which does not contain birth date E:Birth date is populated from a single source system; which contains missing values Correct answer: C. Your answer: E.
Chapter 14: Big Data and Data Science 大数据与数据科学

1. Introduction
1.1. Intro

1.1.1. Data Science has existed for a long time; it used to be called ‘applied statistics’. But the capability to explore data patterns has quickly evolved in the twenty-first century with the advent of Big Data and the technologies that support it.
1.1.2. Traditional Business Intelligence provides ‘rear-view mirror’ 后视镜 reporting – analysis of structured data to describe past trends. In some cases, BI patterns are used to predict future behavior, but not with high confidence.
1.1.3. As Big Data has been brought into data warehousing and Business Intelligence environments, Data Science techniques are used to provide a forward-looking (‘windshield’ 挡风玻璃) view of the organization. Predictive capabilities, real-time and model-based, using different types of data sources, offer organizations better insight into where they are heading.
1.1.4. Most data warehouses are based on relational models. Big Data is not generally organized in a relational model. Most data warehousing depends on the concept of ETL (Extract, Transform, and Load). Big Data solutions, like data lakes, depend on the concept of ELT – loading and then transforming.
1. Reference data appears in all these EXCEPT A:a data content management tag list B:a thesaurus 词典 C:ETL code D:a pick list 拣选清单 E:ELT code Correct answer: C. Your answer: E. Explanation: 14.1: Reference data arises in ETL results, not in the ETL code itself. However, to take advantage of Big Data, the way data is managed must change: most data warehouses are based on relational models, while Big Data is generally not organized relationally. Most data warehousing depends on the concept of ETL (Extract, Transform, and Load); Big Data solutions, such as data lakes, depend on the concept of ELT, loading and then transforming.
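A minimal, purely illustrative sketch of the ETL vs. ELT ordering; transform() and the warehouse/lake lists are invented placeholders, not a real API.

```python
# ETL: transform BEFORE loading, so only conformed data is stored.
# ELT: load raw data first, transform later, on demand.
def transform(rows):
    return [{**r, "amount": float(r["amount"])} for r in rows]

def etl(rows, warehouse):
    warehouse.extend(transform(rows))   # only typed, conformed rows land

def elt(rows, lake):
    lake.extend(rows)                   # raw rows land as-is
    return transform(lake)              # transformation is deferred

raw = [{"id": 1, "amount": "9.99"}]
warehouse, lake = [], []
etl(raw, warehouse)
transformed_on_read = elt(raw, lake)
print(warehouse, lake, transformed_on_read)
```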
1.2. Business Drivers
1.2.1. The biggest business driver for developing organizational capabilities around Big Data and Data Science is the desire to find and act on business opportunities that may be discovered through data sets generated through a diversified range of processes
1.2.2. Big Data can stimulate innovation by making more and larger data sets available for exploration. This data can be used to define predictive models that anticipate customer needs and enable personalized presentation of products and services.
1.2.3. Data Science can improve operations. Machine learning algorithms can automate complex time-consuming activities, thus improving organizational efficiency, reducing costs, and mitigating risks.
1.3. Principles
1.3.1. Because of the wide variation in sources and formats, Big Data management will require more discipline than relational data management.
4. Big data management requires: A:more discipline than relational data management B:big ideas with big budgets C:less discipline than relational data management D:a certification in data science E:no discipline at all Correct answer: A. Your answer: C. Explanation: 14: Big Data refers not only to large volumes of data, but also to variety (structured and unstructured; documents, files, audio, video, streaming data) and to the velocity at which data is produced. Those who explore data, develop predictive, machine learning, and prescriptive models and analytical methods, and deploy the results for analysis by stakeholders, are called data scientists.
1.3.2. Organizations should carefully manage Metadata related to Big Data sources in order to have an accurate inventory of data files, their origins, and their value.
1.4. Essential Concepts
1.4.1. Data Science 数据科学
Developing Data Science solutions involves the iterative inclusion of data sources into models that develop insights
Data Science depends on:
1. Rich data sources: 丰富的数据源
Data with the potential to show otherwise invisible patterns in organizational or customer behavior
2. Information alignment and analysis: 信息组织和分析
Techniques to understand data content and combine data sets to hypothesize and test meaningful patterns
3. Information delivery: 信息交付
Running models and mathematical algorithms against the data and producing visualizations and other output to gain insight into behavior
4. Presentation of findings and data insights: 展示发现和数据洞察
Analysis and presentation of findings so that insights can be shared
1.4.2. The Data Science Process 数据科学的过程
The Data Science process follows the scientific method of refining knowledge by making observations, formulating and testing hypotheses, observing results, and formulating general theories that explain results.
Within Data Science, this process takes the form of observing data and creating and evaluating models of behavior:

1. Define Big Data strategy and business needs: Define the requirements that identify desired outcomes with measurable tangible benefits.
2. Choose data sources: Identify gaps in the current data asset base and find data sources to fill those gaps.
3. Acquire and ingest data sources: Obtain data sets and onboard them.
4. Develop Data Science hypotheses and methods: Explore data sources via profiling, visualization, mining, etc.; refine requirements. Define model algorithm inputs, types, or model hypotheses and methods of analysis (i.e., groupings of data found by clustering, etc.).
5. Integrate and align data for analysis: Model feasibility depends in part on the quality of the source data. Leverage trusted and credible sources. Apply appropriate data integration and cleansing techniques to increase quality and usefulness of provisioned data sets.
6. Explore data using models: Apply statistical analysis and machine learning algorithms against the integrated data. Validate, train, and over time, evolve the model. Training entails repeated runs of the model against actual data to verify assumptions and make adjustments, such as identifying outliers. Through this process, requirements will be refined. Initial feasibility metrics guide evolution of the model. New hypotheses may be introduced that require additional data sets, and results of this exploration will shape the future modeling and outputs (even changing the requirements).
7. Deploy and monitor: Those models that produce useful information can be deployed to production for ongoing monitoring of value and effectiveness. Often Data Science projects turn into data warehousing projects where more rigorous development processes are put in place (ETL, DQ, Master Data, etc.).
1.4.3. Big Data 大数据

Early efforts to define the meaning of Big Data characterized it in terms of the Three V’s: Volume, Velocity, Variety (Laney, 2001). As more organizations start to leverage the potential of Big Data, the list of V’s has expanded:
1. Volume 量大: Refers to the amount of data. Big Data often has thousands of entities or elements in billions of records.
2. Velocity 更新频率大: Refers to the speed at which data is captured, generated, or shared. Big Data is often generated and can also be distributed and even analyzed in real-time.
3. Variety / Variability 多样/可变: Refers to the forms in which data is captured or delivered. Big Data requires storage of multiple formats; data structure is often inconsistent within or across data sets.
4. Viscosity 粘度大: Refers to how difficult the data is to use or integrate.
5. Volatility 波动性大: Refers to how often data changes occur and therefore how long the data is useful.
6. Veracity 准确性低: Refers to how trustworthy the data is.
1.4.4. Big Data Architecture Components

The difference between ETL and ELT has significant implications for how data is managed.
1.4.5. Sources of Big Data
1.4.6. Data Lake 数据湖
A data lake is an environment where a vast amount of data of various types and structures can be ingested, stored, assessed, and analyzed.
Data lakes can serve many purposes. For example, providing
1. An environment for Data Scientists to mine and analyze data
2. A central storage area for raw data, with minimal, if any, transformation
3. Alternate storage for detailed historical data warehouse data
4. An online archive for records
5. An environment to ingest 提取 streaming data with automated pattern identification
The risk of a data lake is that it can quickly become a data swamp 数据沼泽 – messy 杂乱, unclean, and inconsistent. In order to establish an inventory of what is in a data lake, it is critical to manage Metadata as the data is ingested.
1.4.7. Services-Based Architecture SBA 基于服务的架构
Services-based architecture (SBA) is emerging as a way to provide immediate (if not completely accurate or complete) data, as well as update a complete, accurate historical data set, using the same source (Abate, Aiken, Burke, 1997). A minimal sketch of the three layers follows the list below.

Batch layer 批处理层: A data lake serves as the batch layer, containing both recent and historical data
Speed layer 加速层: Contains only real-time data
Serving layer 服务层: Provides an interface to join data from the batch and speed layers
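A minimal sketch of how the three layers cooperate, with invented counts: the serving layer answers a query by combining the batch view (complete but slightly stale) with the speed layer (real-time events only).

```python
# Hypothetical SBA / lambda-style query path; all values are invented.
batch_layer = {"orders_total": 1_000_000}   # recomputed periodically from the data lake
speed_layer = {"orders_total": 2_345}       # only events since the last batch run

def serving_layer(metric):
    # join batch and speed views for an immediate, near-complete answer
    return batch_layer.get(metric, 0) + speed_layer.get(metric, 0)

print(serving_layer("orders_total"))   # 1002345
```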
1.4.8. Machine Learning 机器学习
Machine Learning explores the construction and study of learning algorithms. These algorithms fall into three types (a small supervised-learning sketch follows the list):
Supervised learning 监督学习: Based on generalized rules; for example, separating SPAM from non-SPAM email
Unsupervised learning 无监督学习: Based on identifying hidden patterns (i.e., data mining)
Reinforcement learning 强化学习: Based on achieving a goal (e.g., beating an opponent at chess)
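A minimal supervised-learning sketch in plain Python (illustrative only; real work would use an ML library): the "model" learns word frequencies from labelled SPAM/HAM examples, then classifies new mail by which class its words fit better. All training data is invented.

```python
# Tiny supervised spam classifier: labelled examples in, a decision rule out.
from collections import Counter

training = [
    ("win a free prize now", "SPAM"),
    ("free money click now", "SPAM"),
    ("meeting agenda for monday", "HAM"),
    ("project status report attached", "HAM"),
]

counts = {"SPAM": Counter(), "HAM": Counter()}
for text, label in training:          # "training" = counting word frequencies
    counts[label].update(text.split())

def classify(text):
    def score(label):
        total = sum(counts[label].values())
        return sum(counts[label][w] / total for w in text.split())
    return max(("SPAM", "HAM"), key=score)

print(classify("claim your free prize"))   # -> SPAM
print(classify("monday status meeting"))   # -> HAM
```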
The need for transparency – the ability to see how decisions are made – will likely increase as this functionality evolves and is put to use in a wider array of situations.
3. "A machine learning algorithm that incorrectly classifies new data values may have a problem with population imbalances in:“ A:metadata management B:model training data C:big data architecture D:reference data management E:predictive analytics 正确答案:B 你的答案:B 解析:14.1.3:训练数据不平衡影响模型预测效果 8.机器学习机器学习(Machine Learning)探索了学习算法的构建和研究,它可以被视为无监督学习和监督学习方法的结合。无监督学习通常被称为数据挖掘,而监督学习是基于复杂的数学理论,特别是统计学、组合学和运筹学。第三个分支正处于形成过程中,称为强化学习,即没有通过教师的认可就实现了目标优化,如驾驶车辆。通过编程使机器可以快速地从查询中学习并适应不断变化的数据集,从而在大数据中引入一个全新的领域,称为机器学习。运行进程,存储结果,在后续运行中使用这些结果以迭代方式通知进程并优化结果。
1.4.9. Sentiment Analysis 情感分析
Media monitoring and text analysis are automated methods for retrieving insights from large unstructured or semi-structured data, such as transaction data, social media, blogs, and web news sites. This is used to understand what people say and feel about brands, products, or services, or other types of topics.
Using Natural Language Processing (NLP) or by parsing phrases or sentences, semantic analysis can detect sentiment and also reveal changes in sentiment to predict possible scenarios.
7. Sentiment analysis of call center voice files is performed by text analysis and stored in a relational database. Which of the following is true? A:The voice files are unstructured data and the sentiment analysis is structured data B:They are both structured data C:The voice files are structured data and the sentiment analysis is unstructured data D:Structured and unstructured data are the same thing E:They are both unstructured data Correct answer: A. Your answer: A. Explanation: 14.1.3 Operational Analytics: Data integration and interoperability are core to the emerging field of Big Data management. Big Data aims to integrate various types of data, including structured data stored in databases, unstructured text data stored in documents or files, and other types of unstructured data such as audio, video, and streaming media. Once combined, this data can be mined, used to develop predictive models, and deployed in operational intelligence activities.
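A minimal lexicon-based sentiment sketch (illustrative only; production systems use NLP libraries and trained models). The word lists and posts are invented.

```python
# Score text by counting words from hypothetical positive/negative lexicons.
POSITIVE = {"love", "great", "excellent", "happy", "fast"}
NEGATIVE = {"hate", "slow", "broken", "terrible", "refund"}

def sentiment(text):
    words = text.lower().replace(",", " ").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

for post in ["I love this product, great support",
             "Terrible experience, want a refund"]:
    print(sentiment(post), "->", post)
```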
1.4.10. Data and Text Mining 数据和文本挖掘
Data mining is a particular kind of analysis that reveals patterns in data using various algorithms.
6. You need to discover possible relationships or to show data patterns in an exploratory fashion when you do not necessarily have a specific question to ask. What kind of data tool would you use to identify patterns of data using various algorithms? A:Meta-Data Data Lineage view B:Data Mining C:ETL Job D:Data Visualization Application E:Data Quality profile Correct answer: B. Your answer: B. Explanation: 14.1.3: 10. Data and text mining: Data mining is a particular kind of analysis that reveals patterns in data using various algorithms. It began as a branch of machine learning, a subfield of artificial intelligence.
Text mining analyzes documents with text analysis and data mining techniques to classify content automatically into workflow-guided and SME-directed ontologies.
Data and text mining use a range of techniques, including the following (a clustering sketch follows the list):
Profiling:
Profiling attempts to characterize the typical behavior of an individual, group, or population. Profiling is used to establish behavioral norms for anomaly detection applications, such as fraud detection and monitoring for intrusions to computer systems. Profile results are inputs for many unsupervised learning components.
Data reduction: 数据缩减
Data reduction replaces a large data set with a smaller set of data that contains much of the important information in the larger set. The smaller data set may be easier to analyze or process.
Association: 数据关联
Association is an unsupervised learning process to find relationships between studied elements based on transactions involving them. Examples of association include: frequent itemset mining, rule discovery, and market basket analysis. Recommendation systems on the internet use this process as well.
Clustering: 数据聚类
Clustering groups elements in a study together by their shared characteristics. Customer segmentation is an example of clustering.
Self-organizing maps: 自组织映射
Self-organizing maps are a neural network method of cluster analysis. Sometimes referred to as Kohonen Maps, or topologically ordered maps, they aim to reduce the dimensionality in the evaluation space while preserving distance and proximity relationships as much as possible, akin to multi-dimensional scaling. Reducing the dimensionality is like removing one variable from the equation without violating the outcome. This makes it easier to solve and visualize.
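As an illustration of clustering, here is a minimal k-means sketch in plain Python that segments invented customers by (age, annual spend); production work would use a library such as scikit-learn.

```python
# Naive k-means for customer segmentation; data and k=2 are invented.
import math

customers = [(25, 300), (27, 350), (24, 280), (55, 4000), (60, 4200), (58, 3900)]

def kmeans(points, k, iterations=10):
    centroids = points[:k]                       # naive initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [tuple(sum(dim) / len(dim) for dim in zip(*cl)) if cl
                     else centroids[i] for i, cl in enumerate(clusters)]
    return clusters

for segment in kmeans(customers, k=2):
    print(segment)   # young low-spend group vs. older high-spend group
```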
1.4.11. Predictive Analytics 预测分析
Predictive Analytics is the sub-field of supervised learning where users attempt to model data elements and predict future outcomes through evaluation of probability estimates
5. An application that attempts to predict future outcomes through probability estimates is called: A:dimensional analytics B:predictive analytics C:just-in-time reporting D:descriptive analytics E:reactive analytics Correct answer: B. Your answer: B. Explanation: 14.1.3: 11. Predictive Analytics is the sub-field of supervised learning in which users attempt to model data elements and predict future outcomes through evaluation of probability estimates. Deeply rooted in mathematics, especially statistics, it shares many components with unsupervised learning, with the measured deviation from expected prediction results kept under control.
1.4.12. Prescriptive Analytics 规范分析
Prescriptive analytics take predictive analytics a step farther to define actions that will affect outcomes, rather than just predicting the outcomes from actions that have occurred. Prescriptive analytics anticipates what will happen, when it will happen, and implies why it will happen
Because prescriptive analytics can show the implications of various decisions, it can suggest how to take advantage of an opportunity or avoid a risk. Prescriptive analytics can continually take in new data to re-predict and re-prescribe. This process can improve prediction accuracy and result in better prescriptions.
Descriptive analytics - Diagnostic analytics - Predictive analytics - Prescriptive analytics
1.4.13. Unstructured Data Analytics
Unstructured data analytics combines text mining, association, clustering, and other unsupervised learning techniques to codify large data sets. Supervised learning techniques can also be applied to provide orientation, oversight, and guidance in the coding process leveraging human intervention to resolve ambiguity when necessary.
1.4.14. Operational Analytics 运营分析
The concept of operational analytics (also known as operational BI or streaming analytics) has emerged from the integration of real-time analytics into operations. Operational analytics includes activities like user segmentation, sentiment analysis, geocoding, and other techniques applied to data sets for marketing campaign analysis, sales penetration, product adoption, asset optimization, and risk management.
1.4.15. Data Visualization
Visualization is the process of interpreting concepts, ideas, and facts by using pictures or graphical representations. Data visualization facilitates understanding of the underlying data by presenting it in a visual summary, such as a chart or graph. Data visualizations condense and encapsulate characteristics of data, making them easier to see. In doing so, they can surface opportunities, identify risks, or highlight messages.
Traditional BI tools include visualization options such as tables, pie charts, line charts, area charts, bar charts, histograms, and turnkey boxes (candlesticks).
1.4.16. Data Mashups 数据混搭
Mashups combine data and services to create visualization for insight or analysis.
2. Activities
2.1. Define Big Data Strategy and Business Needs
2.1.1. An organization’s Big Data strategy needs to be aligned with and support its overall business strategy and business requirements and be part of its data strategy. A Big Data strategy must include criteria to evaluate:
1. What problems the organization is trying to solve; what it needs analytics for
2. What data sources to use or acquire
3. The timeliness and scope of the data to provision
4. The impact on and relation to other data structures
5. Influences to existing modeled data
2.2. Choose Data Sources
2.2.1. Big Data environments make it possible to quickly ingest lots of data, but to use that data and manage it over time, it is still necessary to know basic facts:
Its origin
Its format
What the data elements represent
How it connects to other data
How frequently it will be updated
2.2.2. data needs to be evaluated for worth and reliability. Review the available data sources, and the processes that create those sources and manage the plan for new sources.
Foundational data 基础数据: Consider foundational data components such as POS (Point of Sale) in a sales analysis.
Granularity 粒度: Ideally, obtain data in its most granular form (not aggregated). That way it can be aggregated for a range of purposes.
Consistency: If possible, select data that will appear appropriately and consistently across visualizations, or recognize limitations.
Reliability 可靠性: Choose data sources that are meaningful and credible over time. Use trusted, authoritative sources.
Inspect/profile new sources 检查分析新数据源: Test changes before adding new data sets. Unexpected material or significant changes in visualization outcomes can occur with the inclusion of new data sources.
2.2.3. Risks associated with data sources include privacy concerns.
2.2.4. Criteria used to select or filter data also pose a risk. These criteria should be objectively managed to avoid biases or skews 偏见和偏差
2.3. Acquire and Ingest Data Sources
2.3.1. Before integrating the data, assess its quality.
2.4. Develop Data Hypotheses and Methods
2.4.1. Models depend on both the quality of input data and the soundness of the model itself.
2.5. Integrate / Align Data for Analysis
2.5.1. In many cases, joining data sources is more an art than a science.
One method is to use a common model that integrates the data using a common key.
Another way is to scan and join data using indexes within the database engines for similarity and record linkage algorithms and methods.
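A minimal sketch of both approaches, with invented records: an exact join on a shared key, plus similarity-based record linkage (here via difflib string similarity, with an arbitrary cutoff) for rows that lack the key.

```python
# Hypothetical CRM/billing records; cust_id is the common key.
from difflib import SequenceMatcher

crm = [{"cust_id": "C1", "name": "Acme Corporation"}]
billing = [{"cust_id": "C1", "balance": 120.0},       # has the common key
           {"name": "ACME Corp.", "balance": 75.0}]   # no key: link by name

# 1) Common-key join
by_key = {r["cust_id"]: r for r in billing if "cust_id" in r}
for c in crm:
    if c["cust_id"] in by_key:
        print("key join:", c["name"], by_key[c["cust_id"]]["balance"])

# 2) Similarity / record linkage for rows lacking the key
def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for b in (r for r in billing if "cust_id" not in r):
    for c in crm:
        if similar(c["name"], b["name"]) > 0.6:   # cutoff tuned per data set
            print("linked by similarity:", c["name"], b["balance"])
```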
2.6. Explore Data Using Models
2.6.1. Populate Predictive Model 填充预测
Configuring predictive models includes pre-populating the model with historical information concerning the customer, market, products, or other factors that are included in the model other than the triggering factor.
2.6.2. Train the Model 训练
Execute the model against the data in order to ‘train’ the model. Training includes repeated runs of the model against the data to verify assumptions. Training will result in changes to the model. Training requires balance; avoid over-fitting by training against a limited data fold.
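A minimal sketch of this training discipline, with made-up data: hold out part of the data so the model is evaluated on records it never saw, the standard guard against over-fitting.

```python
# Fit y ≈ a*x on a training split, then evaluate only on the held-out split.
import random

random.seed(42)
data = [(x, 2 * x + random.uniform(-1, 1)) for x in range(100)]
random.shuffle(data)
train, test = data[:80], data[80:]   # 80/20 holdout split

a = sum(x * y for x, y in train) / sum(x * x for x, y in train)  # "training"

mse = sum((y - a * x) ** 2 for x, y in test) / len(test)         # unseen data only
print(f"learned slope = {a:.3f}, test MSE = {mse:.3f}")
```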
2.6.3. Evaluate Model 评估
There is an ethical component to practicing Data Science and it needs to be applied when evaluating models.
Models can have unexpected results or unintentionally reflect the assumptions and biases of the people who create them. Ethical training should be required for all artificial intelligence (AI) practitioners.
2.6.4. Create Data Visualizations
Data visualization based on the model must meet the specific needs related to the purpose of the model.
Select the appropriate visual to fulfill that purpose.
Ensure that the visualization addresses its intended audience
Adjust the layout and complexity to highlight and simplify accordingly
Visualizations should tell a story.
2.7. Deploy and Monitor
2.7.1. Expose Insights and Findings 揭示洞察和发现
2.7.2. Iterate with Additional Data Sources 使用附加数据源迭代
3. Tools
3.1. MPP Shared-nothing Technologies and Architecture 大规模并行处理的无共享数据库技术和架构
3.1.1. Massively Parallel Processing (MPP) Shared-nothing Database technologies have become the standard platform for Data Science-oriented analysis of Big Data sets. In MPP databases, data is partitioned (logically distributed) across multiple processing servers (computational nodes), with each server having its own dedicated memory to process data locally. Communication between processing servers is usually controlled by a master host and occurs over a network interconnect. There is no disk sharing or memory contention, hence the name, ‘shared-nothing’.

3.1.2. 2. A distributed data warehouse A:is a logical concept only B:is a number of data warehouses. C:is a single database containing subsets of the data D:has components linked in different physical databases E:All Correct answer: D. Your answer: D. Explanation: 14.3.1: Massively Parallel Processing (MPP) Shared-nothing Database technology has become the standard platform for Data Science-oriented analysis of Big Data sets. In an MPP database, data is partitioned (logically distributed) across multiple processing servers (computational nodes), each with its own dedicated memory to process data locally. Communication between processing servers is usually controlled by a master node and occurs over a network interconnect. Because the architecture shares no disks and has no memory contention, it is called ‘shared-nothing’.
3.2. Distributed File-based Databases 基于分布式文件的数据库
3.2.1. Distributed file-based solution technologies, such as open source Hadoop, are an inexpensive way to store large amounts of data in different formats.
3.2.2. The processing model used in file-based solutions is called MapReduce. It has three main steps (a word-count sketch follows the list):
1. Map 映射: Identify and obtain the data to be analyzed
2. Shuffle 洗牌: Combine the data according to the analytical patterns desired
3. Reduce 归并: Remove duplication or perform aggregation in order to reduce the size of the resulting dataset to only what is required
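A minimal word-count sketch of these three steps in plain Python (real jobs run distributed across a Hadoop cluster; the documents are invented).

```python
from itertools import groupby

docs = ["big data big value", "data lake data swamp"]

# 1. Map: emit (key, 1) pairs from the raw input
mapped = [(word, 1) for doc in docs for word in doc.split()]

# 2. Shuffle: bring identical keys together (sort, then group)
shuffled = groupby(sorted(mapped), key=lambda kv: kv[0])

# 3. Reduce: aggregate each key's values to shrink the result set
reduced = {word: sum(v for _, v in pairs) for word, pairs in shuffled}
print(reduced)   # {'big': 2, 'data': 3, 'lake': 1, 'swamp': 1, 'value': 1}
```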
3.3. In-database Algorithms 数据库内算法
3.3.1. An in-database algorithm uses the principle that each of the processors in an MPP Shared-nothing platform can run queries independently, so a new form of analytics processing can be accomplished by providing mathematical and statistical functions at the computing node level.
3.4. Big Data Cloud Solutions
3.4.1. There are vendors who provide cloud storage and integration for Big Data, including analytic capabilities.
3.5. Statistical Computing and Graphical Languages 统计计算和图形语言
3.5.1. R is an open source scripting language and environment for statistical computing and graphics. It provides a wide variety of statistical techniques such as linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, and clustering.
3.6. Data Visualization Tools
3.6.1. These tools have several benefits over traditional visualization tools:
Sophisticated analysis and visualization types, such as small multiples, spark lines, heat maps,histograms, waterfall charts, and bullet graphs
Built-in adherence to visualization best practices
Interactivity enabling visual discovery
4. Techniques
4.1. Analytic Modeling 解析建模
4.1.1. Real-time access can resolve many latency issues from batch processing. Apache Mahout is an open source project aimed at creating a machine-learning library. Mahout is positioned to automate Big Data exploration through recommendation mining, document classification, and item clustering. This branch of development efforts bypasses the traditional batch-query MapReduce data access techniques. Leveraging an API interface directly into the storage layer HDFS, a variety of data access techniques can be provided, such as SQL, content streaming, machine learning, and graphics libraries for data visualization.
4.1.2. Analytic models are associated with different depths of analysis:
Descriptive modeling 描述性建模 summarizes or represents the data structures in a compact manner. This approach does not always validate a causal hypothesis or predict outcomes. However, it does use algorithms to define or refine relationships across variables in a way that could provide input to such analysis.
Explanatory modeling 解释性建模 is the application of statistical models to data for testing causal hypotheses about theoretical constructs. While it uses techniques similar to data mining and predictive analytics, its purpose is different. It does not predict outcomes; it seeks to match model results only with existing data.
4.2. Big Data Modeling 大数据建模
4.2.1. Understand how the data links between data sets. For data of different granularity, prevent combinations that count data elements or values more than once; for example, don’t combine atomic and aggregate sets.
5. Implementation Guidelines
5.1. Strategy Alignment
5.1.1. Any Big Data / Data Science program should be strategically aligned with organizational objectives. Establishing a Big Data strategy drives activities related to user community, data security, Metadata management, including lineage, and Data Quality Management.
5.1.2. Strategy deliverables should account for managing:
1. Information lifecycle
2. Metadata
3. Data quality
4. Data acquisition
5. Data access and security
6. Data governance
7. Data privacy
8. Learning and adoption
9. Operations
5.2. Readiness Assessment / Risk Assessment
5.2.1. As with any development project, implementation of a Big Data or Data Science initiative should align with real business needs.
5.2.2. Assess organizational readiness in relation to critical success factors:
1. Business relevance: How well do the Big Data / Data Science initiatives and their corresponding use cases align with the company’s business? To succeed, they must strongly enforce a business function or process.
2. Business readiness: Is the business partner prepared for a long-term incremental delivery? Have they committed themselves to establishing centers of excellence to sustain the product in future releases? How broad is the average knowledge or skill gap within the target community and can that be crossed within a single increment?
3. Economic viability 经济可行性: Has the proposed solution considered conservatively the tangible and intangible benefits? Has assessment of ownership costs accounted for the option of buying or leasing items versus building from scratch?
4. Prototype: Can the proposed solution be prototyped for a subset of the end user community for a finite timeframe to demonstrate proposed value? Big bang implementations can cause big dollar impacts, and a proving ground can mitigate these delivery risks.
5.2.3. Likely the most challenging decisions will be around data procurement, platform development, and resourcing.
Many sources exist for digital data stores and not all need to be in-house owned and operated. Some can be procured while others can be leased.
Multiple tools and techniques are on the market; matching to general needs will be a challenge.
Securing staff with specific skills in a timely manner and retaining top talent during an implementation may require consideration of alternatives, including professional services, cloud sourcing, or collaborating.
The time to build in-house talent may well exceed the delivery window.
5.3. Organization and Cultural Change
5.3.1. Business people must be fully engaged in order to realize benefits from the advanced analytics. A communications and education program is required to effect this.
5.3.2. As with DW/BI, a Big Data implementation will bring together a number of key cross-functional roles, including:
Big Data Platform Architect 大数据平台架构师: Hardware, operating systems, filesystems, and services.
Ingestion Architect 数据摄取架构师: Data analysis, systems of record, data modeling, and data mapping. Provides or supports mapping of sources to the Hadoop cluster for query and analysis.
Metadata Specialist 元数据专家: Metadata interfaces, Metadata architecture, and contents.
Analytic Design Lead 分析设计主管: End user analytic design, best practice guidance implementation in related toolsets, and end user result set facilitation.
Data Scientist 数据科学家: Provides architecture and model design consultation based on theoretical knowledge of statistics and computability, delivery on appropriate tools, and technical application to functional requirements.
6. Big Data and Data Science Governance
6.1. Big Data, like other data, requires governance. Sourcing, source analysis, ingestion, enrichment, and publishing processes require business as well as technical controls, addressing such questions as:
6.1.1. Sourcing: What to source, when to source, and what is the best source of data for a particular study
6.1.2. Sharing: What data sharing agreements and contracts to enter into; terms and conditions both inside and outside the organization
6.1.3. Metadata: What the data means on the source side, how to interpret the results on the output side
6.1.4. Enrichment: Whether to enrich the data, how to enrich the data, and the benefits of enriching the data
6.1.5. Access: What to publish, to whom, how, and when
6.2. Visualization Channels Management
6.2.1. A critical success factor in implementing a Data Science approach is the alignment of the appropriate visualization tools to the user community.
6.3. Data Science and Visualization Standards
6.3.1. It is a best practice to establish a community that defines and publishes visualization standards and guidelines and reviews artifacts within a specified delivery method; this is particularly vital for customer- and regulatory-facing content.
6.3.2. Standards may include:
Tools standards by analytic paradigm, user community, subject area
Requests for new data
Data set process standard
Processes for neutral and expert presentation to avoid biased results, and to ensure that all elements included have been handled in a fair and consistent manner, including:
Data inclusion and exclusion
Assumptions in the models
Statistical validity of results
Validity of interpretation of results
Appropriate methods applied
6.4. Data Security
6.4.1. Having a reliable process to secure data is itself an organizational asset
6.5. Metadata
6.5.1. As part of a Big Data initiative, an organization will bring together data sets that were created using different approaches and standards
6.6. Data Quality
6.6.1. Data Quality is a measure of deviation from an expected result: the smaller the difference, the better the data meets expectation, and the higher the quality.
6.6.2. Consider that most mature Big Data organizations scan data input sources using data quality toolsets to understand the information contained within. Most advanced data quality toolsets offer functionality that enables an organization to test assumptions and build knowledge about its data. For example:
Discovery: Where information resides within the data set
Classification: What types of information are present based upon standardized patterns
Profiling: How the data is populated and structured
Mapping: What other data sets can be matched to these values
6.7. Metrics
6.7.1. Metrics are vital to any management process; they not only quantify activity, but can define the variation between what is observed and what is desired
6.7.2. Technical Usage Metrics 技术使用指标
6.7.3. Loading and Scanning Metrics 加载和扫描指标
Loading and scanning metrics define the ingestion rate and interaction with the user community
6.7.4. Learnings and Stories 学习和故事场景
Common measurements include
Counts and accuracy of models and patterns developed
Revenue realization from identified opportunities
Cost reduction from avoiding identified threats
7. Works Cited / Recommended
Chapter 15: Data Management Maturity Assessment 数据管理成熟度评估

1. Introduction
1.1. Intro
1.1.1. The Capability Maturity Model (CMM) describes how characteristics of a process evolve from ad hoc to optimal.
Level 0: Absence of capability
Level 1: Initial or Ad Hoc 初始级: Success depends on the competence of individuals
Level 2: Repeatable 可重复级: Minimum process discipline is in place
Level 3: Defined 已定义级: Standards are set and used
Level 4: Managed: Processes are quantified and controlled
Level 5: Optimized: Process improvement goals are quantified
1.1.2. A Data Management Maturity Assessment (DMMA) can be used to evaluate data management overall, or it can be used to focus on a single Knowledge Area or even a single process
1.2. Business Drivers
1.2.1. Regulation: Regulatory oversight requires minimum levels of maturity in data management.
1.2.2. Data Governance: The data governance function requires a maturity assessment for planning and compliance purposes.
1.2.3. Organizational readiness for process improvement: An organization recognizes a need to improve its practices and begins by assessing its current state. For example, it makes a commitment to manage Master Data and needs to assess its readiness to deploy MDM processes and tools.
1.2.4. Organizational change: An organizational change, such as a merger, presents data management challenges. A DMMA provides input for planning to meet these challenges.
1.2.5. New technology: Advancements in technology offer new ways to manage and use data. The organization wants to understand the likelihood of successful adoption.
1.2.6. Data management issues: There is a need to address data quality issues or other data management challenges, and the organization wants to baseline its current state in order to make better decisions about how to implement change.
1.3. Goals and Principles
1.3.1. The primary goal of a data management capability assessment is to evaluate the current state of critical data management activities in order to plan for improvement.
1.3.2. In meeting its primary goal, a DMMA can have a positive impact on culture. It helps:
1. Educate stakeholders about data management concepts, principles, and practices
2. Clarify stakeholder roles and responsibilities in relation to organizational data
3. Highlight the need to manage data as a critical asset
4. Broaden recognition of data management activities across the organization
5. Contribute to improving the collaboration necessary for effective data governance
1.3.3. A DMMA can equip the organization to develop a cohesive vision that supports overall organizational strategy.
1.3.4. A DMMA enables the organization to clarify priorities, crystalize objectives, and develop an integrated plan for improvement.
1.4. Essential Concepts
1.4.1. Assessment Levels and Characteristics 等级和特点

Level 0: No Capability
Level 1 Initial / Ad Hoc:
Assessment criteria may include the presence of any process controls, such as logging of data quality issues.
Level 2 Repeatable
Assessment criteria might include formal role definition in artifacts like job descriptions, the existence of process documentation, and the capacity to leverage tool sets.
Level 3 Defined
Assessment criteria might include the existence of data management policies, the use of scalable processes, and the consistency of data models and system controls.
Level 4 Managed
Assessment criteria might include metrics related to project success, operational metrics for systems, and data quality metrics.
Level 5: Optimization
Assessment criteria might include change management artifacts and metrics on process improvement
1.4.2. Assessment Criteria 评估标准
When assessing using a model that can be mapped to a DAMA-DMBOK Data Management Knowledge Area, criteria could be formulated based on the categories in the Context Diagram:

Activity
Tools:
Standards
People and resources
1.4.3. Existing DMMA Frameworks
1. DCMM
China
2. CMMI Data Management Maturity Model (DMM)
The CMMI Institute (Capability Maturity Model Integration Institute) has developed the CMMI-DMM (Data Management Maturity Model), which provides assessment criteria for the following data management areas:
Data Management Strategy
Data Governance
Data Quality
Platform and Architecture
Data Operations
Supporting Processes
3. EDM Council DCAM
The Enterprise Data Management Council, an industry advocacy organization for financial services headquartered in the United States, has developed the DCAM (Data Management Capability Assessment Model). The result of a membership-driven effort to get consensus on data management best practices, the DCAM describes 37 capabilities and 115 sub-capabilities associated with the development of a sustainable Data Management program.
4. IBM Data Governance Council Maturity Model
The IBM Data Governance Council Maturity Model was based on input from a council of 55 organizations. Council members collaborated to define a common set of observable and desired behaviors that organizations can use to evaluate and design their own data governance programs. The purpose of the model is to help organizations build consistency and quality control in governance through proven business technologies, collaborative methods, and best practices. The model is organized around four key categories
Outcomes: Data risk management and compliance, value creation
Enablers: Organizational structure and awareness, policy, stewardship
Core disciplines: Data Quality Management, information lifecycle management, information security and privacy
Supporting Disciplines: Data Architecture, classification and Metadata, audit information, logging and reporting
5. Stanford Data Governance Maturity Model
The Stanford Data Governance Maturity Model was developed for use by the University; it was not intended to be an industry standard. Even still, it serves as a solid example of a model that provides guidance and a standard of measurement.
6. Gartner’s Enterprise Information Management Maturity Model
Gartner has published an EIM maturity model, which establishes criteria for evaluating vision, strategy, metrics, governance, roles and responsibilities, lifecycle, and infrastructure.
2. Activities
2.1. The purpose of the evaluation is to expose current strengths and opportunities for improvement – not to solve problems.
2.2. Plan Assessment Activities
2.2.1. Planning for an assessment includes defining the overall approach and communicating with stakeholders before and during the assessment to ensure they are engaged. The assessment itself includes collecting and evaluating inputs and communicating results, recommendations, and action plans.
2.2.2. Define Objectives
Any organization that decides it should assess its data management maturity level is already engaged in the effort to improve its practices. In most cases, such an organization will have identified the drivers for the assessment. These drivers must be clarified in the form of objectives that describe the focus and influence the scope of the assessment. The objectives for the assessment must be clearly understood by executives and the lines of business, who can help ensure alignment with the organization’s strategic direction.
2.2.3. Choose a Framework
Review these frameworks in the context of assumptions about current state and assessment objectives in order to choose one that will inform the organization in meaningful ways. Focus areas of the assessment model can be customized based on organizational focus or scope.
2.2.4. Define Organizational Scope
Localized assessments 局部评估
Enterprise assessments 企业评估
2.2.5. Define Interaction Approach
Information gathering activities may include workshops, interviews, surveys, and artifact reviews.
Employ methods that work well within the organizational culture and minimize the time commitment from participants
In all cases, responses will need to be formalized by having participants rate the assessment criteria
2.2.6. Plan Communications
Communications contribute to the overall success of the assessment and the action items coming out of it. Communication will be directed at participants and other stakeholders.
Findings may impact people’s jobs.
Before the assessment begins, stakeholders should be informed about expectations for the assessment. Communications should describe:
1. The purpose of the DMMA
2. How it will be conducted
3. What their involvement may be
4. The schedule of assessment activities
2.3. Perform Maturity Assessment
2.3.1. Gather Information
The information gathered will include formal ratings of assessment criteria. It may also include input from interviews and focus groups, system analysis and design documentation, data investigation, email strings, procedure manuals, standards, policies, file repositories, approval workflows, various work products, Metadata repositories, data and integration reference architectures, templates, and forms.
2.3.2. Perform the Assessment
The overall rating assignments and interpretation are typically multi-phased. Participants will have different opinions, generating different ratings across the assessment topics.
If stakeholders do not have consensus on current state, it is difficult to have consensus on how to improve the organization.
The refinement generally works as follows:
1. Review results against the rating method and assign a preliminary rating to each work product or activity.
2. Document the supporting evidence.
3. Review with participants to come to consensus on a final rating for each area. If appropriate, use weight modifiers based on the importance of each criterion.
4. Document the interpretation of the rating using the model criteria statements and assessor comments.
5. Develop visualizations to illustrate results of the assessment.
2.4. Interpret Results
2.4.1. When presenting assessment results, start with the meaning of the ratings for the organization. The ratings can be expressed with respect to organizational and cultural drivers as well as business goals, such as customer satisfaction or increased sales.
2.4.2. Report Assessment Results
The assessment report is an input to the enhancement of the Data Management program, either as a whole or by Data Management Knowledge Area. From it, the organization can develop or advance its data management strategy. Strategy should include initiatives that further business goals through improved governance of processes and standards
The assessment report should include:
1. Business drivers for the assessment
2. Overall results of the assessment
3. Ratings by topic with gaps indicated
4. A recommended approach to close gaps
5. Strengths of the organization as observed
6. Risks to progress
7. Investment and outcomes options
8. Governance and metrics to measure progress
9. Resource analysis and potential future utilization
10. Artifacts that can be used or re-used within the organization
2.4.3. Develop Executive Briefings
The assessment team should prepare executive briefings that summarize findings – strengths, gaps, and recommendations – that executives will use as input to decisions about targets, initiatives, and timelines.
The team must tailor the messages to clarify likely impacts and benefits for each executive group.
2.5. Create a Targeted Program for Improvements
2.5.1. Identify Actions and Create a Roadmap
The roadmap will give targets and a pace for change within prioritized work streams, accompanied by an approach for measuring progress.
The roadmap or reference plan should contain:
Sequenced activities to effect improvements in specific data management functions
A timeline for implementing improvement activities
Expected improvements in DMMA ratings once activities have been implemented
Oversight activities, including maturing this oversight over the timeline
2.6. Re-assess Maturity
2.6.1. Re-assessments should be conducted at regular intervals. They are part of the cycle of continuous improvement:
Establish a baseline rating through the first assessment
Define re-assessment parameters, including organizational scope
Repeat DMM assessment as necessary on a published schedule
Track trends relative to the initial baseline
Develop recommendations based on the re-assessment findings
2.6.2. Re-assessment can also re-invigorate or refocus effort. Measurable progress assists in maintaining commitment and enthusiasm across the organization. Changes to regulatory frameworks, internal or external policy, or innovations that could change the approach to governance and strategies are additional reasons to re-assess periodically.
3. Tools
3.1. Data Management Maturity Framework:
3.1.1. The primary tool used in a maturity assessment is the DMM framework itself.
3.2. Communication Plan:
3.2.1. A communication plan includes an engagement model for stakeholders, the type of information to be shared, and the schedule for sharing information.
3.3. Collaboration Tools:
3.3.1. Collaboration tools allow findings from the assessment to be shared. In addition, evidence of data management practices may be found in email, completed templates, and review documents created via standard processes for collaborative design, operations, incident tracking, reviews, and approvals.
3.4. Knowledge Management and Metadata Repositories:
3.4.1. Data standards, policies, methods, agendas, minutes of meetings or decisions, and business and technical artifacts that serve as proof of practice may be managed in these repositories. In some CMMs, lack of such repositories is an indicator of lower maturity in the organization.
3.4.2. Metadata repositories can exist in several constructs, which may not be obvious to the participants. For example, some Business Intelligence applications rely completely on Metadata to compile their views and reports, while not referring to it as a separate, distinct repository.
4. Techniques
4.1. Selecting a DMM Framework
4.1.1. The following criteria should be considered when selecting a DMM framework (a simple scoring sketch follows the list).
1. Accessibility: Practices are stated in non-technical terms that convey the functional essence of the activity.
2. Comprehensiveness 全面性: The framework addresses a broad scope of data management activities and includes business engagement, not merely IT processes.
3. Extensible and flexible: The model is structured to enable enhancement of industry-specific or additional disciplines and can be used either in whole or in part, depending on the needs of the organization.
4. Future progress path built-in: While specific priorities differ from organization to organization, the DMM framework outlines a logical way forward within each of the functions it describes.
5. Industry-agnostic vs. industry-specific 行业不可知论和行业特定论: Some organizations will benefit from an industry-specific approach, others from a more generic framework. Any DMM framework should also adhere to data management best practices that cross verticals.
6. Level of abstraction or detail: Practices and evaluation criteria are expressed at a sufficient level of detail to ensure that they can be related to the organization and the work it performs.
7. Non-prescriptive: The framework describes what needs to be performed, not how it must be performed.
8. Organized by topic: The framework places data management activities in their appropriate context, enabling each to be evaluated separately, while recognizing the dependencies.
9. Repeatable: The framework can be consistently interpreted, supporting repeatable results to compare an organization against others in its industry and to track progress over time.
10. Supported by a neutral, independent organization: The model should be vendor neutral in order to avoid conflicts of interest, and widely available to ensure a broad representation of best practices.
11. Technology neutral: The focus of the model should be on practices, rather than tools.
12. Training support included: The model is supported by comprehensive training to enable professionals to master the framework and optimize its use.
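One way to make the comparison concrete, not prescribed by DMBOK, is a simple weighted scoring of candidate frameworks against these criteria. The weights, framework names, and scores below are all invented for illustration:

```python
# Hypothetical weighted scoring of candidate DMM frameworks against the
# selection criteria above. Weights (importance) and 1-5 scores are invented.

criteria_weights = {"accessibility": 2, "comprehensiveness": 3, "extensibility": 2,
                    "non_prescriptive": 1, "repeatability": 3, "vendor_neutrality": 2}

candidates = {
    "Framework A": {"accessibility": 4, "comprehensiveness": 5, "extensibility": 3,
                    "non_prescriptive": 4, "repeatability": 4, "vendor_neutrality": 5},
    "Framework B": {"accessibility": 5, "comprehensiveness": 3, "extensibility": 4,
                    "non_prescriptive": 3, "repeatability": 3, "vendor_neutrality": 4},
}

for name, scores in candidates.items():
    total = sum(criteria_weights[c] * scores[c] for c in criteria_weights)
    print(f"{name}: weighted score {total}")
```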
4.2. DAMA-DMBOK Framework Use
4.2.1. The DAMA-DMBOK can be used to prepare for or establish criteria for a DMMA.
5. Guidelines for a DMMA
5.1. Readiness Assessment / Risk Assessment
5.1.1. Before conducting a maturity assessment, it is helpful to identify potential risks and some risk mitigation strategies.

5.2. Organizational and Cultural Change
5.2.1. Organizational and cultural transformation begins with acknowledging that things can be better.
5.2.2. DMMA results can coalesce differing perspectives, result in a shared vision, and accelerate an organization’s progress
6. Maturity Management Governance
6.1. DMMA Process Oversight 过程监督
6.1.1. Oversight for the DMMA process belongs to the Data Governance team. If formal Data Governance is not in place, then oversight defaults to the steering committee or management layer that initiated the DMMA.
6.1.2. The breadth and depth of oversight depend on the DMMA’s scope. Each function involved in the process has a voice in the execution, method, results, and roadmaps coming from the overall assessment.
6.2. Metrics
6.2.1. Sample metrics could include:
DMMA ratings:
DMMA ratings present a snapshot of the organization's capability level. The ratings may be accompanied by a description, perhaps a custom weighting for the rating across an assessment or specific topic area, and a recommended target state.
Resource utilization rates:
Powerful examples of metrics that help express the cost of data management in the form of head count. An example of this type of metric is: “Every resource in the organization spends 10% of their time manually aggregating data.” (A worked cost example follows this list.)
Risk exposure 风险敞口
or the ability to respond to risk scenarios expresses an organization's capabilities relative to their DMMA ratings. For example, if an organization wanted to begin a new business that required a high level of automation but their current operating model is based on manual data management (Level 1), they would be at risk of not delivering.
Spend management
expresses how the cost of data management is allocated across an organization and identifies the impacts of this cost on sustainability and value. These metrics overlap with data governance metrics, such as:
1. Data management sustainability
2. Achievement of initiative goals and objectives
3. Effectiveness of communication
4. Effectiveness of education and training
5. Speed of change adoption
6. Data management value
7. Contributions to business objectives
8. Reductions in risks
9. Improved efficiency in operations
Inputs to the DMMA
are important to manage as they speak to the completeness of coverage, level of investigation, and detail of the scope relevant for interpretation of the scoring results. Core inputs could include the following: count, coverage, availability, number of systems, data volumes, teams involved, etc.
Rate of Change
The rate at which an organization is improving its capability. A baseline is established through the DMMA. Periodic reassessment is used to trend improvement.
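As a rough illustration of the resource utilization metric above, the “10% of their time manually aggregating data” statement can be turned into an annual head-count cost estimate. All figures below are hypothetical:

```python
# Turning the "10% of their time manually aggregating data" example into a
# rough annual cost figure. All numbers are hypothetical.

headcount = 500                  # staff who touch data
avg_fully_loaded_cost = 100_000  # per person per year
manual_aggregation_share = 0.10  # 10% of each person's time

annual_cost = headcount * avg_fully_loaded_cost * manual_aggregation_share
print(f"Estimated annual cost of manual aggregation: ${annual_cost:,.0f}")
# -> Estimated annual cost of manual aggregation: $5,000,000
```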
7. Works Cited / Recommended
7.1. 1. Information lifecycle elements include all of the following EXCEPT: A: create/receive B: distribute/use/dispose C: maintain/preserve D: associate/model E: none. Correct answer: D. Your answer: D. Explanation: D is not a lifecycle process or action; see 15.1.3(3), the IBM Data Governance Council Maturity Model, whose core content covers data quality management, information lifecycle management, and information security and privacy.
Chapter 16: Data Management Organization and Role Expectations 数据管理组织与角色期望
1. Introduction
1.1. The data landscape is quickly evolving and with it, organizations need to evolve the ways they manage and govern data. Data management and data governance organizations must be flexible enough to work effectively in this evolving environment. To do so, they need to clarify basic questions about ownership, collaboration, accountability 责权利, and decision-making.
1.1.1. 4. Roles in Data Governance programs require specification for: A: responsibility, authority and accountability B: accountability, alignment and delegation C: delegation, authority and responsibility D: authority, responsibility E: none. Correct answer: A. Your answer: A. Explanation: responsibility, authority, and accountability.
1.2. There is no perfect organizational structure for either. While common principles should be applied to organizing around data governance and data management, much of the detail will depend on the drivers of that enterprise’s industry and the corporate culture of the enterprise itself.
2. Understand Existing Organization and Cultural Norms
2.1. Awareness, ownership, and accountability are the keys to activating and engaging people in data management initiatives, policies, and processes.
2.2. Before defining any new organization or attempting to improve an existing one, it is important to understand the current state of its component pieces related to culture, the existing operating model, and people 认责模型

3. Data Management Organizational Constructs
3.1. Decentralized Operating Model
3.1.1. In a decentralized model, data management responsibilities are distributed across different lines of business and IT (see Figure 107).

3.2. Network Operating Model
3.2.1. Decentralized informality can be made more formal through a documented series of connections and accountabilities via a RACI (Responsible, Accountable, Consulted, and Informed) matrix. This is called a networked model because it operates as a series of known connections between people and roles and can be diagrammed as a ‘network.’
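As a minimal sketch of how such a matrix might be recorded, the roles, activities, and the “exactly one Accountable per activity” check below are illustrative assumptions, not mandated by DMBOK:

```python
# Minimal sketch of a RACI matrix as a nested dict. Roles and activities are
# illustrative only; a real matrix is defined by the organization. A common
# RACI convention is that each activity has exactly one Accountable role.

raci = {
    "Define data quality rules": {"Data Steward": "R", "Data Owner": "A",
                                  "DBA": "C", "Business Analyst": "I"},
    "Approve data standards":    {"Data Steward": "C", "Data Owner": "A",
                                  "DBA": "I", "Business Analyst": "R"},
}

for activity, assignments in raci.items():
    accountable = [role for role, code in assignments.items() if code == "A"]
    assert len(accountable) == 1, "each activity needs exactly one Accountable role"
    print(activity, "->", assignments)
```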

3.3. Centralized Operating Model
3.3.1. The most formal and mature data management operating model is a centralized one (see Figure 109).
Here everything is owned by the Data Management Organization. Those involved in governing and managing data report directly to a data management leader who is responsible for Governance, Stewardship, Metadata Management, Data Quality Management, Master and Reference Data Management, Data Architecture, Business Analysis, etc.

3.4. Hybrid Operating Model
3.4.1. The hybrid operating model encompasses benefits of both the decentralized and centralized models.

3.5. Federated Operating Model
3.5.1. A variation on the hybrid operating model, the federated model provides additional layers of centralization / decentralization, which are often required in large global enterprises

3.6. Identifying the Best Model for an Organization
3.6.1. The operating model is a starting point for improving data management and data governance practices. Introducing it requires an understanding of how it may impact the current organization and how it will likely need to evolve over time. Since the operating model will serve as the structure through which policies and processes will be defined, approved, and executed, it is critical to identify the best fit for an organization
3.7. DMO Alternatives and Design Considerations
3.7.1. Whichever model is chosen, remember that simplicity and usability are essential for acceptance and sustainability.
3.7.2. Keep these tips in mind when constructing an Operating Model:
1. Determine the starting point by assessing current state
2. Tie the operating model to organization structure
3. Take into account:
Organization Complexity + Maturity
Domain Complexity + Maturity
Scalability
4. Get executive sponsorship – a must for a sustainable model
5. Ensure that any leadership forum (steering committee, advisory council, board) is a decision-making body
6. Consider pilot programs and waves of implementation
7. Focus on high-value, high-impact data domains
8. Use what already exists
9. Never take a One-Size-Fits-All 一刀切 approach
5. A data quality program 质量计划 should limit its scope to: A: the data that changes most often B: all the data stored in the enterprise C: the highest-profile program with the best benefits D: the data that is of interest to the Chief Executive Officer E: the data most critical to the enterprise and its customers. Correct answer: E. Your answer: E. Explanation: 16.6.3: A data quality program can evolve into an operating model similar to that of the overall data management program, although companies of any size rarely centralize the data quality function completely. In most cases, many aspects of data quality are executed within a single line of business or application. Because data quality programs can be decentralized, networked, or hybrid (using a Center of Excellence approach), the data quality operating model can be aligned with that of the overall data management organization, using consistent stakeholders, relationships, accountabilities, standards, processes, and even tools.
4. Critical Success Factors
4.1. Executive Sponsorship
4.1.1. Having the right executive sponsor ensures that stakeholders affected by a Data Management program have visible senior support for the change.
4.2. Clear Vision
4.2.1. A clear vision for the Data Management Organization, along with a plan to drive it, is critical to success.
4.3. Proactive Change Management
4.3.1. Managing the change associated with creating a Data Management Organization requires planning for, managing, and sustaining change.
4.4. Leadership Alignment
4.4.1. Leadership alignment ensures that there is agreement on – and unified support for – the need for a Data Management program and that there is agreement on how success will be defined.
4.5. Communication
4.5.1. Communication should start early and continue openly and often.
4.5.2. 3. How do data management professionals maintain commitment of key stakeholders to the data management initiative? A: Continuous communication, education, and promotion of the importance and value of data and information assets B: Rely on the stakeholder group to be self-sustaining C: Weekly email reports showing metrics on data management progress/lack thereof D: Find and deliver benefits to the stakeholders early in the initiative E: It is not necessary, as the stakeholders signed up at the beginning of the program. Correct answer: A. Your answer: A. Explanation: 16.4 Critical Success Factors: whatever the structure of the data management organization, ten factors have consistently proven key to its success, including 5) continuous communication, 6) stakeholder engagement, and 7) orientation and training.
4.6. Stakeholder Engagement
4.6.1. How the organization engages these stakeholders – how they communicate with, respond to, and involve them – will have a significant impact on the success of the initiative.
4.7. Orientation and Training
4.7.1. Education is essential to making data management happen, although different groups will require different types and levels of education.
4.8. Adoption Measurement
4.8.1. It is important to build metrics around the progress and adoption of the data management guidelines and plan to know that the data management roadmap is working and that it will continue working
4.9. Adherence to Guiding Principles 坚持指导原则
4.9.1. A guiding principle is a statement that articulates shared organizational values, underlies strategic vision and mission, and serves as a basis for integrated decision-making. Guiding principles constitute the rules, constraints, overriding criteria, and behaviors by which an organization abides in its daily activities.
4.10. Evolution Not Revolution 演进而非革命
4.10.1. In all aspects of data management, the philosophy of ‘evolution not revolution’ helps to minimize big changes or large-scale high-risk projects.
5. Build the Data Management Organization
5.1. Identify Current Data Management Participants
5.1.1. When implementing the operating model, start with teams already engaged in data management activities. This will minimize the effect on the organization and will help to ensure that the focus of the team is data, not HR or politics.
5.2. Identify Committee Participants
5.2.1. No matter which operating model an organization chooses, some governance work will need to be done by a Data Governance Steering Committee and by working groups. It is important to get the right people on the Steering Committee and to use their time well.
5.3. Identify and Analyze Stakeholders
5.3.1. A stakeholder is any person or group who can influence or be affected by the Data Management program. Stakeholders can be internal to or external to the organization
5.3.2. A stakeholder analysis will help answer questions like:
Who will be affected by data management?
How will roles and responsibilities shift?
How might those affected respond to the changes?
What issues and concerns will people have?
5.3.3. The analysis will also clarify what needs to be done to bring along critical stakeholders, by identifying (see the sketch after this list):

1. Who controls critical resources
2. Who could block data management initiatives, either directly or indirectly
3. Who could influence other critical constituents
4. How supportive stakeholders are of the upcoming changes
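One simple way to capture these questions is a stakeholder register. The names, fields, and prioritization rule below are assumptions for illustration only:

```python
# A simple stakeholder register capturing the questions above.
# Names, fields, and ratings are illustrative.

stakeholders = [
    {"name": "Head of Finance", "controls_resources": True,
     "can_block": True, "influence": "high", "support": "neutral"},
    {"name": "Marketing Lead", "controls_resources": False,
     "can_block": False, "influence": "medium", "support": "supportive"},
]

# Flag the stakeholders who most need early engagement: anyone who can block
# the initiative, or who is highly influential but not yet supportive.
critical = [s["name"] for s in stakeholders
            if s["can_block"] or (s["influence"] == "high" and s["support"] != "supportive")]
print("Engage first:", critical)
```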
5.4. Involve the Stakeholders
5.4.1. After identifying the stakeholders and a good Executive Sponsor, or a short list from which to choose, it is important to clearly articulate why each of the stakeholders should be involved.
6. Interactions Between the DMO and Other Data-oriented Bodies
6.1. The Chief Data Officer
6.1.1. While most companies recognize at some level that data is a valuable corporate asset, only a few have appointed a Chief Data Officer (CDO) to help bridge the gap between technology and business and evangelize an enterprise-wide data management strategy at a senior level.
6.2. Data Governance
6.2.1. Within a centralized model, the Data Governance Office can report to the Data Management Organization or vice versa. When a Data Management program is focused on establishing policies and guidelines needed to manage data as an asset, the Data Governance Office can act as the lead, and the Data Management Organization reports to (or is matrixed to) the Data Governance Office.
6.3. Data Quality
6.3.1. Data Quality Management is a key capability of a data management practice and organization. Many Data Management Organizations start with a focus on the quality of data because there is a desire to measure and improve the quality of data across the organization.
6.4. Enterprise Architecture
6.4.1. Data Architecture is a key capability of an effective Data Management Organization.
6.5. Managing a Global Organization
6.5.1. As Data Management programs and Organizations become more global, the networked or federated models become more attractive where accountabilities can be aligned, standards can be followed, and regional variations can still be accommodated.
7. Data Management Roles
7.1. Organizational Roles
7.1.1. IT Data Management Organizations provide a range of services from data, application, and technical architecture to database administration.
7.1.2. Business functions focused on data management are most often associated with Data Governance or Enterprise Information Management teams.
7.2. Individual Roles
7.2.1. Executive Roles
Chief Information Officer and Chief Technology Officer are well-established roles in IT.
The concept of Chief Data Officer on the business-side
7.2.2. Business Roles
Business roles focus largely on data governance functions, especially stewardship. Data Stewards are usually recognized subject matter experts who are assigned accountability for Metadata and data quality of business entities, subject areas, or databases.
7.2.3. IT Roles
IT Roles include different types of architects, developers at different levels, database administrators, and a range of supporting functions.
2. In the DAMA-DMBOK, data custodians, including architects, modelers, and DBAs, are responsible for: A: design, storage, archiving and backup of data B: preventing loss of data and/or corruption C: managing IT infrastructure, data storage, disaster recovery and security D: possessing the data and controlling its use E: all. Correct answer: A. Your answer: B. Explanation: 16.7.2: 1) Data Architect: a senior analyst responsible for data architecture and data integration; data architects may work at the enterprise level or within a functional area, typically on data warehouses, data marts, and their associated integration processes. 4) Database Administrator: responsible for the design, implementation, and support of structured data assets, and for the technical means of improving data access performance.
7.2.4. Hybrid Roles
Hybrid roles require a mix of business and technical knowledge. Depending on the organization, people in these roles may report through the IT or business side.
1. Data Development and Database Administration organizational responsibilities overlap in which of the following areas? A: Database optimization B: Training users C: Data architecture modeling D: Normalization 规范化 / denormalization 逆规范化 E: none. Correct answer: D. Your answer: B. Explanation: normalization affects performance and is relevant to both roles.
Data Quality Analyst
Metadata Specialist
Business Intelligence Architect
Business Intelligence Analyst / Administrator
Business Intelligence Program Manager
8. Works Cited / Recommended
Chapter 17: Data Management and Organizational Change Management 数据管理与组织变更管理
1. Introduction
1.1. For most organizations, improving data management practices requires changing how people work together and how they understand the role of data in their organizations, as well as the way they use data and deploy technology to support organizational processes. Successful data management practices require, among other factors:
1.1.1. Learning to manage on the horizontal by aligning accountabilities along the Information Value chain
1.1.2. Changing focus from vertical (silo 筒仓) accountability to shared stewardship of information
1.1.3. Evolving information quality from a niche business concern or the job of the IT department into a core value of the organization
1.1.4. Shifting thinking about information quality from ‘data cleansing and scorecards’ to a more fundamental organizational capability
1.1.5. Implementing processes to measure the cost of poor data management and the value of disciplined data management
1.2. This level of change is not achieved through technology, even though appropriate use of software tools can support delivery.
1.3. It is instead achieved through a careful and structured approach to the management of change in the organization.
1.4. It is important to understand:
1.4.1. Why change fails
1.4.2. The triggers for effective change
1.4.3. The barriers to change
1.4.4. How people experience change
2. Laws of Change 变革法则
2.1. Organizations don’t change, people change: 组织不会变,是人在变
2.1.1. Change does not happen because a new organization is announced or a new system is implemented. It takes place when people behave differently because they recognize the value in doing so. The process of improving data management practices and implementing formal data governance will have far-reaching effects on an organization. People will be asked to change how they work with data and how they interact with each other on activities involving data.
2.2. People don’t resist change. They resist being changed: 人们不会抗拒变革,但抵制被改变
2.2.1. Individuals will not adopt change if they see it as arbitrary or dictatorial. They are more likely to change if they have been engaged in defining the change and if they understand the vision driving the change, as well as when and how change will take place. Part of change management for data initiatives involves working with teams to build organizational understanding of the value of improved data management practices.
2.3. Things are the way they are because they got that way: 事情之所以存在是惯性所致
2.3.1. There may be good historic reasons for things being the way they are. At some point in the past, someone defined the business requirements, defined the process, designed the systems, wrote the policy, or defined the business model that now requires change. Understanding the origins of current data management practices will help the organization avoid past mistakes. If staff members are given a voice in the change, they are more likely to understand new initiatives as improvements.
2.4. Unless there is push to change, things will likely stay the same: 除非有人推动变革,否则很可能止步不前
2.4.1. If you want an improvement, something must be done differently. As Einstein famously said: “You can’t solve a problem with the level of thinking that created it in the first place.”
2.5. Change would be easy if it weren’t for all the people: 如果不考虑人的因素,变革将很容易
2.5.1. The ‘technology’ of change is often easy. The challenge comes in dealing with the natural variation that arises in people.
3. Not Managing a Change: Managing a Transition 并非管理变革:而是管理转型过程
3.1. Bridges’s Transition Phases 布里奇斯的变革转型阶段
3.1.1. Change management expert William Bridges emphasizes the centrality of transition in the change management process. He defines transition as the psychological process that people go through to come to terms with the new situation.

starting with the ending 结束 of the existing state
Endings are difficult because people need to let go of existing conditions
People then enter the Neutral Zone 相持阶段
in which the existing state has not quite ended and the new state has not quite begun
Change is complete when the new state 新开始 is established
3.1.2. He states: “Most organizations try to start with a beginning, rather than finishing with it. They pay no attention to endings. They do not acknowledge the existence of the neutral zone, and then wonder why people have so much difficulty with change” (Bridges, 2009).
2. The first step to overcome organizational resistance is to: A: include all stakeholders B: get executive sponsorship C: develop respected champions D: acknowledge it E: all. Correct answer: D. Your answer: B. Explanation: 17.3: Bridges holds that one of the biggest reasons organizational change fails is that the people driving it rarely think about endings and therefore cannot manage the impact of endings on people. He says: “Most organizations try to start with a beginning, rather than finishing with it. They pay no attention to endings, do not acknowledge the existence of the neutral zone, and then wonder why people have so much difficulty with change.”
3.1.3. Bridges emphasizes that while the first task of the Change Manager is to understand the Destination (or VISION) and how to get there, the ultimate goal of transition management is to convince people that they need to start the journey.

1. The Ending
Help everyone to understand the current problems and why the change is necessary.
Identify who is likely to lose what. Remember that loss of friends and close working colleagues is as important to some as the loss of status and power is to others.
Losses are subjective. The things one person grieves about may mean nothing to someone else. Accept the importance of subjective losses. Don’t argue with others about how they perceive the loss, and don’t be surprised at other people’s reactions to loss.
Expect and accept signs of grieving and acknowledge losses openly and sympathetically.
Define what is over and what is not. People must make the break at some time, and trying to cling on to old ways prolongs difficulties.
Treat the past with respect. People have probably worked extremely hard in what may have been very difficult conditions. Recognize that and show that the work is valued.
Show how ending something ensures the things that matter to people are continued and improved.
Give people information. Then do it again and again and again in a variety of ways – written information to go away and read, as well as the opportunity to talk and ask questions.
Use the stakeholder analysis to map out how best to approach different individuals – understand how their perspectives might need to be engaged to initiate the change and what likely points of resistance might be.
2. The Neutral Zone
Recognize this as a difficult phase (mix of old and new) but that everyone must go through it.
Get people involved and working together; give them time and space to experiment and test new ideas.
Help people to feel that they are still valued.
Praise people with good ideas, even if not every good idea works as expected. The Plan, Do, Study, Act (PDSA) model encourages trying things out, and learning from each cycle.
Give people information; do it again and again and again in a variety of ways.
Provide feedback about the results of the ideas being tested and decisions made.
3. The New Beginning
Do not force a beginning before its time.
Ensure people know what part they are to play in the new system.
Make sure policies, procedures, and priorities are clear; do not send mixed messages.
Plan to celebrate the new beginning and give the credit to those who have made the change.
Give people information; do it again and again in a variety of ways.
4. Kotter's Eight Errors of Change Management
4.1. Error #1: Allowing Too Much Complacency 过于自满
4.1.1. Change Agents often:
1. Overestimate their ability to force big changes on the organization
2. Underestimate how difficult it can be to shift people out of their comfort zones
3. Don’t see how their actions and approach might reinforce the status quo by driving up defensiveness
4. Rush in where angels fear to tread – kicking off change activities without sufficient communication of what change is required or why change is required (the Vision)
5. Confuse urgency with anxiety, which in turn leads to fear and resistance as stakeholders retrench (often quite literally) in their silos
4.2. Error #2: Failing to Create a Sufficiently Powerful Guiding Coalition 未能建立足够强大的指导联盟
4.2.1. Kotter identifies that major change is almost impossible without the active support from the head of the organization and without a coalition of other leaders coming together to guide the change
4.3. Error #3: Underestimating the Power of Vision 低估愿景的力量
4.3.1. A Vision is a Clear and Compelling Statement of where the Change is leading.
4.4. Error #4: Under-communicating the Vision by a Factor of 10, 100, or 1000
4.4.1. Consistent, effective communication of the vision, followed by action, is critical to successful change management.
4.4.2. Kotter advises that communication comes in both words and deeds
4.5. Error #5: Permitting Obstacles to Block the Vision 允许阻挡愿景的障碍存在
4.5.1. the organization must identify and respond to different kinds of roadblocks:
Psychological: 心理障碍
Roadblocks that exist in people’s heads must be addressed based on their causes. Do they stem from fear, lack of knowledge, or some other cause?
Structural: 组织结构
Roadblocks due to organizational structures, such as narrow job categories or performance appraisal systems that force people to choose between the Vision and their own self-interest, must be addressed as part of the change management process. Change management should address structural incentives and disincentives to change.
Active resistance: 积极抵抗
What roadblocks exist due to people who refuse to adapt to the new set of circumstances and who make demands that are inconsistent with the Transformation? If key members of the organization make the right noises about the change vision but fail to alter their behaviors or reward the required behaviors or continue to operate in incompatible ways, the execution of the vision will falter and could fail.
4.6. Error #6: Failing to Create Short-Term Wins
4.6.1. Complex change efforts require short-term goals in support of long-term objectives.
4.7. Error #7: Declaring Victory Too Soon
4.7.1. Kotter suggests that changing an entire company can take between three and ten years.
4.8. Error # 8: Neglecting to Anchor Changes Firmly in the Corporate Culture
4.8.1. The two keys to anchoring the change in the culture of the organization are:
Consciously showing people how specific behaviors and attitudes have influenced performance.
Taking sufficient time to embed the change of approach in the next generation of management
5. Kotter's Eight Stage Process for Major Change

5.1. Establishing a Sense of Urgency 树立紧迫感
5.1.1. Sources of Complacency 自满的根源

5.1.2. Pushing up the Urgency Level
5.1.3. Using Crisis with Care 谨慎使用危机
5.1.4. The Role of Middle and Lower-level Managers
5.1.5. How Much Urgency is Enough?
5.2. The Guiding Coalition
5.2.1. The Importance of Effective Leadership in the Coalition
5.2.2. Building an Effective Team
5.2.3. Combating Group Think 避免群体思维
5.2.4. Common Goals
5.3. Developing a Vision and Strategy
5.3.1. Why Vision is Essential
A good vision serves three important purposes: clarification, motivation, and alignment.
5.3.2. The Nature of an Effective Vision
Imaginable: It conveys a picture of what the future looks like.
Desirable: It appeals to the long-term interests of employees, customers, shareholders, and other stakeholders.
Feasible: It comprises realistic and attainable goals.
Focused: It is clear enough to provide guidance in decision-making.
Flexible: It is general enough to allow individuals to take the initiative and to allow for alternative plans and responses when conditions or constraints change.
Communicable: It is easy to share and to communicate in five minutes or less.
5.3.3. Creating the Effective Vision

5.4. Communicating the Change Vision
5.4.1. Kotter identifies seven key elements in effective communication of vision:
1. Keep it simple: Strip out the jargon, internal vocabulary, and complex sentences.
2. Use metaphor, analogy, and example: A verbal picture (or even a graphical one) can be worth a thousand words.
3. Use multiple forums: The message needs to be communicable across a variety of different forums, from elevator pitch to broadcast memo, from small meeting to an all-hands briefing.
4. Repeat, repeat, repeat: Ideas have to be heard many times before they are internalized and understood.
5. Lead by example: Behavior from important people needs to be consistent with the vision. Inconsistent behavior overwhelms all other forms of communication.
6. Explain seeming inconsistencies: Loose ends and unaddressed disconnects undermine the credibility of all communication.
7. Give and take: Two-way communication is always more powerful than one-way communication.
5.4.2. Keeping it Simple
5.4.3. Use Many Different Forums
5.4.4. Repetition, Repetition, Repetition
5.4.5. Walking the Talk 言行一致
5.4.6. Explaining Inconsistencies
5.4.7. Listen and Be Listened To
6. The Formula for Change
6.1. One of the most famous methods for describing the ‘recipe’ required for effective change, the Gleicher Formula describes the factors that need to be in place to overcome resistance to change in the organization.
6.1.1. C=(D x V x F) > R
6.1.2. According to the Gleicher Formula, Change (C) occurs when the level of dissatisfaction with the status quo (D) is combined with a vision of a better alternative (V) and some actionable first steps to get there (F) and the product of the three is enticing enough to overcome resistance (R) in the organization.
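Read qualitatively, the formula can be sketched as below. The 0-10 scores and the resistance threshold are invented for illustration; the multiplicative form captures the key point that if any one factor is zero, the product is zero and the change will not overcome resistance:

```python
# Qualitative reading of the Gleicher Formula C = (D x V x F) > R.
# Scores (0-10) and the resistance threshold are hypothetical.

def change_likely(dissatisfaction, vision, first_steps, resistance):
    # Change occurs only when the product of the three factors exceeds resistance.
    return (dissatisfaction * vision * first_steps) > resistance

print(change_likely(dissatisfaction=7, vision=6, first_steps=5, resistance=100))  # True: 210 > 100
print(change_likely(dissatisfaction=9, vision=8, first_steps=0, resistance=100))  # False: 0 > 100
```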
7. Diffusion of Innovations and Sustaining Change
7.1. Diffusion of Innovations is a theory that seeks to explain how, why, and at what rate new ideas and technology spread through cultures. It was formulated in 1962 by Everett Rogers (a cumulative tally of the adopter categories follows the list below).

7.1.1. Innovators 创新者 2.5%
7.1.2. Early Adopters 早期使用者 13.5%
7.1.3. Early Majority 早期大众 34%
7.1.4. Late Majority 晚期大众 34%
7.1.5. Laggards 落伍者 16%
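A quick tally of these standard percentages shows the cumulative adoption levels, including the roughly 16% cumulative mark (Innovators plus Early Adopters) that must be crossed to reach the majority:

```python
# Cumulative adoption by Rogers' adopter categories, using the standard
# percentages from the list above.

categories = [("Innovators", 2.5), ("Early Adopters", 13.5),
              ("Early Majority", 34.0), ("Late Majority", 34.0), ("Laggards", 16.0)]

cumulative = 0.0
for name, share in categories:
    cumulative += share
    print(f"{name:<15} {share:>5.1f}%  cumulative {cumulative:>5.1f}%")
# Breaking past the Early Adopters means crossing the 16% cumulative mark.
```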
7.2. The Challenges to be Overcome as Innovations Spread
7.2.1. The first is breaking past the Early Adopter stage.
7.2.2. The second key challenge point is as the innovation moves out of the Late Majority stage into the Laggards stage. The team needs to accept that they cannot necessarily convert 100% of the population to the new way of doing things.
7.3. Key Elements in the Diffusion of Innovation
7.3.1. Four key elements need to be considered when looking at how an innovation spreads through an organization:
Innovation: An idea, practice, or object that is perceived as new by an individual or other unit of adoption
Communication channels: The means by which messages get from one individual to another
Time: The speed at which the innovation is adopted by members of the social system
Social system: The set of interrelated units that are engaged in joint problem solving to accomplish a common goal
7.4. The Five Stages of Adoption

7.5. Factors Affecting Acceptance or Rejection of an Innovation or Change
7.5.1. Trialability 可测试性 refers to how easy it is for the consumer to experiment with the new tool or technology
7.5.2. Observability 可观测性 is the extent that the innovation is visible. Making the innovation visible will drive communication about it through formal and personal networks.
8. Sustaining Change
8.1. Sense of Urgency / Dissatisfaction
8.1.1. It is important to maintain the sense of urgency. The corollary of this is to be alert to emerging areas of dissatisfaction in the organization and how the information management change might help support improvement.
8.2. Framing the Vision
8.2.1. A common mistake is to confuse project scope with change vision. Many projects may be required to achieve the vision.
8.3. The Guiding Coalition
8.3.1. It is important not to confuse project steering groups, which oversee the delivery of specific deliverables, with the coalition that guides and evolves the vision for change in the organization.
8.4. Relative Advantage and Observability
8.4.1. While the specific application or focus of a change initiative might be narrow, in most cases the principles, practices, and tools that are applied may be transferable to other initiatives.
9. Communicating Data Management Value
9.1. Communications Principles
9.1.1. The purpose of any communication is to send a message to a receiver.
9.1.2. These are very important when communicating about data management because many people do not understand the importance of data management to organizational success. An overall communications plan and each individual communication should:
1. Have a clear objective and a desired outcome
2. Consist of key messages to support the desired outcome
3. Be tailored to the audience / stakeholders
4. Be delivered via media that are appropriate to the audience / stakeholders
9.1.3. Data management communications should strive to:
Convey the tangible and intangible value of data management initiatives
Describe how data management capabilities contribute to business strategy and results
Share concrete examples of how data management reduces costs, supports revenue growth, reduces risk, or improves decision quality
Educate people on fundamental data management concepts to increase the base of knowledge about data management within the organization
9.2. Audience Evaluation and Preparation 受众评估和准备
9.2.1. Communications planning should include a stakeholder analysis to help identify audiences for the communications that will be developed. Based on results of the analysis, content can be then tailored to be relevant, meaningful, and at the appropriate level, based on the stakeholder needs.
9.2.2. Tactics for persuading people to act on communications include various ways of getting people to see how their interests align with the goals of the program.
1. Solve problems: Messages should describe how the data management effort will help solve problems pertinent to the needs of the stakeholders being addressed. For example, individual contributors have needs different from executives. IT has needs that are different from those of business people.
2. Address pain points: Different stakeholders will have different pain points. Accounting for these pain points in communications materials will help the audience understand the value of what is being proposed. For example, a compliance stakeholder will be interested in how a Data Management program will reduce risk. A marketing stakeholder will be interested in how the program helps them generate new opportunities.
3. Present changes as improvements: In most cases, introducing data management practices requires that people change how they work. Communications need to motivate people to desire the proposed changes. In other words, they need to recognize changes as improvements from which they will benefit.
4. Have a vision of success: Describing what it will be like to live in the future state enables stakeholders to understand how the program impacts them. Sharing what success looks and feels like can help the audience understand the benefits of the Data Management program.
5. Avoid jargon 避免专业术语: Data management jargon and an emphasis on technical aspects will turn some people off and detract from the message.
6. Share stories and examples: Analogies and stories are effective ways to describe and help people remember the purposes of the Data Management program.
7. Recognize fear as motivation 变恐惧为动力: Some people are motivated by fear. Sharing the consequences of not managing data (e.g., fines, penalties) is a way to imply the value of managing data well. Examples of how the lack of data management practices has negatively affected a business unit will resonate.
9.3. The Human Element
9.3.1. The facts, examples, and stories shared about a Data Management program are not the only things that will influence stakeholder perceptions about its value. People are influenced by their colleagues and leaders. For this reason, communication should use the stakeholder analysis to find where groups have like interests and needs.
9.4. Communication Plan

9.4.1. A communication plan brings planning elements together. A good plan serves as a roadmap to guide the work towards the goals.
9.5. Keep Communicating
9.5.1. Effective planning and ongoing communication will demonstrate the impact that data management practices have had on the organization over time.
9.5.2. As changes happen, communication plans need to be refreshed.
9.5.3. One goal of a communications plan is to remind stakeholders of the value and benefits of the Data Management program. Showing progress and celebrating successes is vital to gaining continued support for the effort.
10. Works Cited / Recommended
10.1. 1. What is the purpose of an information value chain artifact? A: Maps relationships between data and processes B: Provides a high-level orientation view of an enterprise and how it interacts with the outside world C: Shows a list of entities D: Maps the movement of data E: all. Correct answer: A. Your answer: A. Explanation: 1) Align data accountabilities along the information value chain; this is how an organization learns to manage on the horizontal.
10.2. 3. The information value chain matrix shows data-to-value for all of the following EXCEPT: A: information B: innovation C: knowledge D: wisdom E: all. Correct answer: B. Your answer: B. Explanation: the matrix does not show innovation.
10.3. 4. Growth by organizational change management is provided by all of the following EXCEPT a good: A: issues resolution process B: flexible IT systems that can anticipate and react quickly C: communications plan D: scope and priorities management process E: all. Correct answer: B. Your answer: B. Explanation: B is technology support, not organizational change management.