花落无声: December 2009

Thursday, December 31, 2009

Data Warehouse Modeling_8

Common Matrix Mishaps

Departmental or overly encompassing rows.

Report-centric or too narrowly defined rows.

Overly generalized columns.

Separate columns for each level of a hierarchy.

Slowly Changing Dimensions
Type 1: Overwrite the Dimension Attribute

Overwrite the original dimension attribute; only the latest information is kept, so history is lost.

Type 2: Add a New Dimension Row

Add a new row with a new surrogate key, plus effective-date attributes that record the time period for which each row is valid.

Type 3: Add a New Dimension Attribute

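Add a new column that keeps the prior version of the attribute next to the current one, so the user can choose either version of the dimension. Very useful when users want to switch between the old and the new view of the dimension.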
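
The three basic types above can be sketched on a tiny in-memory dimension. This is only an illustration under assumptions: the column names (`surrogate_key`, `natural_key`, `row_start`, `row_end`, `prior_<attr>`), the `HIGH_DATE` end marker, and the sample product row are all made up for the example and are not taken from any particular tool.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed marker meaning "row is still current"

def make_dim():
    """Return a one-row sample dimension (hypothetical product data)."""
    return [{"surrogate_key": 1, "natural_key": "P100", "department": "Education",
             "row_start": date(2009, 1, 1), "row_end": HIGH_DATE}]

def scd_type1(rows, natural_key, attr, new_value):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row[attr] = new_value

def scd_type2(rows, natural_key, attr, new_value, change_date):
    """Type 2: expire the current row and append a new row with a new surrogate key."""
    current = next(r for r in rows
                   if r["natural_key"] == natural_key and r["row_end"] == HIGH_DATE)
    current["row_end"] = change_date
    new_row = dict(current)
    new_row.update({
        "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
        attr: new_value,
        "row_start": change_date,
        "row_end": HIGH_DATE,
    })
    rows.append(new_row)
    return new_row

def scd_type3(rows, natural_key, attr, new_value):
    """Type 3: keep the prior value in a separate 'prior_<attr>' column."""
    for row in rows:
        if row["natural_key"] == natural_key:
            row["prior_" + attr] = row[attr]
            row[attr] = new_value
```

Calling `scd_type2(make_dim(), "P100", "department", "Strategy", date(2009, 12, 31))` leaves two rows for the same product: the expired "Education" row and a current "Strategy" row with surrogate key 2. Type 1 and Type 3 both keep a single row.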

Mini-Dimensions: Add a New Dimension

Break a huge dimension into its stable part and its frequently changing part; the frequently changing attributes move into a separate mini-dimension.

Hybrid Slowly Changing Dimension Techniques

Try to combine the advantages of the basic techniques above. Be careful when using them.

Role-Playing Dimensions

A single dimension that is used in several places in a model, each time in a different role, is called a 'role-playing dimension'.
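
A common way to picture this is one physical date table exposed through one view per role. The sketch below uses Python's built-in `sqlite3` module; the table, view, and column names (`date_dim`, `order_fact`, `order_date_dim`, `ship_date_dim`) and the sample rows are assumptions made up for the example.

```python
import sqlite3

def role_playing_demo():
    """Join one physical date dimension to a fact table in two roles."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT);
        CREATE TABLE order_fact (order_date_key INTEGER,
                                 ship_date_key INTEGER,
                                 amount REAL);
        INSERT INTO date_dim VALUES (20091228, '2009-12-28'), (20091231, '2009-12-31');
        INSERT INTO order_fact VALUES (20091228, 20091231, 99.5);
        -- One physical table, two roles, each exposed as its own view.
        CREATE VIEW order_date_dim AS
            SELECT date_key AS order_date_key, full_date AS order_date FROM date_dim;
        CREATE VIEW ship_date_dim AS
            SELECT date_key AS ship_date_key, full_date AS ship_date FROM date_dim;
    """)
    rows = conn.execute("""
        SELECT o.order_date, s.ship_date, f.amount
        FROM order_fact f
        JOIN order_date_dim o ON o.order_date_key = f.order_date_key
        JOIN ship_date_dim s ON s.ship_date_key = f.ship_date_key
    """).fetchall()
    conn.close()
    return rows
```

Giving each role its own column names keeps the order date and the ship date distinguishable when both appear in the same report.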

Junk Dimensions

When a dimension cannot accommodate legacy flags or text attributes, these are the alternatives:

Leave them in the fact table.

But this is the worst practice.

Make them into separate dimensions.

This may lead to an explosion of dimensions. Kimball says, "we strive to have fewer than twenty foreign keys in a fact table for most industries and business processes."

Eliminate them.

If they are meaningless, drop them.

Note: "junk dimension" is a highly technical term; with users, try a more neutral name such as "invoice indicator".
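
The junk dimension itself packs the leftover low-cardinality flags into one small table, so the fact table carries a single foreign key instead of one per flag. A minimal sketch, assuming three made-up flags (`payment_type`, `is_gift`, `order_mode`) and a made-up key column `junk_key`:

```python
from itertools import product

def build_junk_dimension(flag_domains):
    """Create one row per combination of flag values, each with its own key."""
    names = list(flag_domains)
    rows = []
    combos = product(*(flag_domains[name] for name in names))
    for key, combo in enumerate(combos, start=1):
        row = {"junk_key": key}
        row.update(zip(names, combo))
        rows.append(row)
    return rows

def lookup_junk_key(junk_rows, **flags):
    """Find the key of the row that matches every given flag value."""
    for row in junk_rows:
        if all(row[name] == value for name, value in flags.items()):
            return row["junk_key"]
    raise KeyError(flags)

# Hypothetical flags that would otherwise sit in the fact table.
junk = build_junk_dimension({
    "payment_type": ["cash", "credit"],
    "is_gift": ["Y", "N"],
    "order_mode": ["web", "phone", "store"],
})
```

Here the dimension has 2 x 2 x 3 = 12 rows. Building every combination up front is only reasonable when the product of the flag cardinalities is small; otherwise rows are usually added as combinations are actually seen in the data.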

Snowflaking and Outriggers

A snowflaked dimension is a normalized dimension: its attributes are separated into secondary tables and then linked back to the main dimension table.

An outrigger is useful when most of the dimension records share a handful of common attributes.

Three fundamental types of fact table

Transaction Fact Tables

Periodic Snapshot Fact Tables

Accumulating Snapshot Fact Tables

Factless Fact Tables
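
The fact table types listed above differ mainly in grain. The sketch below rolls transaction-grain rows into a monthly periodic snapshot and counts rows in a factless fact table. The account postings, attendance events, and all column names are hypothetical sample data. One simplification to note: a real periodic snapshot normally has a row for every period, while this one only emits months that contain a transaction.

```python
from collections import Counter, defaultdict

# Transaction fact rows: one row per event.
transactions = [
    {"account": "A1", "date": "2009-11-03", "amount": 100.0},
    {"account": "A1", "date": "2009-11-20", "amount": -30.0},
    {"account": "A1", "date": "2009-12-05", "amount": 50.0},
]

def periodic_snapshot(rows):
    """One entry per (account, month) holding the balance at the last posting."""
    balance = defaultdict(float)
    snapshot = {}
    for row in sorted(rows, key=lambda r: r["date"]):
        balance[row["account"]] += row["amount"]
        month = row["date"][:7]  # 'YYYY-MM'
        snapshot[(row["account"], month)] = balance[row["account"]]
    return snapshot

# Factless fact rows: only dimension keys, no numeric measure.
attendance = [
    ("2009-12-01", "student_1", "course_A"),
    ("2009-12-01", "student_2", "course_A"),
    ("2009-12-02", "student_1", "course_B"),
]

def attendance_by_course(rows):
    """With no measure to sum, the analysis is a count of rows."""
    return Counter(course for _day, _student, course in rows)
```

An accumulating snapshot is not shown; it would hold one row per process instance (for example one order) that is updated as each milestone date is reached.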

Monday, December 28, 2009

Application Architecture Document
• EXECUTIVE SUMMARY
o Business Understanding
o Project Focus
• METHODOLOGY
o Business Requirements
o High-Level Architecture Development
o Standards & Products
o Ongoing Refinement
• BUSINESS REQUIREMENTS AND ARCHITECTURAL IMPLICATIONS
o CRM (Campaign Management & Target Marketing)
o Sales Forecasting
o Inventory Planning
o Sales Performance
o Additional Business Issues
▪ Data Quality
▪ Common Data Elements and Business Definitions
• ARCHITECTURE OVERVIEW
o High Level Model
o Metadata Driven
o Flexible Services Layers
• MAJOR ARCHITECTURAL ELEMENTS
o Services and Functions
▪ ETL Services
▪ Customer Data Integration
▪ External Demographics
▪ Data Access Services
▪ BI Applications (CRM/Campaign Mgmt; Forecasting)
▪ Sales Management Dashboard
▪ Ad Hoc Query and Standard Reporting
▪ Metadata Maintenance
▪ User Maintained Data System
o Data Stores
▪ Sources and Reference Data
▪ ETL Data Staging and Data Quality Support
▪ Presentation Servers
▪ Business Metadata Repository
o Infrastructure and Utilities
o Metadata Strategy
• ARCHITECTURE DEVELOPMENT PROCESS
o Architecture Development Phases
o Architecture Proof of Concept
o Standards and Product Selection
o High Level Enterprise Bus Matrix
APPENDIX A—ARCHITECTURE MODELS

Wednesday, December 23, 2009

Tired

I don't want to get to know anyone or keep in touch with anyone anymore; it all feels too defeating.

Data Warehouse Modeling_6

Extract

• Data profiling (1)
• Change data capture (2)
• Extract system (3)
Clean and Conform
There are five major services in the cleaning and conforming step:
• Data cleansing system (4)
• Error event tracking (5)
• Audit dimension creation (6)
• Deduplicating (7)
• Conforming (8)

Deliver
The delivery subsystems in the ETL back room consist of:
• Slowly changing dimension (SCD) manager (9)
• Surrogate key generator (10)
• Hierarchy manager (11)
• Special dimensions manager (12)
• Fact table builders (13)
• Surrogate key pipeline (14)
• Multi-valued bridge table builder (15)
• Late arriving data handler (16)
• Dimension manager system (17)
• Fact table provider system (18)
• Aggregate builder (19)
• OLAP cube builder (20)
• Data propagation manager (21)
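
As an illustration of one of the delivery subsystems, the surrogate key pipeline (14) swaps the natural keys on incoming fact rows for dimension surrogate keys. The sketch below is only one way to picture it: the lookup dictionaries, column names, and the `-1` "unknown member" key used for natural keys that have no dimension row yet are all assumptions for the example.

```python
def surrogate_key_pipeline(fact_rows, key_maps, unknown_key=-1):
    """Replace natural keys with surrogate keys on each incoming fact row.

    key_maps maps a natural-key column name to a pair
    (surrogate column name, {natural key: surrogate key}).
    """
    out = []
    for row in fact_rows:
        # Carry over measures and any column that is not a natural key.
        new_row = {col: val for col, val in row.items() if col not in key_maps}
        for natural_col, (surrogate_col, lookup) in key_maps.items():
            # Natural keys missing from the dimension get the assumed unknown key.
            new_row[surrogate_col] = lookup.get(row[natural_col], unknown_key)
        out.append(new_row)
    return out

# Hypothetical current-row lookups, as an SCD manager might maintain them.
key_maps = {
    "product_code": ("product_key", {"P100": 1, "P200": 2}),
    "store_code": ("store_key", {"S01": 10}),
}
incoming = [
    {"product_code": "P100", "store_code": "S01", "quantity": 3},
    {"product_code": "P999", "store_code": "S01", "quantity": 1},  # no dimension row yet
]
loaded = surrogate_key_pipeline(incoming, key_maps)
```

Sending unmatched rows to an "unknown" member is one simple policy; the late arriving data handler (16) is where a fuller design would create a placeholder dimension row and correct it later.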

Subsystems 19 and 20 could be done in Cognos.
ETL Management Services

• Job scheduler (22)
• Backup system (23)
• Recovery and restart (24)
• Version control (25)
• Version migration (26)
• Workflow monitor (27)
• Sorting (28)
• Lineage and dependency (29)
• Problem escalation (30)
• Parallelizing and pipelining (31)
• Compliance manager (32)
• Security (33)
• Metadata repository (34)

PROCESS METADATA
• ETL operations statistics including start times, end times, CPU seconds used,
disk reads, disk writes, and row counts.
• Audit results including checksums and other measures of quality and
completeness.
• Quality screen results describing the error conditions, frequencies of
occurrence, and ETL system actions taken (if any) for all quality screening
findings.
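
A small sketch of how a task could record some of these operation statistics. The context manager `etl_task`, the field names, and the task name are invented for the example; disk reads and writes are left out because they need operating-system specific counters.

```python
import time
from contextlib import contextmanager

@contextmanager
def etl_task(name, log):
    """Record start time, end time, CPU seconds, row count, and status for one task."""
    stats = {"task": name, "rows": 0, "start": time.time()}
    cpu_start = time.process_time()
    try:
        yield stats              # the task body increments stats["rows"]
        stats["status"] = "ok"
    except Exception:
        stats["status"] = "failed"
        raise
    finally:
        stats["end"] = time.time()
        stats["cpu_seconds"] = time.process_time() - cpu_start
        log.append(stats)        # in practice this would go to a metadata table

run_log = []
with etl_task("load_customer_dim", run_log) as stats:
    for _record in range(1000):  # stand-in for the real row-by-row load
        stats["rows"] += 1
```

Because the log entry is written in the `finally` clause, a failed task still leaves a record, which is what makes this kind of metadata useful for troubleshooting.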

TECHNICAL METADATA

• System inventory including version numbers describing all the software
required to assemble the complete ETL system.
• Source descriptions of all data sources, including record layouts, column
definitions, and business rules.
• Source access methods including rights, privileges, and legal limitations.
• ETL data store specifications and DDL scripts for all ETL tables, including
normalized schemas, dimensional schemas, aggregates, stand-alone relational
tables, persistent XML files, and flat files.
• ETL data store policies and procedures including retention, backup, archive,
recovery, ownership, and security settings.
• ETL job logic, extract and transforms including all data flow logic
embedded in the ETL tools, as well as the sources for all scripts and code
modules. These data flows define lineage and dependency relationships.
• Exception handling logic to determine what happens when a data quality
screen detects an error.
• Processing schedules that control ETL job sequencing and dependencies.
• Current maximum surrogate key values for all dimensions.
• Batch parameters that identify the current active source and target tables for
all ETL jobs.

BUSINESS METADATA
• Data quality screen specifications including the code for data quality tests,
severity score of the potential error, and action to be taken when the error
occurs.
• Data dictionary describing the business content of all columns and tables
across the data warehouse.
• Logical data map showing the overall data flow from source tables and fields
through the ETL system to target tables and columns.
• Business rule logic describing all business rules that are either explicitly
checked or implemented in the data warehouse, including slowly changing
dimension policies and null handling.
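
The first item above, a data quality screen specification (test code, severity score, action), can be pictured as plain data plus a small runner. Everything here is a made-up illustration: the screen names, the severity numbers, and the two actions `reject` and `flag` are assumptions, not a standard vocabulary.

```python
# Each screen pairs a test with a severity score and the action to take on failure.
screens = [
    {"name": "customer_id_not_null", "severity": 9, "action": "reject",
     "test": lambda row: row.get("customer_id") is not None},
    {"name": "amount_non_negative", "severity": 3, "action": "flag",
     "test": lambda row: row.get("amount", 0) >= 0},
]

def run_screens(rows, screen_specs):
    """Apply every screen to every row; return surviving rows and error events."""
    passed, error_events = [], []
    for index, row in enumerate(rows):
        rejected = False
        for spec in screen_specs:
            if not spec["test"](row):
                error_events.append({"row": index, "screen": spec["name"],
                                     "severity": spec["severity"],
                                     "action": spec["action"]})
                rejected = rejected or spec["action"] == "reject"
        if not rejected:
            passed.append(row)  # flagged rows still load; rejected rows do not
    return passed, error_events

sample_rows = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": None, "amount": 5.0},   # fails the not-null screen
    {"customer_id": 2, "amount": -1.0},     # fails the non-negative screen
]
clean_rows, events = run_screens(sample_rows, screens)
```

The list of error events corresponds to the quality screen results kept as process metadata, while the `screens` list itself is the specification.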

We call this usage based optimization

Solutions for huge datasets

Typical adjustments include partitioning the
warehouse onto multiple servers, either vertically, horizontally, or both. Vertical
partitioning means breaking up the components of the presentation server architecture
into separate platforms, typically running on separate servers. In this case you could
have a server for the atomic level data, a server for the aggregate data (which may
also include atomic level data for performance reasons), and a server for aggregate
management and navigation. Often this last server has its own caching capabilities,
acting as an additional data layer. You may also have separate servers for background
ETL processing.
Horizontal partitioning means distributing the load based on datasets. In this case,
you may have separate presentation servers (or sets of vertically partitioned servers)
dedicated to hosting specific business process dimensional models. For example, you
may put your two largest datasets on two separate servers, each of which holds atomic
level and aggregate data. You will need functionality somewhere between the user
and data to support analyses that query data from both business processes.
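
The "functionality somewhere between the user and data" can be pictured as a drill-across step: query each server separately, then merge the result sets on a shared (conformed) dimension attribute. In this sketch the two result dictionaries stand in for answers from two hypothetical presentation servers, keyed by product name.

```python
def drill_across(result_a, result_b, label_a, label_b):
    """Full outer merge of two per-server result sets on their shared key."""
    merged = {}
    for key in sorted(set(result_a) | set(result_b)):
        merged[key] = {label_a: result_a.get(key), label_b: result_b.get(key)}
    return merged

# Stand-ins for query results from two separate servers.
sales_by_product = {"widget": 10, "gadget": 4}          # sales server
inventory_by_product = {"widget": 7, "sprocket": 2}     # inventory server

combined = drill_across(sales_by_product, inventory_by_product,
                        "units_sold", "units_on_hand")
```

The merge only works if both servers use the same product values, which is why conformed dimensions matter even more once the data is split across machines.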

Presentation Server Metadata

PROCESS METADATA
• Database monitoring system tables containing information about the use of
tables throughout the presentation server.
• Aggregate usage statistics including OLAP usage.

TECHNICAL METADATA
• Database system tables containing standard RDBMS table, column, view,
index, and security information.
• Partition settings including partition definitions and logic for managing them
over time.
• Stored procedures and SQL scripts for creating partitions, indexes, and
aggregates, as well as security management.
• Aggregate definitions containing the definitions of system entities such as
materialized views, as well as other information necessary for the query rewrite
facility of the aggregate navigator.
• OLAP system definitions containing system information specific to OLAP
databases.
• Target data policies and procedures including retention, backup, archive,
recovery, ownership, and security settings.

The most important BI application types include the following:

• Direct access queries: the classic ad hoc requests initiated by business users
from desktop query tool applications.
• Standard reports: regularly scheduled reports typically delivered via the BI
portal or as spreadsheets or PDFs to an online library.
• Analytic applications: applications containing powerful analysis algorithms
in addition to normal database queries. Pre-built analytic applications
packages include budgeting, forecasting, and business activity monitoring
(BAM).

• Dashboards and scorecards: multi-subject user interfaces showing key
performance indicators (KPIs) textually and graphically.
• Data mining and models: exploratory analysis of large "observation sets"
usually downloaded from the data warehouse to data mining software. Data
mining is also used to create the underlying models used by some analytic and
operational BI applications.
• Operational BI: real time or near real time queries of operational status, often
accompanied by transaction write-back interfaces.


Tuesday, December 15, 2009

When you get older, you will know which people cannot be your friends.

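Last night she said I fall short in five ways: poor background, poor looks, poor education, poor income, poor ambitions. She also said the best I could find would be a working girl earning two or three thousand a month.
None of this means she is a snob, and she has been having a hard time lately, but a friendship is hard to keep going when one gets no respect and is looked down on. I should perhaps be more generous and look after others, but she even feels stifled by being looked after. So why go looking for a snub?
Maybe I have gotten used to a life with someone in it, but if it is not worth it (on every level), why keep it going?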

Thursday, December 10, 2009

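The Motive for Exploration (Einstein's address at Planck's birthday celebration) : 弯曲评论 (repost)

Reposted from:
http://www.tektalk.cn/2009/12/06/探索的动机（爱因斯坦在普朗克生日会上的讲话）/
Appendix: The Motive for Exploration (Einstein's address at Planck's birthday celebration)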

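In the temple of science there are many mansions, and the people who dwell in them are of all kinds, and the motives that led them there are just as varied. Many love science because it gives them an intellectual pleasure beyond that of ordinary people; science is their own special sport, in which they seek vivid experience and the satisfaction of their ambition. There are also many others in the temple who offer the products of their brains on the altar for purely utilitarian purposes. If an angel of the Lord were to come and drive everyone belonging to these two categories out of the temple, the assembly would be greatly reduced, but some people would still remain inside, people of both the past and the present. Our Planck is one of them, and that is why we love him.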

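I am well aware that in our imagination we have just lightly expelled many outstanding people who made great, perhaps the principal, contributions to building the temple of science; in many cases our angel would also find it hard to decide. But of one thing I am sure: if the temple held only the two kinds of people who were expelled, it would never have existed, any more than a forest can consist of nothing but creepers. For these people, any field of human activity will do if the opportunity arises; whether they become engineers, officials, merchants, or scientists depends entirely on circumstances. Now let us look again at those who found favor with the angel.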

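Most of them are rather odd, taciturn, and solitary people, but despite these common traits they are in fact quite unlike one another, unlike the many who were driven out, who resemble each other. What brought them to the temple? That is a difficult question, and it cannot be answered in a single sweeping sentence. To begin with, I agree with Schopenhauer that one of the strongest motives leading people to art and science is escape from the repulsive coarseness and hopeless dreariness of everyday life, from the fetters of one's own ever-shifting desires. A finely tempered person longs to escape from personal life into the world of objective perception and thought; this longing may be compared with that of the city dweller who yearns to leave noisy, crowded surroundings for the quiet of the high mountains, where the eye ranges freely through the still, pure air and dwells on tranquil scenery that seems designed for eternity.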

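Alongside this negative motive there is a positive one. People always try to form, in whatever way suits them best, a simplified and intelligible picture of the world; they then try to substitute this cosmos of theirs for the world of experience, and so to overcome it. This is what the painter, the poet, the speculative philosopher, and the natural scientist do, each in their own way. Each makes this cosmos and its construction the pivot of their emotional life, in order to find there the peace and security that cannot be found within the narrow confines of personal experience.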

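What place does the theoretical physicist's picture of the world occupy among all these possible pictures? In describing relationships it demands the highest possible standard of rigorous precision, a standard that only the language of mathematics can meet. On the other hand, the physicist must restrict the subject matter very severely: he must be content to describe the simplest events within the domain of our experience. To reproduce all the more complex events with the precision and logical completeness that the theoretical physicist demands is beyond the power of the human intellect. Supreme purity, clarity, and certainty come at the cost of completeness. But when one timidly leaves aside everything elusive and more complex, what can be the attraction of coming to know this tiny portion of nature? Does the result of such a cautious effort deserve the proud name of a theory of the universe?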

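I believe it does; for the general laws on which the structure of theoretical physics is based should hold for every natural phenomenon. With them it ought to be possible to arrive, by pure deduction, at the description, that is to say the theory, of every natural process, including life, provided that the process of deduction does not go too far beyond the capacity of the human intellect. The physicist's renunciation of completeness for his cosmos is therefore not a matter of fundamental principle.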

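The supreme task of the physicist is to arrive at those universal elementary laws from which the cosmos can be built up by pure deduction. There is no logical path to these laws; only intuition, resting on a sympathetic understanding of experience, can reach them. Given this methodological uncertainty, one might suppose that there could be any number of equally tenable systems of theoretical physics; in theory this view is no doubt correct. But the development of physics has shown that at any given moment, out of all conceivable constructions, one has always proved decidedly superior to all the rest. No one who has gone deeply into the matter will deny that, in practice, the world of phenomena uniquely determines the theoretical system, even though there is no logical bridge between phenomena and their theoretical principles; this is what Leibnitz so aptly described as "pre-established harmony." Physicists often accuse epistemologists of not paying enough attention to this. Here, it seems to me, lie the roots of the controversy carried on some years ago between Mach and Planck.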

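The longing to behold this pre-established harmony is the source of inexhaustible perseverance and patience. We see that it is for this reason that Planck has devoted himself to the most general problems of our science, rather than letting himself be diverted to more pleasant and more easily attained goals. I have often heard colleagues try to attribute this attitude of his to extraordinary willpower and discipline, but I believe this is wrong. The state of mind that enables someone to do work of this kind is akin to that of the religious worshipper or the lover; the daily effort comes from no deliberate intention or plan, but straight from passion. Our beloved Planck is sitting right here, smiling inwardly at my childish playing about with the lantern of Diogenes. Our affection for him needs no threadbare explanation. May his love of science continue to light his path and lead him to the solution of the most important problem in present-day physics, a problem that he himself posed and has done so much to solve. May he succeed in uniting quantum theory with electrodynamics and mechanics in a single logical system.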

Friday, December 04, 2009

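After Losing the Remittance Slip, Part 2

My last post was on Wednesday of last week. I went over on Thursday afternoon and, after waiting forty minutes, found that the person responsible for this had not come in. I went to the lobby manager; at first she said that with that person away there was nothing she could do, so I said, "I don't care which of your staff is in or not; I came here to do business with your bank." The lobby manager then took me to a more senior person in charge, Deputy Director Li. We talked it over, and I was told to wait outside. After about 40 minutes the manager came out and said they had not found it, and asked me to leave a phone number and go home to wait. There was nothing more I could do.

This Tuesday, Director Li called to say they had found it and asked whether I still wanted it. Of course I said yes, and they faxed it to me, so this matter has now reached a milestone.

Actually I was fairly confident when I went on Thursday, but that person was not there, and after I came back I worried that they might forget about it or not bother. I had also forgotten to get their contact details, so I could not reach them. I did not expect that this person would actually get it done for me.

The next step is to use these documents to get the receipt. So busy.

Tomorrow should be my birthday, and on Sunday there should be a small get-together with classmates.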

Thursday, December 03, 2009

Data Warehouse Modeling_5

Definition of technical architecture

The technical architecture is the overall plan for what you want the DW/BI system to be when it's ready for serious use.
It describes the flow of data from the source systems to the decision makers and the transformations and data stores that data goes through along the way. It also specifies the tools, techniques, utilities, and platforms needed to make that flow happen.

We think about metadata as all the information that defines and describes the structures, operations, and contents of the DW/BI system. The DW/BI industry often refers to two main categories of metadata: technical and business.

Technical metadata defines the objects and processes that make up the DW/BI system from a technical perspective. This includes the system metadata that defines the data structures themselves, like tables, fields, data types, indexes, and partitions in the relational engine, and databases, dimensions, measures, and data mining models. In the ETL process, technical metadata defines the sources and targets for a particular task, the transformations (including business rules and data quality screens), and their frequency.

Process metadata describes the results of various operations in the warehouse. In the ETL process, each task logs key data about its execution, such as start time, end time, CPU seconds used, disk reads, disk writes, and rows processed. Similar process metadata is generated when users query the warehouse. This data is initially valuable for troubleshooting the ETL or query process.

Wednesday, December 02, 2009

Note on using Delete in Kettle

In the "Delete" step, the "=" condition cannot match null = null. So if any of the columns used in the comparison are null, the "Delete" step will not delete those rows.
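
The underlying behavior is ordinary SQL three-valued logic and can be reproduced outside Kettle. The sketch below uses Python's built-in `sqlite3` module, not Kettle itself; the table `t` and its columns are made up, and `IS` is SQLite's null-safe comparison operator (other databases spell this differently).

```python
import sqlite3

def delete_demo():
    """Show that '=' never matches NULL, while a null-safe comparison does."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, region TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "east"), (2, None)])

    # 'region = NULL' evaluates to unknown, so no row qualifies.
    deleted_with_eq = conn.execute(
        "DELETE FROM t WHERE id = ? AND region = ?", (2, None)).rowcount

    # 'IS' treats NULL as equal to NULL in SQLite.
    deleted_null_safe = conn.execute(
        "DELETE FROM t WHERE id = ? AND region IS ?", (2, None)).rowcount

    remaining = conn.execute("SELECT id FROM t").fetchall()
    conn.close()
    return deleted_with_eq, deleted_null_safe, remaining
```

In a transformation, one possible workaround along the same lines is to replace nulls in the key columns with a sentinel value on both sides before the comparison, as long as the sentinel cannot occur in real data.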
