Sunday, January 24, 2010
Interesting
Thursday, December 31, 2009
Data Warehouse Modeling_8
Type 1: Overwrite the Dimension Attribute
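As a rough illustration of what "overwrite" means here, the sketch below updates one attribute of a dimension row in place, so the previous value is simply lost and no history row is added. The file layout (`surrogate_key,natural_key,department`), the function name `scd_type1_update`, and the sample values are all assumptions made for this example; a real warehouse would do this with an UPDATE in the database or in the ETL tool.

```shell
# Hypothetical Type 1 update on a CSV dimension file.
# Assumed layout: surrogate_key,natural_key,department
scd_type1_update() {
    dim_file=$1      # path to the dimension file
    natural_key=$2   # business key of the row to change
    new_value=$3     # new attribute value
    awk -F, -v OFS=, -v nk="$natural_key" -v val="$new_value" '
        ($2 "") == (nk "") { $3 = val }   # overwrite in place; old value is lost
        { print }
    ' "$dim_file" > "$dim_file.tmp" && mv "$dim_file.tmp" "$dim_file"
}

# Usage (hypothetical data):
#   scd_type1_update product_dim.csv P-100 Strategy
```

The row keeps its surrogate key, so existing fact rows now report under the new attribute value, which is the defining trade-off of Type 1.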
Monday, December 28, 2009
Data Warehouse Modeling_7
• EXECUTIVE SUMMARY
o Business Understanding
o Project Focus
• METHODOLOGY
o Business Requirements
o High-Level Architecture Development
o Standards & Products
o Ongoing Refinement
• BUSINESS REQUIREMENTS AND ARCHITECTURAL IMPLICATIONS
o CRM (Campaign Management & Target Marketing)
o Sales Forecasting
o Inventory Planning
o Sales Performance
o Additional Business Issues
Data Quality
Common Data Elements and Business Definitions
• ARCHITECTURE OVERVIEW
o High Level Model
o Metadata Driven
o Flexible Services Layers
• MAJOR ARCHITECTURAL ELEMENTS
o Services and Functions
ETL Services
Customer Data Integration
External Demographics
Data Access Services
BI Applications (CRM/Campaign Mgmt; Forecasting)
Sales Management Dashboard
Ad Hoc Query and Standard Reporting
Metadata Maintenance
User Maintained Data System
o Data Stores
Sources and Reference Data
ETL Data Staging and Data Quality Support
Presentation Servers
Business Metadata Repository
o Infrastructure and Utilities
o Metadata Strategy
• ARCHITECTURE DEVELOPMENT PROCESS
o Architecture Development Phases
o Architecture Proof of Concept
o Standards and Product Selection
o High Level Enterprise Bus Matrix
APPENDIX A—ARCHITECTURE MODELS
Wednesday, December 23, 2009
Data Warehouse Modeling_6
• Data profiling (1)
• Change data capture (2)
• Extract system (3)
Clean and Conform
There are five major services in the cleaning and conforming step:
• Data cleansing system (4)
• Error event tracking (5)
• Audit dimension creation (6)
• Deduplicating (7)
• Conforming (8)
Deliver
The delivery subsystems in the ETL back room consist of:
• Slowly changing dimension (SCD) manager (9)
• Surrogate key generator (10)
• Hierarchy manager (11)
• Special dimensions manager (12)
• Fact table builders (13)
• Surrogate key pipeline (14)
• Multi-valued bridge table builder (15)
• Late arriving data handler (16)
• Dimension manager system (17)
• Fact table provider system (18)
• Aggregate builder (19)
• OLAP cube builder (20)
• Data propagation manager (21)
ETL Management Services
• Job scheduler (22)
• Backup system (23)
• Recovery and restart (24)
• Version control (25)
• Version migration (26)
• Workflow monitor (27)
• Sorting (28)
• Lineage and dependency (29)
• Problem escalation (30)
• Paralleling and pipelining (31)
• Compliance manager (32)
• Security (33)
• Metadata repository (34)
• ETL operations statistics including start times, end times, CPU seconds used,
disk reads, disk writes, and row counts.
• Audit results including checksums and other measures of quality and
completeness.
• Quality screen results describing the error conditions, frequencies of
occurrence, and ETL system actions taken (if any) for all quality screening
findings.
TECHNICAL METADATA
• System inventory including version numbers describing all the software
required to assemble the complete ETL system.
• Source descriptions of all data sources, including record layouts, column
definitions, and business rules.
• Source access methods including rights, privileges, and legal limitations.
• ETL data store specifications and DDL scripts for all ETL tables, including
normalized schemas, dimensional schemas, aggregates, stand-alone relational
tables, persistent XML files, and flat files.
• ETL data store policies and procedures including retention, backup, archive,
recovery, ownership, and security settings.
• ETL job logic, extract and transforms including all data flow logic
embedded in the ETL tools, as well as the sources for all scripts and code
modules. These data flows define lineage and dependency relationships.
• Exception handling logic to determine what happens when a data quality
screen detects an error.
• Processing schedules that control ETL job sequencing and dependencies.
• Current maximum surrogate key values for all dimensions.
• Batch parameters that identify the current active source and target tables for
all ETL jobs.
• Data quality screen specifications including the code for data quality tests,
severity score of the potential error, and action to be taken when the error
occurs.
• Data dictionary describing the business content of all columns and tables
across the data warehouse.
• Logical data map showing the overall data flow from source tables and fields
through the ETL system to target tables and columns.
• Business rule logic describing all business rules that are either explicitly
checked or implemented in the data warehouse, including slowly changing
dimension policies and null handling.
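The "current maximum surrogate key values" item in the list above can be made concrete with a small sketch: read the highest key already present in a dimension and hand out the next one. The CSV layout (integer surrogate key in the first column) and the function name `next_surrogate_key` are assumptions for illustration only; in practice the value would come from the ETL tool's metadata or a database sequence.

```shell
# Hypothetical helper: print the next surrogate key for a dimension stored
# as a CSV file whose first column is an integer surrogate key.
next_surrogate_key() {
    awk -F, '
        $1 + 0 > max { max = $1 + 0 }   # track the largest key seen so far
        END { print max + 1 }           # an empty file yields 1
    ' "$1"
}

# Usage (hypothetical file):
#   key=$(next_surrogate_key customer_dim.csv)
```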
warehouse onto multiple servers, either vertically, horizontally, or both. Vertical
partitioning means breaking up the components of the presentation server architecture
into separate platforms, typically running on separate servers. In this case you could
have a server for the atomic level data, a server for the aggregate data (which may
also include atomic level data for performance reasons), and a server for aggregate
management and navigation. Often this last server has its own caching capabilities,
acting as an additional data layer. You may also have separate servers for background
ETL processing.
Horizontal partitioning means distributing the load based on datasets. In this case,
you may have separate presentation servers (or sets of vertically partitioned servers)
dedicated to hosting specific business process dimensional models. For example, you
may put your two largest datasets on two separate servers, each of which holds atomic
level and aggregate data. You will need functionality somewhere between the user
and data to support analyses that query data from both business processes.
• Database monitoring system tables containing information about the use of
tables throughout the presentation server.
• Aggregate usage statistics including OLAP usage.
• Database system tables containing standard RDBMS table, column, view,
index, and security information.
• Partition settings including partition definitions and logic for managing them
over time.
• Stored procedures and SQL scripts for creating partitions, indexes, and
aggregates, as well as security management.
• Aggregate definitions containing the definitions of system entities such as
materialized views, as well as other information necessary for the query rewrite
facility of the aggregate navigator.
• OLAP system definitions containing system information specific to OLAP
databases.
• Target data policies and procedures including retention, backup, archive,
recovery, ownership, and security settings.
• Direct access queries: the classic ad hoc requests initiated by business users
from desktop query tool applications.
• Standard reports: regularly scheduled reports typically delivered via the BI
portal or as spreadsheets or PDFs to an online library.
• Analytic applications: applications containing powerful analysis algorithms
in addition to normal database queries. Pre-built analytic applications
packages include budgeting, forecasting, and business activity monitoring
(BAM).
performance indicators (KPIs) textually and graphically.
• Data mining and models: exploratory analysis of large "observation sets"
usually downloaded from the data warehouse to data mining software. Data
mining is also used to create the underlying models used by some analytic and
operational BI applications.
• Operational BI: real time or near real time queries of operational status, often
accompanied by transaction write-back interfaces.
Tuesday, December 15, 2009
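When you get old, you will know which people cannot be your friends.
Judging from all this she is not a snobbish person, and she has been having a hard time lately, but a friendship is hard to keep going when one side gets no respect and is looked down on. I probably should be more generous and look after others, but when the other person feels stifled by that very care, why go looking for a snub?
Maybe I have grown used to a life with someone in it, but if it is not worth it (on every level), why keep it going?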
Thursday, December 10, 2009
The Motive for Exploration (Einstein's Speech at Planck's Birthday Celebration) : 弯曲评论 (repost)
http://www.tektalk.cn/2009/12/06/探索的动机(爱因斯坦在普朗克生日会上的讲话)/
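Appendix: The Motive for Exploration (Einstein's Speech at Planck's Birthday Celebration)
In the temple of science there are many mansions, and those who live in them are of every kind, as are the motives that led them there. Many love science because it gives them the pleasure of intellectual power beyond the ordinary; science is their own special sport, in which they look for vivid experience and the satisfaction of their ambition. Many others in the temple lay the products of their brains on the altar for purely utilitarian purposes. If an angel of God came and drove everyone belonging to these two categories out of the temple, the crowd gathered there would shrink considerably, yet some people would still remain inside, some from ancient times and some from the present. Our Planck is one of them, and that is why we love him.
I am well aware that in our imagination we have just casually expelled many outstanding people who made a great, perhaps even the major, contribution to building the temple of science; in many cases our angel would also find the decision hard to make. But of one thing I am sure: if the temple held only the two kinds of people who were expelled, it would never have come to exist, just as creepers alone do not make a forest. For these people, any field of human activity would do if the opportunity arose; whether they become engineers, officials, merchants, or scientists depends entirely on circumstances. Now let us look again at those who found favor with the angel.
Most of them are rather odd, taciturn, and solitary people, but despite these shared traits they are in fact very different from one another, unlike the many who were driven out, who resemble each other. What brought them to the temple? That is a hard question and cannot be answered with a single sweeping statement. To begin with, I agree with Schopenhauer that one of the strongest motives leading people to art and science is escape from the repellent coarseness and hopeless dreariness of everyday life, and from the fetters of one's own ever-shifting desires. A finely tempered person always longs to escape from personal life into the world of objective perception and thought; this longing is like that of the city dweller who yearns to leave the noisy, crowded surroundings for the quiet of the high mountains, where the eye ranges freely through the still, pure air and is lost in a tranquil landscape that seems made for eternity.
Alongside this negative motive there is a positive one. People always try to draw, in whatever way suits them best, a simplified and intelligible picture of the world; they then try to substitute this cosmos of theirs for the world of experience, and so to overcome it. This is what the painter, the poet, the speculative philosopher, and the natural scientist do, each in their own way. Each makes this cosmos and its construction the pivot of their emotional life, in order to find there the peace and security that cannot be found within the narrow confines of personal experience.
What place does the theoretical physicist's picture of the world hold among all these possible pictures? In describing relationships it demands the highest possible standard of rigor and precision, a standard that only the language of mathematics can meet. On the other hand, the physicist must restrict the subject matter very severely and be content with describing the simplest events in our field of experience. To reproduce all the more complex events with the precision and logical completeness that the theoretical physicist demands is beyond the power of the human intellect. Supreme purity, clarity, and certainty come at the cost of completeness. But if one timidly leaves aside everything that is elusive and more complex, what can be the attraction of getting to know such a tiny part of nature? Does the result of such a cautious effort deserve the proud name of a theory of the universe?
I believe it does, because the general laws on which the structure of theoretical physics rests ought to hold for every natural phenomenon. With them it should be possible to arrive, by pure deduction, at a description of every natural process (including life), that is to say at a theory of these processes, provided the deduction does not go too far beyond the capacity of the human intellect. So the physicist's giving up the completeness of his cosmos is not a matter of fundamental principle.
The physicist's highest task is to arrive at those universal elementary laws from which the cosmos can be built up by pure deduction. There is no logical path to these laws; only intuition, resting on a sympathetic understanding of experience, can reach them. Given this methodological uncertainty, one might suppose that there could be many equally tenable systems of theoretical physics, and in theory this view is no doubt correct. But the development of physics has shown that at any given time, out of all conceivable constructions, one has always proved clearly superior to the rest. Nobody who has really gone deeply into the matter will deny that in practice the world of phenomena uniquely determines the theoretical system, even though there is no logical bridge between phenomena and their theoretical principles; this is what Leibniz so aptly described as a "pre-established harmony". Physicists often reproach epistemologists for not paying enough attention to this. I think this is the root of the controversy between Mach and Planck a few years ago.
The longing to behold this pre-established harmony is the source of inexhaustible perseverance and patience. We see Planck devoting himself for this reason to the most general problems of this science, rather than letting himself be distracted by more pleasant and more easily reached goals. I have often heard colleagues try to attribute this attitude of his to extraordinary willpower and discipline, but I think that is wrong. The state of mind that drives a person to do work of this kind is akin to that of the religious believer or the lover; the daily effort comes from no deliberate intention or plan, but directly from passion. Our beloved Planck is sitting here, smiling inwardly at my childish playing about with the lantern of Diogenes. Our affection for him needs no threadbare explanation. May his love of science continue to light his path in the future and lead him to the solution of the most important problem in present-day physics, a problem he himself posed and has already done so much to solve. May he succeed in uniting quantum theory with electrodynamics and mechanics in a single logical system.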
Friday, December 04, 2009
After Losing the Remittance Receipt 2
Thursday, December 03, 2009
Data Warehouse Modeling_5
It describes the flow of data from the source systems to the decision makers and the transformations and data stores that data goes through along the way. It also specifies the tools, techniques, utilities, and platforms needed to make that flow happen.
Technical metadata defines the objects and processes that make up the DW/BI system from a technical perspective. This includes the system metadata that defines the data structures themselves, like tables, fields, data types, indexes, and partitions in the relational engine, and databases, dimensions, measures, and data mining models. In the ETL process, technical metadata defines the sources and targets for a particular task, the transformations (including business rules and data quality screens), and their frequency.
Wednesday, December 02, 2009
Notes on using Delete in Kettle
Saturday, November 28, 2009
Wednesday, November 25, 2009
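After Losing the Remittance Receipt.
First, be clear that the remittance receipt itself cannot be reissued, but you can get a replacement voucher, which is really a photocopy of the bank's voucher.
Information you need: 1. the remitting bank; 2. the date (to the day is enough); 3. the payee's details: account number, account name, and the bank where the account is held; 4. the amount; 5. the teller number.
The teller number is probably hard for most people to find out, but that does not matter. If you remember which window it was, you can ask the lobby manager to look it up; if you do not, that is fine too: you can ask for the numbers of all the tellers on duty that day. (Internally the bank can actually find the transaction from the payee's account details, which is how mine was found, but that is probably against the rules.)
Step 1: Find the branch office the transaction belongs to; the lobby manager can tell you this.
Step 2: Find that branch's lobby manager and explain the situation. The bank should offer this service, and the manager will take you to the person who issues the copy.
Step 3: Once you have found that person, hand over the information above and they can run the search. The search is charged for (I am only going tomorrow, so I do not know the fee yet). If you cannot pin down the teller number, at least give them the numbers of all the tellers on duty that day to narrow the search. In that case there are many transactions to go through, so you may need to ask nicely.
That should settle it.
Note: lobby managers seem fairly conscientious these days, perhaps because they are afraid of complaints. Still, it is best not to fall out with them, because then they may offer some help beyond what the rules require. Going strictly by the rules, this is very hard to get done.
Get the other side to explain as much as possible, because they sometimes give away useful information. For example, when I asked, the lobby manager said that given the window number they could tell which teller it was, and that the windows do not change; from that we can infer that they are able to provide all the teller numbers for that day.
Well, my case is only half a success so far. I will carry out step 3 tomorrow morning or afternoon.
To be continued.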
Sunday, November 22, 2009
H1N1, continued.
So-called bronchitis. Lucky for me, though not the luckiest outcome.
Sunday, November 15, 2009
Blue Screen Problem Caused by the Microsoft Pinyin IME
I believe the problem is no longer a pain for you. Hope it helps.
Monday, November 09, 2009
Data Warehouse Modeling_4
return on investment (ROI)
net present value (NPV) or internal rate of return (IRR)
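For reference, the usual textbook definitions of these three measures can be written as below. The symbols are notation introduced here, not taken from the notes: $C_0$ is the initial investment, $C_t$ the net cash flow in period $t$, $r$ the discount rate per period, and $T$ the number of periods.

```latex
\begin{align}
\mathrm{ROI} &= \frac{\text{total benefit} - \text{investment cost}}{\text{investment cost}} \\
\mathrm{NPV}(r) &= -C_0 + \sum_{t=1}^{T} \frac{C_t}{(1+r)^{t}} \\
\mathrm{IRR} &= r^{*} \quad \text{such that} \quad \mathrm{NPV}(r^{*}) = 0
\end{align}
```

As a made-up example, an investment of 100 that returns 60 in each of two years has an undiscounted ROI of $(120 - 100)/100 = 20\%$, and at $r = 10\%$ its NPV is $-100 + 60/1.1 + 60/1.21 \approx 4.13$.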
The major roles involved in the DW/BI implementation: front office, coaches, regular lineup, and special teams.
influence on the rest of the team.
Some team members are assigned to the project full time; others are involved on a
part-time or sporadic basis.
Project managers need strong
organizational and leadership skills to keep everyone involved moving in the same direction.
The business analyst is responsible for leading the business requirements definition activities and then representing those requirements as the technical architecture,
This person is not responsible for inputting all the metadata, but rather is a watchdog to ensure that everyone is contributing their relevant piece,
Sunday, November 08, 2009
Data Warehouse Modeling_3
Scoping and adjustment happen at the same time.
Note: Project scope should be driven by the business's requirements.
Scope is determined by business requirements.
Each early iteration of your DW/BI program should be limited in scope to
the data resulting from a single business process. Do not turn a simple
problem into a complicated one at the outset. In other words, start small.
Given the lengthy project timeframes associated with complex software
development projects combined with the realities of relatively high rates
of under-delivery, there's been much interest in the rapid application
development movement, often measured in weeks.
Rapid development (usually measured in weeks).
In fact, we suggest that BI team members work in close proximity to the
business so they're readily available and responsive;
Developers and business people work closely together; the work is business-driven.
Some development teams have naturally fallen into the trap of creating
analytic or reporting solutions in a vacuum. In most of these situations,
the team worked with a small set of users to extract a limited set of
source data and make it available to solve their unique problems. The
outcome is often a stand-alone data stovepipe that can't be leveraged by
others or worse yet, delivers data that doesn't tie to the organization's
other analytic information.
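Analysis work that ended up as report building: isn't that exactly our team's problem right now? Because requirements were gathered in isolation, everything we built ended up isolated too, and the pieces cannot be used together.
project charter, the document explains the project's focus and motivating
business requirements, objectives, approach, anticipated data and target
users, involved parties and stakeholders, success criteria, assumptions and
risks. It may also be appropriate to explicitly list data and analyses
Focus, requirements, objectives, approach, anticipated data and target users, involved parties and stakeholders, success criteria (acceptance criteria), assumptions, and risks.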
P38
Thursday, November 05, 2009
Weird way to get a return value
script, try return in the command field and then return $? in the
subroutine field, as the following shows.
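function isRunnable()
{
    # $1 = job type (unused here), $2 = file name, $3 = optional day suffix
    typ=$1
    filename=$2
    day=${3:-}
    # The while loop runs in a pipeline subshell, so "return" inside it only
    # ends that subshell; its status becomes the pipeline's exit status.
    grep "$filename$day\$" 'exe_seq_all.dat' | while read -r line
    do
        arr=($line)
        prework=${arr[0]}
        # Runnable (status 0) if the prerequisite is already listed in fin.dat.
        if grep -q " $prework " 'fin.dat'
        then
            return 0
        fi
        # Note: only the first matching line is ever examined.
        return 1
    done
    # Pass the subshell's status on to the caller.
    return $?
}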
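The trick above is needed because a loop on the right-hand side of a pipe runs in a subshell. One possible alternative, sketched here under assumptions, is to feed the loop from a redirection so that it runs in the current shell and `return` leaves the function directly. The name `is_runnable`, the extra file-path parameters, and the assumed file format (first field is the prerequisite, the line ends with the job name plus day) are illustrative, and the matching is literal rather than regex-based, so it is not a drop-in replacement.

```shell
# Hypothetical variant: the loop reads from a redirection, not a pipe,
# so it runs in the current shell and "return" leaves the function.
is_runnable() {
    filename=$1
    day=${2:-}
    seq_file=${3:-exe_seq_all.dat}   # assumed default, as in the original
    fin_file=${4:-fin.dat}           # assumed default, as in the original
    while read -r line
    do
        # Keep only lines that end with "<filename><day>".
        case $line in
            *"$filename$day") ;;
            *) continue ;;
        esac
        prework=${line%%[[:space:]]*}   # first whitespace-separated field
        # Fixed-string match, so dots in job names are not read as regex.
        if grep -qF " $prework " "$fin_file"
        then
            return 0
        fi
        return 1
    done < "$seq_file"
    return 0   # no matching line: the piped version also ends with status 0
}
```

A caller can then test it directly, for example `if is_runnable load_sales.sh 20091105; then ...; fi` (hypothetical job name).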
Tuesday, November 03, 2009
Recover from rm *
* in the Cygwin command line. All the work I had done that afternoon was
gone. I was quite upset to learn that there is no 'recycle bin' in Cygwin.
Luckily, after some searching on the net, I found 'File Scavenger', which
helped me restore the *.sh, *.dat, *.pl, and *~ files from the disk. Thanks, Que Tek
Consulting.co
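One way to soften this kind of accident in future is a small "trash" function that moves files aside instead of deleting them. This is only a sketch: the function name `trash` and the `TRASH_DIR` location are assumptions, and it does nothing for files already removed with `rm`.

```shell
# Hypothetical "trash" helper: moves files into a holding directory
# instead of deleting them. TRASH_DIR is an assumed, overridable location.
trash() {
    trash_dir=${TRASH_DIR:-"$HOME/.trash"}
    mkdir -p "$trash_dir" || return 1
    for f in "$@"
    do
        [ -e "$f" ] || continue          # skip names that do not exist
        name=$(basename "$f")
        # A timestamp suffix keeps earlier copies of the same name apart
        # (two moves of the same name within one second would still collide).
        mv "$f" "$trash_dir/$name.$(date +%Y%m%d%H%M%S)" || return 1
    done
}

# Usage: trash *.sh *.dat    (instead of rm *.sh *.dat)
```

The holding directory has to be emptied by hand from time to time, which is the price of keeping an undo.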