The
Relational Model
The database development process we are following these steps:
v
Gather user/business requirements.
v
Develop the conceptual E-R Model (shown as an
E-R Diagram) based on the user/business requirements.
v
Convert the E-R Model to a set of relations in
the (logical) relational model
v
Normalize the relations to remove any anomalies.
v
Implement the database by creating a table for
each normalized relation in a relational database management system.
What
is Normalization?
v
Normalization is a process in which we
systematically examine relations for anomalies and, when detected, remove those
anomalies by splitting up the relation into two new, related, relations.
v
Normalization is an important part of the
database development process: Often during normalization, the database
designers get their first real look into how the data are going to interact in
the database.
v
Finding problems with the database structure at
this stage is strongly preferred to finding problems further along in the
development process because at this point it is fairly easy to cycle back to
the conceptual model (Entity Relationship model) and make changes.
v
Normalization can also be thought of as a
trade-off between data redundancy and performance. Normalizing a relation
reduces data redundancy but introduces the need for joins when all of the data
is required by an application such as a report query.
Recall, the Relational Model consists of the elements: relations, which
are made up of attributes.
v
A relation is a set of attributes with values
for each attribute such that:
v
Each attribute (column) value must be a single
value only.
v
All values for a given attribute (column ) must
be of the same data type.
v
Each attribute (column) name must be unique.
v
The order of attributes (columns) is
insignificant
v
No two tuples (rows) in a relation can be
identical.
v
The order of the tuples (rows) is insignificant.
v
From our discussion of E-R Modeling, we know
that an Entity typically corresponds to a relation and that the Entity’s
attributes become attributes of the relation.
v
We also discussed how, depending on the
relationships between entities, copies of attributes (the identifiers ) were
placed in related relations as foreign keys.
Functional
Dependencies
v
A Functional Dependency describes a relationship
between attributes within a single relation.
v
An attribute is functionally dependent on
another if we can use the value of one attribute to determine the value of another.
Example: Employee_Name is functionally dependent on
Social_Security_Number because Social_Security_Number can be used to uniquely
determine the value of Employee_Name.
We use the arrow symbol → to indicate a functional dependency.
X → Y is read X functionally determines Y
Here are a few more examples:
Student_ID → Student_Major
Student_ID, CourseNumber, Semester → Grade
Course_Number, Section → Professor, Classroom, NumberOfStudents
SKU → Compact_Disk_Title, Artist
CarModel, Options, TaxRate → Car_Price
The attributes listed on the left hand side of the → are called
determinants.
One can read A → B as, “A determines B”. Or more specifically: Given a
value for A, we can uniquely determine one value for B.
Keys
and Uniqueness
v
Key: One or more attributes that uniquely
identify a tuple (row) in a relation.
v
The selection of keys will depend on the
particular application being considered.
v
In most cases the key for a relation will
already be specified during the conversion from the E-R model to a set of
relations.
v
Users can also offer some guidance as to what
would make an appropriate key.
v
Recall that no two relations should have exactly
the same values, thus a candidate key would consist of all of the attributes in
a relation.
A key functionally determines a tuple (row). So one functional
dependency that can always be written is:
The Key → All other
attributes
Not all determinants are keys
Modification
Anomalies
v
Once our E-R model has been converted into
relations, we may find that some relations are not properly specified. There
can be a number of problems:
v
Deletion Anomaly: Deleting one fact or data
point from a relation results in other information being lost.
v
Insertion Anomaly: Inserting a new fact or tuple
into a relation requires we have information from two or more entities – this
situation might not be feasible.
v
Update Anomaly: Updating one fact in a relation
requires us to update multiple tuples.
v
Here is a quick example to illustrate these
anomalies: A company has a Purchase Order form:
Our dutiful consultant creates the E-R Model directly matching the
purchase order:
When we follow the steps to convert to a set of relations this results
in two relations (keys are underlined):
PO_HEADER (PO_Number, PODate, Vendor, Ship_To, ...)
LINE_ITEMS (PO_Number, ItemNum, PartNum, Description, Price, Qty)
Consider some sample data for the LINE_ITEMS relation:
PO_Number
|
ItemNum
|
PartNum
|
Description
|
Price
|
Qty
|
0101
|
I01
|
P99
|
Cup
|
$3.00
|
7
|
0101
|
I02
|
P98
|
Plate
|
$1.00
|
11
|
0101
|
I03
|
P77
|
Bowl
|
$2.00
|
6
|
0102
|
I01
|
P99
|
Plate
|
$3.00
|
5
|
0102
|
I02
|
P77
|
Bowl
|
$2.00
|
5
|
0103
|
I01
|
P33
|
Fork
|
$2.00
|
8
|
v
What are some of the problems with this relation
?
v
What happens if we want to add the fact that
Order O103 has quantity 5 of part P99 ?
v
What happens when we delete item I02 from Order
O101 ?
v
What happens if we want to change the price of
the Plate (P99)?
v
These problems occur because the relation in
question contains data about 2 or more themes.
Normalization
Process
v
Relations can fall into one or more categories
(or classes) called Normal Forms
v
Normal Form: A class of relations free from a
certain set of modification anomalies.
Normal
forms are given names such as:
v
First normal form (1NF)
v
Second normal form (2NF)
v
Third normal form (3NF)
v
Boyce-Codd normal form (BCNF)
v
Fourth normal form (4NF)
v
Fifth normal form (5NF)
v
Domain-Key normal form (DK/NF)
These forms are cumulative. A relation in Third normal form is also in
2NF and 1NF.
The Normalization Process for a given relation consists of:
v
Specify the Key of the relation
v
Specify the functional dependencies of the
relation.
v
Sample data (tuples) for the relation can assist
with this step.
v
Apply the definition of each normal form
(starting with 1NF).
v
If a relation fails to meet the definition of a
normal form, change the relation (most often by splitting the relation into two
new relations) until it meets the definition.
Re-test the modified/new relations to ensure they meet the definitions
of each normal form.
First
Normal Form (1NF):
A relation is in first normal form if it meets the definition of a
relation:
v
Each attribute (column) value must be a single
value only.
v
All values for a given attribute (column ) must
be of the same type.
v
Each attribute (column) name must be unique.
v
The order of attributes (columns) is
insignificant
v
No two tuples (rows) in a relation can be
identical.
v
The order of the tuples (rows) is insignificant.
v
If you have a key defined for the relation, then
you can meet the unique row requirement.
v
Example relation in 1NF (note that key
attributes are underlined):
v
STOCKS (Company, Symbol, Headquarters, Date,
Close_Price)
Company
|
Symbol
|
Headquarters
|
Date
|
Close_Price
|
Microsoft
|
MSFT
|
Redmond, WA
|
09/07/2013
|
23.96
|
Microsoft
|
MSFT
|
Redmond, WA
|
09/08/2013
|
23.93
|
Microsoft
|
MSFT
|
Redmond, WA
|
09/09/2013
|
24.01
|
Oracle
|
ORCL
|
Redwood Shores, CA
|
09/07/2013
|
24.27
|
Oracle
|
ORCL
|
Redwood Shores, CA
|
09/08/2013
|
24.14
|
Oracle
|
ORCL
|
Redwood Shores, CA
|
09/09/2013
|
24.33
|
Note that the key (which consists of the Symbol and the Date) can
uniquely determine the Company, headquarters and Close Price of the stock. Here
was assume that Symbol must be unique but Company, Headquarters, Date and Price
are not unique
Second
Normal Form (2NF):
A relation is in second normal form (2NF) if all of its non-key
attributes are dependent on all of the key.
v
Another way to say this: A relation is in second
normal form if it is free from partial-key dependencies
v
Relations that have a single attribute for a key
are automatically in 2NF.
v
This is one reason why we often use artificial
identifiers (non-composite keys) as keys.
v
In the example below, Close Price is dependent
on Company, Date
v
The following example relation is not in 2NF:
STOCKS (Company, Symbol, Headquarters, Date, Close_Price).
Company
|
Symbol
|
Headquarters
|
Date
|
Close_Price
|
Microsoft
|
MSFT
|
Redmond, WA
|
09/07/2013
|
23.96
|
Microsoft
|
MSFT
|
Redmond, WA
|
09/08/2013
|
23.93
|
Microsoft
|
MSFT
|
Redmond, WA
|
09/09/2013
|
24.01
|
Oracle
|
ORCL
|
Redwood Shores, CA
|
09/07/2013
|
24.27
|
Oracle
|
ORCL
|
Redwood Shores, CA
|
09/08/2013
|
24.14
|
Oracle
|
ORCL
|
Redwood Shores, CA
|
09/09/2013
|
24.33
|
To start the normalization process, list the functional dependencies
(FD):
FD1: Symbol, Date → Company, Headquarters, Close Price
FD2: Symbol → Company, Headquarters
Consider that Symbol, Date → Close Price.
So we might use Symbol, Date as our key.
However we also see that: Symbol → Headquarters
v
This violates the rule for 2NF in that a part of
our key key determines a non-key attribute.
v
Another name for this is a Partial key
dependency. Symbol is only a “part” of the key and it determines a non-key
attribute.
v
Also, consider the insertion and deletion
anomalies.
One Solution: Split this up into two new relations:
COMPANY (Company, Symbol, Headquarters)
STOCK_PRICES (Symbol, Date, Close_Price)
At this point we have two new relations in our relational model. The
original “STOCKS” relation we started with is removed form the model.
Sample data and functional dependencies for the two new relations:
COMPANY Relation:
Company
|
Symbol
|
Headquarters
|
Microsoft
|
MSFT
|
Redmond, WA
|
Oracle
|
ORCL
|
Redwood Shores, CA
|
FD1: Symbol → Company, Headquarters
STOCK_PRICES relation:
Symbol
|
Date
|
Close Price
|
MSFT
|
09/07/2013
|
23.96
|
MSFT
|
09/08/2013
|
23.93
|
MSFT
|
09/09/2013
|
24.01
|
ORCL
|
09/07/2013
|
24.27
|
ORCL
|
09/08/2013
|
24.14
|
ORCL
|
09/09/2013
|
24.33
|
FD1: Symbol, Date → Close Price
In checking these new relations we can confirm that they meet the
definition of 1NF (each one has well defined unique keys) and 2NF (no partial
key dependencies).
Third
Normal Form (3NF)
A relation is in third normal form (3NF) if it is in second normal form
and it contains no transitive dependencies.
Consider relation R containing attributes A, B and C. R(A, B, C)
If A → B and B → C then A → C
Transitive Dependency: Three attributes with the above dependencies.
Example: At CUNY:
Course_Code → Course_Number, Section
Course_Number, Section → Classroom, Professor
Consider one of the new relations we created in the STOCKS example for
2nd normal form:
Company
|
Symbol
|
Headquarters
|
Microsoft
|
MSFT
|
Redmond, WA
|
Oracle
|
ORCL
|
Redwood Shores, CA
|
The functional dependencies we can see are:
FD1: Symbol → Company
FD2: Company → Headquarters
so therefore:
Symbol → Headquarters
This is a transitive dependency.
v
What happens if we remove Oracle?
v
We loose information about 2 different facts.
The solution again is to split this relation up into two new relations:
STOCK_SYMBOLS(Company, Symbol)
COMPANY_HEADQUARTERS(Company, Headquarters)
This gives us the following sample data and FD for the new relations
Company
|
Symbol
|
Microsoft
|
MSFT
|
Oracle
|
ORCL
|
FD1: Symbol → Company
Company
|
Headquarters
|
Microsoft
|
Redmond, WA
|
Oracle
|
Redwood Shores, CA
|
FD1: Company → Headquarters
Again, each of these new relations should be checked to ensure they
meet the definition of 1NF, 2NF and now 3NF.
Boyce-Codd
Normal Form (BCNF)
v
A relation is in BCNF if every determinant is a
candidate key.
v
Recall that not all determinants are keys.
v
Those determinants that are keys we initially
call candidate keys.
v
Eventually, we select a single candidate key to
be the key for the relation.
Consider the following example:
v
Funds consist of one or more Investment Types.
v
Funds are managed by one or more Managers
v
Investment Types can have one more Managers
v
Managers only manage one type of investment.
Relation: FUNDS (FundID, InvestmentType, Manager)
FundID
|
InvestmentType
|
Manager
|
99
|
Common Stock
|
Smith
|
99
|
Municipal Bonds
|
Jones
|
33
|
Common Stock
|
Green
|
22
|
Growth Stocks
|
Brown
|
11
|
Common Stock
|
Smith
|
FD1: FundID, InvestmentType →
Manager
FD2: FundID, Manager → InvestmentType
FD3: Manager → InvestmentType
In this case, the combination FundID and InvestmentType form a
candidate key because we can use FundID,InvestmentType to uniquely identify a
tuple in the relation.
Similarly, the combination FundID and Manager also form a candidate key
because we can use FundID, Manager to uniquely identify a tuple.
Manager by itself is not a candidate key because we cannot use Manager
alone to uniquely identify a tuple in the relation.
Is this relation FUNDS(FundID, InvestmentType, Manager) in 1NF, 2NF or
3NF ?
Given we pick FundID, InvestmentType as the Primary Key: 1NF for sure.
2NF because all of the non-key attributes (Manager) is dependant on all
of the key.
3NF because there are no transitive dependencies.
However consider what happens if we delete the tuple with FundID 22. We
loose the fact that Brown manages the InvestmentType “Growth Stocks.”
Therefore, while FUNDS relation is in 1NF, 2NF and 3NF, it is in BCNF
because not all determinants (Manager in FD3) are candidate keys.
The following are steps to normalize a relation into BCNF:
List all of the determinants.
See if each determinant can act as a key (candidate keys).
For any determinant that is not a candidate key, create a new relation
from the functional dependency. Retain the determinant in the original
relation.
For our example:
FUNDS (FundID, InvestmentType, Manager)
The determinants are:
FundID, InvestmentType, FundID, Manager,Manager
Which determinants can act as keys ?
FundID, InvestmentType YES,FundID, Manager YES,Manager NO
Create a new relation from the functional dependency:
MANAGERS(Manager, InvestmentType),FUND_MANAGERS(FundID, Manager)
In this last step, we have retained the determinant “Manager” in the
original relation MANAGERS.
Each of the new relations sould be checked to ensure they meet the
definitions of 1NF, 2NF, 3NF and BCNF
Fourth
Normal Form (4NF)
v
A relation is in fourth normal form if it is in
BCNF and it contains no multivalued dependencies.
v
Multivalued Dependency: A type of functional
dependency where the determinant can determine more than one value.
More formally, there are 3 criteria:
There must be at least 3 attributes in the relation. call them A, B,
and C, for example.
Given A, one can determine multiple values of B.
Given A, one can determine multiple values of C.
B and C are independent of one another.
Book example:
Student has one or more majors.
Student participates in one or more activities.
StudentID
|
Major
|
Activities
|
100
|
CIS
|
Baseball
|
100
|
CIS
|
Volleyball
|
100
|
Accounting
|
Baseball
|
100
|
Accounting
|
Volleyball
|
200
|
Marketing
|
Swimming
|
FD1: StudentID →→ Major
FD2: StudentID →→ Activities
Portfolio ID
|
Stock Fund
|
Bond Fund
|
999
|
Janus Fund
|
Municipal Bonds
|
999
|
Janus Fund
|
Dreyfus Short-Intermediate Municipal Bond Fund
|
999
|
Scudder Global Fund
|
Municipal Bonds
|
999
|
Scudder Global Fund
|
Dreyfus Short-Intermediate Municipal Bond Fund
|
888
|
Kaufmann Fund
|
T. Rowe Price Emerging Markets Bond Fund
|
A few characteristics:
v
No regular functional dependencies
v
All three attributes taken together form the
key.
v
Latter two attributes are independent of one
another.
v
Insertion anomaly: Cannot add a stock fund
without adding a bond fund (NULL Value). Must always maintain the combinations
to preserve the meaning.
Stock Fund and Bond Fund form a multivalued dependency on Portfolio ID.
PortfolioID →→ Stock Fund
PortfolioID →→ Bond Fund
Resolution: Split into two tables with the common key:
Portfolio ID
|
Stock Fund
|
999
|
Janus Fund
|
999
|
Janus Fund
|
999
|
Scudder Global Fund
|
999
|
Scudder Global Fund
|
888
|
Kaufmann Fund
|
Portfolio ID
|
Bond Fund
|
999
|
Municipal Bonds
|
999
|
Dreyfus Short-Intermediate Municipal Bond Fund
|
999
|
Municipal Bonds
|
999
|
Dreyfus Short-Intermediate Municipal Bond Fund
|
888
|
T. Rowe Price Emerging Markets Bond Fund
|
Fifth
Normal Form (5NF)
v
Also called “Projection Join” Normal form.
v
There are certain conditions under which after
decomposing a relation, it cannot be reassembled back into its original form.
v
We don’t consider these issues here.
Domain
Key Normal Form (DK/NF)
v
A relation is in DK/NF if every constraint on
the relation is a logical consequence of the definition of keys and domains.
v
Constraint: An rule governing static values of
an attribute such that we can determine if this constraint is True or False.
Examples:
v
Functional Dependencies
v
Multivalued Dependencies
v
Inter-relation rules
v
Intra-relation rules
v
However: Does Not include time dependent
constraints.
Key: Unique identifier of a tuple.
Domain:
The physical (data type, size, NULL values) and semantic (logical) description
of what values an attribute can hold.
There is no known algorithm for converting a relation directly into
DK/NF.
De-Normalization:
Consider the following relation:
CUSTOMER (CustomerID, Name, Address, City, State, Zip)
v
This relation is not in DK/NF because it
contains a functional dependency not implied by the key.
v
Zip → City, State
v
We can normalize this into DK/NF by splitting
the CUSTOMER relation into two:
CUSTOMER (CustomerID, Name, Address,
Zip)
ZIPCODES (Zip, City, State)
v
We may pay a performance penalty – each customer
address lookup requires we look in two relations (tables).
v
More technically, obtaining a complete customer
and address record requires us to join CUSTOMER and ZIPCODES together.
v
In such cases, we may de-normalize the relations
to achieve a performance improvement.
v
In other words, we re-assemble the original
CUSTOMER relation we started with that will contain all of the attributes.
v
De-normalization presents a trade-off between
performance and modification anomalies / data redundancy.
All-in-One Database Normalization Example
Example that would run through all of the normal forms from beginning to end
using the same tables. This is tough to do, but here is an attempt:
Example relation:
EMPLOYEE ( Name, Project, Task, Office, Floor,
Phone )
Note: Keys are underlined.
Example Data:
Name
|
Project
|
Task
|
Office
|
Floor
|
Phone
|
Bill
|
100X
|
T1
|
400
|
4
|
1400
|
Bill
|
100X
|
T2
|
400
|
4
|
1400
|
Bill
|
200Y
|
T1
|
400
|
4
|
1400
|
Bill
|
200Y
|
T2
|
400
|
4
|
1400
|
Sue
|
100X
|
T33
|
442
|
4
|
1442
|
Sue
|
200Y
|
T33
|
442
|
4
|
1442
|
Sue
|
300Z
|
T33
|
442
|
4
|
1442
|
Ed
|
100X
|
T2
|
588
|
5
|
1588
|
v
Name
is the employee’s name
v
Project
is the project they are working on. Bill is working on two different projects,
Sue is working on 3.
v
Task is
the current task being worked on. Bill is now working on Tasks T1 and T2. Note
that Tasks are independent of the project. Examples of a task might be faxing a
memo or holding a meeting.
v
Office
is the office number for the employee. Bill works in office number 400.
v
Floor
is the floor on which the office is located.
v
Phone
is the phone extension. Note this is associated with the phone in the given
office.
First Normal Form
Assume the key is Name,
Project, Task.
Is EMPLOYEE in 1NF ?
Second Normal Form
v
List all of the functional dependencies for
EMPLOYEE.
v
Are all of the non-key attributes dependant on
all of the key ?
v
It seems if we know the employee’s name, we can
figure out their office, floor and phone.
Split into two relations EMPLOYEE_PROJECT_TASK and EMPLOYEE_OFFICE_PHONE.
EMPLOYEE_PROJECT_TASK (Name, Project, Task)
Name
|
Project
|
Task
|
Bill
|
100X
|
T1
|
Bill
|
100X
|
T2
|
Bill
|
200Y
|
T1
|
Bill
|
200Y
|
T2
|
Sue
|
100X
|
T33
|
Sue
|
200Y
|
T33
|
Sue
|
300Z
|
T33
|
Ed
|
100X
|
T2
|
EMPLOYEE_OFFICE_PHONE (Name, Office, Floor, Phone)
Name
|
Office
|
Floor
|
Phone
|
Bill
|
400
|
4
|
1400
|
Sue
|
442
|
4
|
1442
|
Ed
|
588
|
5
|
1588
|
Third Normal Form
v
Assume each office has exactly one phone number.
v
Are there any transitive dependencies ?
v
Where are the modification anomalies in
EMPLOYEE_OFFICE_PHONE ?
Split EMPLOYEE_OFFICE_PHONE into two new relations.
EMPLOYEE_PROJECT_TASK (Name, Project, Task)
Name
|
Project
|
Task
|
Bill
|
100X
|
T1
|
Bill
|
100X
|
T2
|
Bill
|
200Y
|
T1
|
Bill
|
200Y
|
T2
|
Sue
|
100X
|
T33
|
Sue
|
200Y
|
T33
|
Sue
|
300Z
|
T33
|
Ed
|
100X
|
T2
|
EMPLOYEE_OFFICE (Name, Office, Floor)
Name
|
Office
|
Floor
|
Bill
|
400
|
4
|
Sue
|
442
|
4
|
Ed
|
588
|
5
|
EMPLOYEE_PHONE (Office, Phone)
Office
|
Phone
|
400
|
1400
|
442
|
1442
|
588
|
1588
|
Boyce-Codd Normal Form
v
List all of the functional dependencies for
EMPLOYEE_PROJECT_TASK, EMPLOYEE_OFFICE and EMPLOYEE_PHONE. Look at the
determinants.
v
Are all determinants candidate keys ?
Forth Normal Form
v
Are there any multivalued dependencies ?
v
What are the modification anomalies ?
Split EMPLOYEE_PROJECT_TASK.
EMPLOYEE_PROJECT (Name, Project )
Name
|
Project
|
Bill
|
100X
|
Bill
|
200Y
|
Sue
|
100X
|
Sue
|
200Y
|
Sue
|
300Z
|
Ed
|
100X
|
EMPLOYEE_TASK (Name, Task )
Name
|
Task
|
Bill
|
T1
|
Bill
|
T2
|
Sue
|
T33
|
Ed
|
T2
|
EMPLOYEE_OFFICE (Name, Office, Floor)
Name
|
Office
|
Floor
|
Bill
|
400
|
4
|
Sue
|
442
|
4
|
Ed
|
588
|
5
|
OFFICE_PHONE (Office, Phone)
Office
|
Phone
|
400
|
1400
|
442
|
1442
|
588
|
1588
|
At each step of the process, we did the following:
v
Write out the relation
v
(optionally) Write out some example data.
v
Write out all of the functional dependencies
v
Starting with 1NF, go through each normal form and
state why the relation is in the given normal form.
Another short example
Consider the following example of normalization for a CUSTOMER
relation.
Relation Name
CUSTOMER (CustomerID, Name, Street, City, State, Zip, Phone)
Example Data
CustomerID
|
Name
|
Street
|
City
|
State
|
Zip
|
Phone
|
C101
|
Bill Smith
|
123 First St.
|
New Brunswick
|
NJ
|
07101
|
732-555-1212
|
C102
|
Mary Green
|
11 Birch St.
|
Old Bridge
|
NJ
|
07066
|
908-555-1212
|
Functional Dependencies
FD1: CustomerID → Name, Street, City, State, Zip, Phone
FD2: Zip → City, State
Normalization
1NF Meets the definition of
a relation.
2NF All non key attributes
are dependent on all of the key.
3NF Relation CUSTOMER is not
in 3NF because there is a transitive dependency.
CustomerID → Zip and Zip → City, State
Solution: Split CUSTOMER
into two relations:
CUSTOMER (CustomerID, Name, Street, Zip, Phone)
ZIPCODES (Zip, City, State)
Check both CUSTOMER and ZIPCODE to ensure they are both in 1NF up to
BCNF.
As a final step, consider de-normalization.
0 comments:
Post a Comment