EART20170 Computing,
Data Analysis & Communication skills
Lecturer: Dr Paul Connolly (F18 – Sackville
Building)
p.connolly@manchester.ac.uk
2. Computing (Excel statistics/modelling)
2
lectures
assessed
practical work
Course notes etc: http://cloudbase.phy.umist.ac.uk/people/connolly
Recommended reading: Cheeney. (1983)
Statistical methods in Geology. George, Allen & Unwin
Plan
- Two more lectures plus drop-in
sessions in computer labs
- Assessment handed out today
and need to hand in by 16:00, Tuesday December 12th.
Lecture 4
- Cell referencing and naming
datasets
- Statistical functions and
add-ins
- Analysing the Gaussian and
T-distributions in Excel.
Using Microsoft
Excel
Worksheet
Cell
Cell reference
Function bar
Basic functions
Using Microsoft
Excel
Can also `import’ text files
Entering data
Can name worksheets
Cell referencing
- Several different ways of
referencing the information in a cell.
- Difference between absolute
and relative references
The A1 reference
style
Use
To refer to
A10:E20
The range of cells in columns A through
E and rows 10 through 20
H:J
All cells in columns H through J
H:H
All cells in column H
5:10
All cells in rows 5 through 10
5:5
All cells in row 5
B15:E15
The range of cells in row 15 and columns
B through E
A10:A20
The range of cells in column A and rows
10 through 20
A10
The cell in column A and row 10
To refer to
another worksheet
- You can also use worksheet
functions in the same way
- The worksheets must be in
the same workbook
Relative, absolute
and mixed references
- A relative cell reference,
such as A1, is based on the relative position of the cell. If the position
of the cell that contains the reference changes, the reference itself
is changed.
- An absolute cell reference,
such as $A$1, always refers to a cell in a specific location. If the
position of the cell that contains the formula changes, the absolute
reference remains the same.
- A mixed reference has either
an absolute column and relative row, or absolute row and relative column.
An absolute column reference takes the form $A1, $B1, and so on. An
absolute row reference takes the form A$1, B$1, and so on.
Relative references
Suppose I enter that cell b2 is equal
to a1.
If I copy cell b2 to b3, the relative
reference automatically adjusts to be the next cell relative to a1 –
a2!
This is the default in Excel
Absolute references
If I put in an absolute cell reference,
the cell reference does not change when copied to other cells
Mixed references
What happens if I copy a relative column,
absolute row reference to the c3 cell?
The relative column reference adjusts,
but the absolute column reference does not
Naming datasets
You can name a dataset by selecting it
with the mouse (left clicking and dragging) and entering the name in
the `reference box’
The name must not have any `spaces’.
The name can be used as a reference in
that worksheet.
Aside: An example
of plotting data in excel
There are many ways of plotting the data
that get the same result
Usually you will do this by trial and
error (i.e. create a plot and edit it
Say you wanted to plot two similar things
on the same graph, one a line and one a histogram
Start by selecting all three columns
and go to
insert->chart
Aside: An example
of plotting data in excel
In this case you want to plot a custom
line and column plot
Select Custom types and find the line
column graph
Aside: An example
of plotting data in excel
As you can see, Excel wants to plot all
three columns, which we don’t want in this case
But you can remove the first column by
clicking on the Series tab, highlighting the series you want to remove
and clicking remove
Aside: An example
of plotting data in excel
Clicking next you can play around with
many settings and annotate your plot
This requires some playing with to investigate
all the options
Aside: An example
of plotting data in excel
As you can see the bar chart has large
gaps which is not what we want
You can change this by right clicking
on the bars and selecting Format Data Series…
On the next dialog, click on the options
tab and set the Gap widths to 0
Entering formulas
- You can use Excel just like
a calculator
- This involves entering numerical
expressions
- For example: imagine I wanted
to calculate the mean of 3 numbers: 2, 3 and 7
- I could enter in cell a1:
Entering formulas
- A better way might be to enter
the values in cells, a1, a2 and a3
- This is still repetitive for
large datasets, but it is useful for calculating things like t-values
- But Excel has many worksheet
functions…
Worksheet functions
- Functions are predefined formulas
that perform calculations by using specific values, called arguments,
in a particular order, or structure. Functions can be used to perform
simple or complex calculations. For example, the `average’ function
calculates the mean of data in a given range.
- =average(a1:a10)
- =average(data1)
- Excel has a huge library of
functions like this. Using them takes practice but it is best
to learn by doing examples.
Worksheet functions
- The easiest way to start using
functions is to `insert’ one into the worksheet.
- Have a cell highlighted and
click on the insert->function tab.
Worksheet functions
- The insert function dialog
pops-up.
- You can either search for
a function or find it yourself.
- Usually you will want to use
a certain category such as `statistical’ and find the function in
that category
Worksheet functions
- Next you will get a dialog
asking you to put in cell references for the function.
- This dialog changes for different
functions.
- For the above example using
the `average’ function we could put in A1:A3 in the first text box
and click OK.
- The mean will appear in the
cell.
Note that when you become efficient with
excel you will use shortcuts for inserting functions and wont need to
use this dialog.
Statistical
worksheet functions (hint, hint)
- Average(a1:a3) [average(data1)]
- Stdev(a1:a3) [stdev(data1)]
- Var(a1:a3) [variance]
- Stdevp(a1:a3) [stdevp(data1)]
- Sqrt($a$1) [sqrt(a1)]
- ^2 [to the power]
- Sum(a1:a3) [sum(data1)]
- pearson(a1:a3,b1:b3)
[pearson(data1,data2)]
- Rsq(a1:a3,b1:b3) [rsq(data1,data2)]
- Quartile(a1:a20,0.25) [quartile(data1,0.5]
- Mode(a1:a20) [mode(data1)]
- Normdist(a1,mean,std,TRUE)
- Norminv(p,mean,std)
- Tdist(a1,8,2) [students t-distribution]
- Tinv(p,8) [inverse of t-distribution]
- Ttest(data1,data2,tails,type) [comparing
two means] probability of rejecting null hypothesis (if this is less
than significance level, reject null hypothesis). Type should equal
2.
Analysing the
Gaussian (normal) distribution
- The function normdist has
the following prototype:
- NORMDIST(x,mean,std,cumulative)
- It returns the value of a
Gaussian distribution with given mean and standard deviation at x.
- Cumulative is either set
to `FALSE’ or `TRUE’
- If it is set to FALSE the
function will return the actual value of the Gaussian distribution at
x.
- If it is set to TRUE the
function will return the cumulative distribution at x (i.e. this is
the table from lecture 3).
- Confidence level of a given
interval. Eg what is the probability of a value lying in the interval...
Analysing the
Gaussian (normal) distribution
- The function norminv has
the following prototype:
- NORMINV(p,mean,std)
- It returns the x value associated
with the cumulative probability of p
- Useful for assessing levels
of significance. Eg what are the limits on x at a given confidence level?
- This method is used more
frequently for the t-distribution
Analysing the
t-distribution
- You still need to remember
the formulas for (1) estimating the interval for the mean; (2) testing
the significance of the correlation coefficient; and (3) if two means
are equal.
- The function tdist has the
following prototype:
- TDIST(x,df,tails)
- It returns the significance
level (alpha) of a t-distribution with given degrees of freedom.
- Tails is either set to 1
or 2.
- If it is set to 1 the function
will return the accumulation of probability from infinity to x.
- If it is set to 2 the function
will return the accumulation of probability in both tails.
- Not used too often
Analysing the
t-distribution
- The function tinv has the
following prototype:
- TINV(alpha,df)
- It returns the critical value
for the t-distribution corresponding to a significance level, alpha.
- By default it is a two tailed
confidence level, but for a one tailed confidence level substitute 2x(alpha)
for alpha.
- Used in hypothesis testing.
Excel has a
quick way of comparing two means:
- The student t-test.
- If they have the same length
we can use the TTEST function
- But rather than giving us
the critical t-value, it gives us a critical probability for rejection.
- What if this is less than
the significance value, alpha?
T-critical
Rejection
Acceptance
We reject the null hypothesis
Random number
generation
- This is very useful in computational
science.
- They are not really random
numbers, they are generated by an algorithm.
- But it is difficult to get
random numbers on a computer.
- The worksheet function rand()
generates random numbers between 0 and 1.
- Hence to generate a normally
distributed random number sequence, with a given mean and standard deviation
we can use:
- Norminv(rand(),mean,std)
- Over many generations, the
variable will have the given mean and standard deviation
Random number
generation
Over a large number of generations, the
mean approaches that of the true random variable
Next lecture
- Many people struggled with
this on the test.
- There is an easy way to do
it with Excel.
- Lets go back to our example
measuring bed thickness. We know that x=12.1?0.3m and y=4.2?0.2m.
- So if we generate many values
of x and y with the above means and std, then add all these together
we could calculate the mean and std of the result.
Error propagation
- Using a computer we can generate
many normally distributed variables and therefore find the distribution
of the answer.
- Hence we can directly calculate
the error (standard deviation) in the answer.
- This is useful as it can
be done for complicated equations with ease.
- This is called the Monte
Carlo method of propagating errors.
One more thing
- Worksheet functions can also
take text arguments and return text.
- If you are doing a hypothesis
test you might want a statement that tells you to either accept or reject
the null hypothesis
- You can use the `IF’ construct
- =IF(logical test,if true,if
false)
- E.g. IF(tvalue>tcrit,”Accept
alternate hypothesis”,”Accept null hypothesis”).
- IF can also return numeric
values.
Homework
- Have a look at using some
of the functions in excel.
- Especially for manipulating
cells and calculating means, standard deviations.
- Try calculating z-values
and t-values from the table from handout 3 in excel to see if you can
get them correct.
- This will help for Tuesdays
practical labs.
- REMEMBER: please check the
student notice board for your allotted time.