Descriptive summarize characteristics of the sample data values for
one or more variables. The **lessR** `pivot()`

function serves as a single source for a wide variety and types of
descriptive statistics for one or more variables. Statistics are
computed either for the entire data set at once or separately for
different groups of data. MS Excel refers to the resulting statistical
summaries as a *pivot table*.

Aggregation: Form groups of data according to the levels of one or more categorical variables, then compute some statistical value of a numeric variable, such as a mean, for each group.

As an example of aggregation, compute the mean Years worked for each
combination of the levels of *Gender* and *Dept*
(department) of employment. To introduce the ease of use of
`pivot()`

, and provide context for the following discussion
this function call listed below includes the parameter names (in red)
for emphasis, but if the parameters are entered in this order, listing
their names is not necessary. Enter multiple values, such as for
*Gender* and *Dept* as an R vector, defined such as with
the base R combine `c()`

function. The output of the analysis
is shown later.

The `pivot()`

function computes statistics for three
classes of variables:

- Numerical variables that represent continuity such as time, money, and weight.
- Numerical variables that represent ordered categories, such as Likert scale responses Strongly Disagree to Strongly Agree on a 5-pt scale coded 1 to 5.
- Non-numerical categorical variables such as Gender encoded as Male, Female, and Other, or coded numerically but without numerical properties, such as 0, 1, and 2 for three values of Gender.

Any function that processes a single vector of data, such as a column
of data values for a variable in a data frame, and outputs a single
computed value, can be passed `pivot()`

. The phrase “any
function” is quite general, including any function in any active package
as well as user-defined functions that satisfy these criteria. Typically
the function accessed by `pivot()`

is one of the many
available R statistical functions that summarize data, including those
in the following table.

Statistic | Meaning |
---|---|

`sum` |
sum |

`mean` |
arithmetic mean |

`median` |
median |

`min` |
minimum |

`max` |
maximum |

`sd` |
standard deviation |

`var` |
variance |

`skew` |
skew |

`kurtosis` |
kurtosis |

`IQR` |
inter-quartile range |

`mad` |
mean absolute deviation |

The statistics `skew`

and `kurtosis`

have no
counterparts in base R, so are provided by **lessR**.
Computations of all other statistics follow from base R functions.

The `quantile`

and `table`

computations return
multiple values, and so are accessible to `pivot()`

only
because of internal programming that recognizes these functions and then
processes the resulting multiple output values.

Statistic | Meaning |
---|---|

`quantile` |
min, quartiles, max |

`table` |
cell counts or proportions |

The `table`

computation applies to an aggregated variable
that consists of discrete categories, such as the numbers 1 through 5
for responses to a 5-pt Likert scale. The result is a table of
frequencies or proportions, referred to for two or more variables as
either a contingency table, a cross-tabulation table, or a joint
frequency distribution. Only the `table`

computation applies
to non-numeric as well as numeric variables, though only meaningful if
the aggregated variable consists of a relatively small set of discrete
values, character strings or numeric.

The default quantiles for `quantile`

are quartiles.
Specify a custom number of quantiles with the `q_num`

parameter, which has the default value of 4 for quartiles.

When calculating a statistic, the analyst should be aware of the
present and missing (not available) data that underlies the computed
statistic. Accordingly, `pivot()`

by default provides the
sample size information for each calculation. Turn off the display of
information with the `show_n`

parameter.

The following `pivot()`

parameters specify the data, one
or more statistics to compute for the aggregation, the variables over
which to aggregate, and the corresponding groups that contain the
aggregated values. The first three parameter values listed below are
required: the data frame, at least one statistic to compute, and at
least one variable for which to compute the statistic(s).

`data`

: The data frame that includes the variables of interest.`compute`

: One or more functions that specify the corresponding statistics to compute.`variable`

: One or more numerical variables to summarize, either by aggregation over groups or the entire sample as a single group.`by`

: Specify the optional aggregation according to the categorical variable(s) that define the groups.`by_cols`

: The optional categorical variable(s) that define the groups or cells for which to compute the aggregated values, listed as columns in a two-dimensional table.

For the given `data`

, `compute`

at least one
statistic for at least one `variable`

for each group
specified by `by`

and possibly `by_cols`

.

If no categorical (`by`

) variables are selected to define
groups, then one or more statistics are computed over multiple variables
over the entire data table defined as a single group. For categorical
(grouping) variables, defined with `by`

and optionally
`by_cols`

, can choose either computer multiple statistics or
multiple `variable`

(s), but not both.

Key Idea: Select any two of the three possibilities for multiple parameter values: Multiple compute functions, multiple variables over which to compute, and multiple categorical variables by which to define groups for aggregation.

Specify multiple descriptive statistics to compute, multiple values
for which to do the computation, and multiple categorical variables to
define groups as vectors such as with the base R `c()`

function.

The output of `pivot()`

is a two-dimensional table, rows
and columns. The output table can have multiple rows and multiple
columns according to the choice of parameter values. For each numerical
variable in the analysis, `pivot()`

displays both the
corresponding sample size as `n_`

and amount of missing or
not available data as `na_`

. Prevent displaying sample size
information by setting parameter `show_n`

to
`FALSE`

.

The output follows one of three general forms.

- The classic form is a data frame with the categorical variables in the analysis listed as columns first, followed by the numerical variables, ready for input into data analysis procedures. Each row of the output data frame consists of the corresponding values for the levels of the categorical variables and the corresponding values of the aggregated statistics.
- If one or two categorical variables are specified with the
`by_cols`

parameter, the output is a table with the specified categorical variables in the columns, amenable for viewing. - If there is no aggregation specified, no
`by`

variables, the corresponding statistics are computed over the entire data frame defined as a single group. - There may be many variables to summarize, such as all the items on a
multi-item scale from a survey analysis, so the output is a data frame
with variable names that each indicate the corresponding computed
statistic and the rows defined by the combinations of levels of the
`by`

variables.

To illustrate, use the 37-row Employee data set included with
**lessR**, here read into the *d* data frame.

`<- Read("Employee") d `

Two categorical variables in the *d* data frame are
*Dept* and *Gender*. Continuous variables include
*Years* worked at the company and annual *Salary*.

This example does the analysis previously illustrated. The name of
the computed mean of *Years* by default is *Years_mean*.
With no `by_cols`

variables, the output of
`pivot()`

is a data frame of the aggregated variables.
Optionally save this output for further analysis, as the data frame
*a* in this example.

```
<- pivot(data=d, compute=mean, variable=Years, by=c(Dept, Gender))
a a
```

```
## Dept Gender Years_n Years_na Years_mean
## 1 ACCT M 2 0 7.000
## 2 ADMN M 2 0 15.500
## 3 FINC M 3 0 11.333
## 4 MKTG M 1 0 18.000
## 5 SALE M 9 1 12.333
## 6 <NA> M 0 0 0.000
## 7 ACCT W 3 0 4.667
## 8 ADMN W 4 0 7.500
## 9 FINC W 1 0 7.000
## 10 MKTG W 5 0 8.200
## 11 SALE W 5 0 6.600
## 12 <NA> W 1 0 5.000
```

Using the **lessR** function `Read()`

, can
read the original data table from which the pivot table was constructed,
such as in the form of an Excel worksheet. For many analyses, easier to
read the Excel data into R and do the analysis in R than in Excel. The
result can also be written back into an Excel file with the
**lessR** function `Write()`

.

In this example, create an Excel file called
*MyPivotTable.xlsx* from the pivot table stored in the *a*
data frame. To avoid creating this file in this example, the function
call is commented out with the `#`

symbol in the first
column.

`#Write("MyPivotTable", data=a, format="Excel")`

The abbreviation `wrt_x()`

for the function name
simplifies the preceding expression, with the `format`

parameter dropped.

In this next example, specify multiple statistics for which to
aggregate for each group for the specified value variable
*Salary*. For each group, compare the mean to the median, and the
standard deviation to the interquartile range. By default, each column
of an aggregated statistic is the `variable`

name, here
*Salary*, followed by a “_”, then either the name of the
statistic or an abbreviation. The respective abbreviations for
`mean`

and `median`

are `mn`

and
`md`

.

`pivot(d, c(mean, median, sd, IQR), Salary, Dept)`

```
## Dept Salary_n Salary_na Salary_mean Salary_md Salary_sd Salary_IQR
## 1 ACCT 5 0 61792.78 69547.60 12774.61 21379.23
## 2 ADMN 6 0 81277.12 71058.60 27585.15 36120.57
## 3 FINC 4 0 69010.68 61937.62 17852.50 16034.81
## 4 MKTG 6 0 70257.13 61658.99 19869.81 26085.69
## 5 SALE 15 0 78830.07 77714.85 23476.84 28810.28
## 6 <NA> 1 0 53772.58 53772.58 NA 0.00
```

Also have available two functions provided by **lessR**
that are not part of base R: `skew()`

and
`kurtosis()`

.

`pivot(d, c(mean,sd,skew,kurtosis), Salary, Dept, digits_d=3)`

```
## Dept Salary_n Salary_na Salary_mean Salary_sd Salary_sk Salary_kt
## 1 ACCT 5 0 61792.78 12774.61 -0.623 -3.032
## 2 ADMN 6 0 81277.12 27585.15 0.835 -1.185
## 3 FINC 4 0 69010.68 17852.50 1.689 2.752
## 4 MKTG 6 0 70257.13 19869.81 0.859 -1.458
## 5 SALE 15 0 78830.07 23476.84 0.863 0.856
## 6 <NA> 1 0 53772.58 NA NA NA
```

Can also specify the variable names of the aggregated statistics with
the `out_names`

parameter. Here calculate the mean and median
*Salary* for each group defined by each combination of levels for
*Gender* and *Dept*.

`pivot(d, c(mean, median), Salary, c(Gender,Dept), out_names=c("MeanSalary", "MedianSalary"))`

```
## Gender Dept Salary_n Salary_na MeanSalary MedianSalary
## 1 M ACCT 2 0 59626.19 59626.19
## 2 W ACCT 3 0 63237.16 71084.02
## 3 M ADMN 2 0 80963.35 80963.35
## 4 W ADMN 4 0 81434.00 71058.60
## 5 M FINC 3 0 72967.60 66312.89
## 6 W FINC 1 0 57139.90 57139.90
## 7 M MKTG 1 0 99062.66 99062.66
## 8 W MKTG 5 0 64496.02 61356.69
## 9 M SALE 10 0 86150.97 82442.74
## 10 W SALE 5 0 64188.25 56508.32
## 11 M <NA> 0 0 0.00 NA
## 12 W <NA> 1 0 53772.58 53772.58
```

The `pivot()`

function can also aggregate over multiple
`variables`

. Here, aggregate *Years* and
*Salary*. Round the numerical aggregated results to the nearest
integer with the `digits_d`

parameter, which specifies the
number of decimal digits in the output. Different variables can have
different amounts of missing data, so the sample size, *n*, and
number missing, the number of values Not Available, *na*, are
listed separately for each aggregated variable.

`pivot(d, mean, c(Years, Salary), c(Dept, Gender), digits_d=0)`

```
## Dept Gender Years_n Years_na Years_mean Salary_n Salary_na Salary_mean
## 1 ACCT M 2 0 7 2 0 59626
## 2 ADMN M 2 0 16 2 0 80963
## 3 FINC M 3 0 11 3 0 72968
## 4 MKTG M 1 0 18 1 0 99063
## 5 SALE M 9 1 12 10 0 86151
## 6 <NA> M 0 0 0 0 0 0
## 7 ACCT W 3 0 5 3 0 63237
## 8 ADMN W 4 0 8 4 0 81434
## 9 FINC W 1 0 7 1 0 57140
## 10 MKTG W 5 0 8 5 0 64496
## 11 SALE W 5 0 7 5 0 64188
## 12 <NA> W 1 0 5 1 0 53773
```

Turn off the display of the sample size and number of missing values
for each group by setting the parameter `show_n`

to
`FALSE`

.

`pivot(d, mean, c(Years, Salary), Dept, digits_d=2, show_n=FALSE)`

```
## Dept Years_mean Salary_mean
## 1 ACCT 5.60 61792.78
## 2 ADMN 10.17 81277.12
## 3 FINC 10.25 69010.68
## 4 MKTG 9.83 70257.13
## 5 SALE 10.29 78830.06
## 6 <NA> 5.00 53772.58
```

Aggregation computes one or more statistics for one or more variables
across groups defined by the possible combinations of the levels of one
or more categorical variables. A related computation computes the
variables for each statistic for all the data. To compute over all the
rows of data, do not specify groups, that is, drop the `by`

parameter.

Get the grand mean of Years, that is, for all the data.

`pivot(d, mean, Years)`

```
## Years_n Years_na Years_mean
## 36 1 9.389
```

Get the grand mean of Years and Salary. Specify custom names for the results.

`pivot(d, mean, c(Years, Salary), digits_d=2, out_names=c("MeanYear", "MeanSalary"))`

```
## Years_n Years_na MeanYear Salary_n Salary_na MeanSalary
## 36 1 9.39 37 0 73795.56
```

Consider an example with more variables. Analyze the 6-pt Likert
scale responses to the Mach IV scale that assesses Machiavellianism.
Items are scored from 0 to 5, Strongly Disagree to Strongly Agree. The
data are included with **lessR** as the *Mach4* data
file.

Suppress output when reading by setting `quiet`

to
`TRUE`

. Calculate the mean, standard deviation, skew, and
kurtosis for all of the data for each of the 20 items on the scale. With
this specification, the form of a data frame is statistics in the
columns and the variables in the rows. The result are the specified
summary statistics for the specified variables over the entire data
set.

`<- Read("Mach4", quiet=TRUE) d `

`pivot(d, c(mean,sd,skew,kurtosis), m01:m20)`

```
## n na mean sd skew kurtosis
## m01 351 0 1.279 1.286 0.984 0.356
## m02 351 0 1.746 1.480 0.475 -0.784
## m03 351 0 2.900 1.450 -0.363 -0.829
## m04 351 0 3.339 1.174 -0.855 0.457
## m05 351 0 2.234 1.583 0.076 -1.095
## m06 351 0 3.074 1.478 -0.454 -0.923
## m07 351 0 2.775 1.473 -0.065 -1.150
## m08 351 0 2.100 1.456 0.133 -1.078
## m09 351 0 4.225 1.155 -1.745 2.743
## m10 351 0 3.991 1.138 -1.154 0.932
## m11 351 0 1.641 1.395 0.641 -0.393
## m12 351 0 1.801 1.625 0.422 -1.116
## m13 351 0 1.385 1.368 0.836 -0.126
## m14 351 0 1.954 1.304 0.201 -0.923
## m15 351 0 2.123 1.367 -0.088 -0.995
## m16 351 0 2.177 1.782 0.237 -1.355
## m17 351 0 2.407 1.604 0.085 -1.142
## m18 351 0 2.915 1.326 -0.559 -0.326
## m19 351 0 1.157 1.425 1.290 0.872
## m20 351 0 0.895 1.351 1.499 1.224
```

Aggregating a statistical computation of a continuous variable over
groups with `pivot()`

, such as computing the mean for each
combination of *Dept* and *Gender*, by default includes
the tabulation for each group (cell). A tabulation can be requested with
no analysis of a numerical variable, instead only a counting of the
available levels of the specified categorical variables.

The `table`

value of `compute`

specifies to
compute the frequency table for a categorical aggregated
`variable`

across all combinations of the `by`

variables. Specify one categorical `variable`

with the
remaining categorical variables specified with the `by`

parameter. Missing values for each combination of the levels of the
grouping variables are displayed.

Begin with a one-way frequency table computed over the entire data set.

`pivot(d, table, m06)`

```
## m06 Freq Prop
## 1 0 18 0.051
## 2 1 47 0.134
## 3 2 63 0.179
## 4 3 44 0.125
## 5 4 121 0.345
## 6 5 58 0.165
```

In this example, compute a two-way cross-tabulation table with the
levels of `variable`

*m06* as columns and the levels
of the `by`

categorical variable *m07* as rows.

`pivot(d, table, m06, m07)`

```
## m07 m06_n m06_na 0 1 2 3 4 5
## 1 0 17 0 4 3 2 3 3 2
## 2 1 64 0 7 24 7 6 18 2
## 3 2 87 0 4 14 30 13 24 2
## 4 3 43 0 2 1 10 16 12 2
## 5 4 93 0 0 3 13 5 56 16
## 6 5 47 0 1 2 1 1 8 34
```

To output the data in long form, one tabulation per row, set the
`table_long`

parameter to TRUE.

If interested in the inferential analysis of the cross-tabulation
table, access the **lessR** function
`Prop_test()`

to obtain both the descriptive and inferential
results, though limited to a one- or two-way table.

The default data table is *d*, but included explicitly in the
following example to illustrate the `data`

parameter.

`Prop_test(m06, by=m07, data=d)`

```
## variable: m06
## by: m07
##
## <<< Pearson's Chi-squared test <<<
##
## --- Description
##
## m06
## m07 0 1 2 3 4 5 Sum
## 0 4 3 2 3 3 2 17
## 1 7 24 7 6 18 2 64
## 2 4 14 30 13 24 2 87
## 3 2 1 10 16 12 2 43
## 4 0 3 13 5 56 16 93
## 5 1 2 1 1 8 34 47
## Sum 18 47 63 44 121 58 351
##
## Cramer's V: 0.380
##
## Row Col Observed Expected Residual Stnd Res
## 1 1 4 0.872 3.128 3.526
## 1 2 3 2.276 0.724 0.528
## 1 3 2 3.051 -1.051 -0.681
## 1 4 3 2.131 0.869 0.652
## 1 5 3 5.860 -2.860 -1.496
## 1 6 2 2.809 -0.809 -0.542
## 2 1 7 3.282 3.718 2.330
## 2 2 24 8.570 15.430 6.263
## 2 3 7 11.487 -4.487 -1.616
## 2 4 6 8.023 -2.023 -0.844
## 2 5 18 22.063 -4.063 -1.182
## 2 6 2 10.575 -8.575 -3.192
## 3 1 4 4.462 -0.462 -0.259
## 3 2 14 11.650 2.350 0.853
## 3 3 30 15.615 14.385 4.634
## 3 4 13 10.906 2.094 0.782
## 3 5 24 29.991 -5.991 -1.558
## 3 6 2 14.376 -12.376 -4.119
## 4 1 2 2.205 -0.205 -0.151
## 4 2 1 5.758 -4.758 -2.274
## 4 3 10 7.718 2.282 0.968
## 4 4 16 5.390 10.610 5.216
## 4 5 12 14.823 -2.823 -0.967
## 4 6 2 7.105 -5.105 -2.238
## 5 1 0 4.769 -4.769 -2.615
## 5 2 3 12.453 -9.453 -3.357
## 5 3 13 16.692 -3.692 -1.164
## 5 4 5 11.658 -6.658 -2.432
## 5 5 56 32.060 23.940 6.092
## 5 6 16 15.368 0.632 0.206
## 6 1 1 2.410 -1.410 -1.002
## 6 2 2 6.293 -4.293 -1.976
## 6 3 1 8.436 -7.436 -3.037
## 6 4 1 5.892 -4.892 -2.315
## 6 5 8 16.202 -8.202 -2.705
## 6 6 34 7.766 26.234 11.071
##
## --- Inference
##
## Chi-square statistic: 253.103
## Degrees of freedom: 25
## Hypothesis test of equal population proportions: p-value = 0.000
```

Can also aggregate other statistics simultaneously in addition to the
frequency table, though, of course, only meaningful if the aggregated
variable is numerical. Here, create a 3-way cross-tabulation table with
responses to `variable`

*m06* in the column and
responses to `by`

variables *m07* and *m10* in
the rows, plus the mean and standard deviation of each combination of
*m07* and *m10* across levels of *m06*.

`pivot(d, c(mean,sd,table), m06, c(m07, m10))`

```
## m07 m10 m06_n m06_na m06_mean m06_sd 0 1 2 3 4 5
## 1 0 0 1 0 0.000 NA 1 0 0 0 0 0
## 2 1 0 1 0 1.000 NA 0 1 0 0 0 0
## 3 2 0 0 0 0.000 NA 0 0 0 0 0 0
## 4 3 0 1 0 2.000 NA 0 0 1 0 0 0
## 5 4 0 1 0 2.000 NA 0 0 1 0 0 0
## 6 5 0 0 0 0.000 NA 0 0 0 0 0 0
## 7 0 1 0 0 0.000 NA 0 0 0 0 0 0
## 8 1 1 4 0 0.750 0.500 1 3 0 0 0 0
## 9 2 1 2 0 1.500 0.707 0 1 1 0 0 0
## 10 3 1 1 0 2.000 NA 0 0 1 0 0 0
## 11 4 1 0 0 0.000 NA 0 0 0 0 0 0
## 12 5 1 0 0 0.000 NA 0 0 0 0 0 0
## 13 0 2 2 0 2.000 1.414 0 1 0 1 0 0
## 14 1 2 9 0 1.222 1.202 2 5 1 0 1 0
## 15 2 2 15 0 2.000 0.845 0 4 8 2 1 0
## 16 3 2 3 0 3.000 0.000 0 0 0 3 0 0
## 17 4 2 3 0 3.333 1.155 0 0 1 0 2 0
## 18 5 2 0 0 0.000 NA 0 0 0 0 0 0
## 19 0 3 5 0 1.200 1.304 2 1 1 1 0 0
## 20 1 3 16 0 2.250 1.390 2 3 4 3 4 0
## 21 2 3 13 0 2.769 0.832 0 0 6 4 3 0
## 22 3 3 7 0 2.000 1.528 2 0 2 2 1 0
## 23 4 3 9 0 3.333 1.323 0 1 2 0 5 1
## 24 5 3 0 0 0.000 NA 0 0 0 0 0 0
## 25 0 4 7 0 2.714 1.799 1 1 1 1 2 1
## 26 1 4 19 0 2.895 1.449 0 6 1 2 9 1
## 27 2 4 29 0 2.690 1.417 2 4 8 4 9 2
## 28 3 4 18 0 3.278 1.074 0 1 3 6 6 2
## 29 4 4 32 0 3.781 1.008 0 0 6 2 17 7
## 30 5 4 5 0 4.400 0.548 0 0 0 0 3 2
## 31 0 5 2 0 4.500 0.707 0 0 0 0 1 1
## 32 1 5 15 0 2.133 1.685 2 6 1 1 4 1
## 33 2 5 28 0 2.571 1.372 2 5 7 3 11 0
## 34 3 5 13 0 3.154 0.801 0 0 3 5 5 0
## 35 4 5 48 0 3.854 0.922 0 2 3 3 32 8
## 36 5 5 42 0 4.452 1.234 1 2 1 1 5 32
```

Can also express the frequencies as proportions. To convert the
frequencies into proportions, invoke the `table_prop`

parameter. The value of `"all"`

computes cell frequencies
across the entire table. The values of `"row"`

and
`"col"`

compute the proportions with either row sums or
column sums.

In this example of a two-way cross-tabulation table, convert the
table counts to row proportions, that is, the proportion of each level
of *m06* for each combination of levels for *m07*. The sum
of the proportions in each row is 1.0.

`pivot(d, table, m06, m07, table_prop="row")`

```
## m07 m06_n m06_na 0 1 2 3 4 5
## 1 0 17 0 0.235 0.176 0.118 0.176 0.176 0.118
## 2 1 64 0 0.109 0.375 0.109 0.094 0.281 0.031
## 3 2 87 0 0.046 0.161 0.345 0.149 0.276 0.023
## 4 3 43 0 0.047 0.023 0.233 0.372 0.279 0.047
## 5 4 93 0 0.000 0.032 0.140 0.054 0.602 0.172
## 6 5 47 0 0.021 0.043 0.021 0.021 0.170 0.723
```

Return to the *Employee* data set for the remaining
examples.

`<- Read("Employee", quiet=TRUE) d `

One way to understand the characteristics of a distribution of data values of a continuous variable is to sort the values and then split into equal-sized groups. The simplest example is the median, which splits a sorted distribution into two groups of equal size, the bottom lowest values and the top highest values. Quartiles divide the sorted distribution into four groups. The first quartile is the smallest 25% of the data values, etc.

Quantiles: Numeric values, not necessarily data values, that divide a distribution of sorted data values intongroups of equal size.

By default, calling the quantile function computes quartiles. Here
calculate the quartiles for *Years* aggregated across levels of
*Dept* and *Gender*.

`pivot(d, quantile, Years, c(Dept, Gender))`

```
## Dept Gender Years_n Years_na p0 p25 p50 p75 p100
## 1 ACCT M 2 0 5 6.00 7.0 8.00 9
## 2 ADMN M 2 0 7 11.25 15.5 19.75 24
## 3 FINC M 3 0 10 10.00 10.0 12.00 14
## 4 MKTG M 1 0 18 18.00 18.0 18.00 18
## 5 SALE M 9 1 5 9.00 13.0 14.00 21
## 6 <NA> M 0 0 0 NA NA NA NA
## 7 ACCT W 3 0 2 2.50 3.0 6.00 9
## 8 ADMN W 4 0 2 3.50 5.0 9.00 18
## 9 FINC W 1 0 7 7.00 7.0 7.00 7
## 10 MKTG W 5 0 1 4.00 8.0 10.00 18
## 11 SALE W 5 0 2 3.00 8.0 10.00 10
## 12 <NA> W 1 0 5 5.00 5.0 5.00 5
```

To compute other than quartiles, invoke the `q_num`

parameter, the number of quantile intervals. The default value is 4 for
quartiles. In the following example, compute the quintiles for
*Years* and *Salary*, plus the mean and standard
deviation. No specification of `by`

, so these descriptive
statistics are computed over the entire data set for both specified
variables.

`pivot(d, c(mean,sd,quantile), c(Years,Salary), q_num=5, digits_d=2)`

```
## n na mean sd p0 p20 p40 p60 p80 p100
## Years 36 1 9.39 5.72 1.00 4.00 8.00 10.00 14.00 24.0
## Salary 37 0 73795.56 21799.53 46124.97 55737.86 63701.93 72430.04 92415.42 134419.2
```

One data analysis strategy examines the values of a variable, such as Sales for a business or Mortality for an epidemiology study, at progressively finer levels of detail. Examine by Country or State or City or whatever level of granularity is appropriate.

Data drill down: Examine the values of a variable when holding the values of one or more categorical variables constant.

If drilling down into the data, `pivot()`

indicates the
drill-down with a display of all categorical variables with unique
values that precedes the primary output. Initiate the drill-down by a
previous subset of the data frame, or by `pivot()`

directly.
As with other **lessR** analysis functions, the
`rows`

parameter specifies a logical condition for which to
subset rows of the data frame for analysis.

In this example, compute the mean of *Salary* for each level
of *Dept* for just those rows of data with the value of
*Gender* equal to “F”.

`pivot(d, mean, Salary, Dept, rows=(Gender=="W"))`

`## Gender: W`

```
## Dept Salary_n Salary_na Salary_mean
## 1 ACCT 3 0 63237.16
## 2 ADMN 4 0 81434.00
## 3 FINC 1 0 57139.90
## 4 MKTG 5 0 64496.02
## 5 SALE 5 0 64188.25
## 6 <NA> 1 0 53772.58
```

The parentheses for the `rows`

parameter are not
necessary, but does enhance readability.

Can also drill down by subsetting the data frame with a logical
condition directly in the data parameter in the call to
`pivot()`

. Here drill down with base R
`Extract[ ]`

in conjunction with the **lessR**
function `.()`

to simplify the syntax (explained in the
vignette Subset a Data Frame). The `Extract[ ]`

function
specifies the rows of the data frame to extract before the comma, and
the columns to extract after the comma. Here select only those rows of
data with *Gender* declared as Female. There is no information
after the comma, so no columns are specified, which means to retain all
columns, the variables in the data frame.

`pivot(d[.(Gender=="W"),], mean, Salary, Dept)`

`## Gender: W`

```
## Dept Salary_n Salary_na Salary_mean
## 1 ACCT 3 0 63237.16
## 2 ADMN 4 0 81434.00
## 3 FINC 1 0 57139.90
## 4 MKTG 5 0 64496.02
## 5 SALE 5 0 64188.25
## 6 <NA> 1 0 53772.58
```

Specify the sort as part of the call to `pivot()`

with the
parameter `sort`

. This internal sort works for a single
`value`

variable, by default the last column in the output
data frame. Set to `"-"`

for a descending sort. Set to
`"+"`

for an ascending sort.

`pivot(d, mean, Salary, c(Dept, Gender), sort="-")`

```
## Dept Gender Salary_n Salary_na Salary_mean
## 4 MKTG M 1 0 99062.66
## 5 SALE M 10 0 86150.97
## 8 ADMN W 4 0 81434.00
## 2 ADMN M 2 0 80963.35
## 3 FINC M 3 0 72967.60
## 10 MKTG W 5 0 64496.02
## 11 SALE W 5 0 64188.25
## 7 ACCT W 3 0 63237.16
## 1 ACCT M 2 0 59626.19
## 9 FINC W 1 0 57139.90
## 12 <NA> W 1 0 53772.58
## 6 <NA> M 0 0 0.00
```

Specify the `sort_var`

parameter to specify the name or
the column number of the variable to sort.

```
pivot(d, c(mean, median), Salary, c(Gender,Dept), out_names=c("MeanSalary", "MedianSalary"),
sort="-", sort_var="MeanSalary")
```

```
## Gender Dept Salary_n Salary_na MeanSalary MedianSalary
## 7 M MKTG 1 0 99062.66 99062.66
## 9 M SALE 10 0 86150.97 82442.74
## 4 W ADMN 4 0 81434.00 71058.60
## 3 M ADMN 2 0 80963.35 80963.35
## 5 M FINC 3 0 72967.60 66312.89
## 8 W MKTG 5 0 64496.02 61356.69
## 10 W SALE 5 0 64188.25 56508.32
## 2 W ACCT 3 0 63237.16 71084.02
## 1 M ACCT 2 0 59626.19 59626.19
## 6 W FINC 1 0 57139.90 57139.90
## 12 W <NA> 1 0 53772.58 53772.58
## 11 M <NA> 0 0 0.00 NA
```

Because the output of `pivot()`

with no
`by_cols`

variables is a standard R data frame, the external
call to the **lessR** function `Sort()`

is
available for custom sorting by one or more variables. Sort in the
specified direction with the `direction`

parameter.

```
<- pivot(d, mean, Salary, c(Dept, Gender))
a Sort(a, by=Salary_mean, direction="-")
```

```
## Dept Gender Salary_n Salary_na Salary_mean
## 4 MKTG M 1 0 99062.66
## 5 SALE M 10 0 86150.97
## 8 ADMN W 4 0 81434.00
## 2 ADMN M 2 0 80963.35
## 3 FINC M 3 0 72967.60
## 10 MKTG W 5 0 64496.02
## 11 SALE W 5 0 64188.25
## 7 ACCT W 3 0 63237.16
## 1 ACCT M 2 0 59626.19
## 9 FINC W 1 0 57139.90
## 12 <NA> W 1 0 53772.58
## 6 <NA> M 0 0 0.00
```

Specify multiple variables to sort with a vector of variable names,
and a corresponding vector of `"+"`

and `"-"`

signs of the same length for the `directions`

parameter.

The following illustrates as of R 4.1.0 the base R pipe operator
`|>`

with `pivot()`

. The pipe operator by
default inserts the object on the left-hand side of an expression into
the first parameter value for the function on the right-hand side.

In this example, input the *d* data frame into the first
parameter of `pivot()`

, the `data`

parameter. Then
direct the output to the data frame *a* with the standard R
assignment statement, though written pointing to the right hand side of
the expression.

To avoid problems installing this version of **lessR**
from source with a previous version of R, the code is commented out with
a `#`

sign in the first column.

```
#d |> pivot(mean, Salary, c(Dept, Gender)) -> a
#a
```

Specify up to two `by_cols`

categorical variables to
create a two-dimensional table with the specified columns. Specifying
one or two categorical variables as `by_cols`

variables moves
them from their default position in the rows to the columns, which
changes the output structure from a long-form data frame to a
cross-tabulation table with categorical variables in the rows and
columns.

In this example, specify by `by_cols`

variable,
*Gender*.

`pivot(d, mean, Salary, Dept, Gender)`

```
Table: mean of Salary
Gender M W
Dept
------- --------- ---------
ACCT 59626.20 63237.16
ADMN 80963.34 81434.00
FINC 72967.60 57139.90
MKTG 99062.66 64496.02
SALE 86150.97 64188.25
NA 0.00 53772.58
```

Here two `by_cols`

variables, specified as a vector. There
is much missing data for this three-way classification as there is not
much data in each group, with many groups having no data.

`pivot(d, mean, Salary, Dept, c(Gender, Plan))`

```
Table: mean of Salary
Gender M W
Plan 1 2 3 1 2 3
Dept
------- --------- ---------- --------- --------- --------- ---------
ACCT 69547.60 0.00 49704.79 0.00 63237.16 0.0
ADMN 53788.26 108138.43 0.00 0.00 67724.21 122563.4
FINC 61937.62 0.00 95027.55 0.00 57139.90 0.0
MKTG 0.00 0.00 99062.66 56772.95 66426.79 0.0
SALE 89393.40 82442.74 80131.91 60941.54 66352.73 0.0
NA 0.00 0.00 0.00 53772.58 0.00 0.0
```

There are two different levels of missing data in an aggregation:
*some* missing values for one or more aggregated
`variable`

s for which the statistic is computed, and
*all* missing values for the `variable`

so that the
group (cell) defined by one or more `by`

variables has no
data values. Accordingly, there are two different missing data
parameters with `pivot()`

.

`na_remove=TRUE`

: Remove any missing data from a value of the`variable`

that leaves remaining values, then perform the aggregation on the remaining values, reporting how many values are missing, otherwise report the computed statistic as NA (missing). [default is`TRUE`

].`na_by_show=TRUE`

: If all values of`variable`

are missing for a group so that the entire level of the`by`

variables is missing, show those missing cells with a reported value of computed variable*n*as 0, otherwise delete from the output [default is`TRUE`

].

By default, the data frame output of `pivot()`

lists the
number of occurrences of missing data for each group of the variable
over which the statistic is computed. The `na_remove`

parameter specifies the value of the base R parameter `na.rm`

for computations such as for the `mean`

. The parameter value
is either `TRUE`

or `FALSE`

, indicating if NA
(missing) values of the variable should be removed before the
computation proceeds. The default value `na_remove`

is
`TRUE`

, so the missing values are removed, their number of
occurrences reported, and then the computed statistic displayed.

In this example, the variable *Years* has one missing value,
which occurs in the Sales department.

`pivot(d, mean, Years, Dept)`

```
## Dept Years_n Years_na Years_mean
## 1 ACCT 5 0 5.600
## 2 ADMN 6 0 10.167
## 3 FINC 4 0 10.250
## 4 MKTG 6 0 9.833
## 5 SALE 14 1 10.286
## 6 <NA> 1 0 5.000
```

Set `na_remove`

to `FALSE`

to *not*
remove any missing data in a cell with values to be aggregated.
According to the way in which R functions process the data with missing
values entered into the function, the resulting computed value will be
missing if any of the constituent data values are missing.

`pivot(d, mean, Years, Dept, na_remove=FALSE)`

```
## Dept Years_n Years_na Years_mean
## 1 ACCT 5 0 5.600
## 2 ADMN 6 0 10.167
## 3 FINC 4 0 10.250
## 4 MKTG 6 0 9.833
## 5 SALE 14 1 0.000
## 6 <NA> 1 0 5.000
```

`by`

VariablesAnother possibility is to account for missing values of the
categorical `by`

variables for which there are no values for
that cell, all values are missing. For this example, first create a cell
with no values in the data aggregation. Use base R
`Extract[]`

with **lessR** `.()`

to
drop the one male in the Sales department, leaving no data values for
that group. Save the result into the *dd* data frame.

`<- d[.(!(Gender=="M" & Dept=="SALE")), ] dd `

Explicitly set `na_by_show`

to `TRUE`

, the
default value. The group for male sales employees is shown with the
values of the computed variables *n_* and *na_* set to 0
and the values of all other variables necessarily missing.

`pivot(dd, c(mean,median), Years, c(Dept, Gender), na_by_show=TRUE)`

```
## Dept Gender Years_n Years_na Years_mean Years_md
## 1 ACCT M 2 0 7.000 7.0
## 2 ADMN M 2 0 15.500 15.5
## 3 FINC M 3 0 11.333 10.0
## 4 MKTG M 1 0 18.000 18.0
## 5 SALE M 0 0 0.000 NA
## 6 <NA> M 0 0 0.000 NA
## 7 ACCT W 3 0 4.667 3.0
## 8 ADMN W 4 0 7.500 5.0
## 9 FINC W 1 0 7.000 7.0
## 10 MKTG W 5 0 8.200 8.0
## 11 SALE W 5 0 6.600 8.0
## 12 <NA> W 1 0 5.000 5.0
```

Drop the groups from the output with missing data for a
`by`

variable. To do so, set `na_by_show`

to
`FALSE`

. Now the group for the non-existent male sales
employee does not display.

`pivot(dd, c(mean,median), Years, c(Dept, Gender), na_by_show=FALSE)`

```
## Dept Gender Years_n Years_na Years_mean Years_md
## 1 ACCT M 2 0 7.000 7.0
## 2 ADMN M 2 0 15.500 15.5
## 3 FINC M 3 0 11.333 10.0
## 4 MKTG M 1 0 18.000 18.0
## 7 ACCT W 3 0 4.667 3.0
## 8 ADMN W 4 0 7.500 5.0
## 9 FINC W 1 0 7.000 7.0
## 10 MKTG W 5 0 8.200 8.0
## 11 SALE W 5 0 6.600 8.0
## 12 <NA> W 1 0 5.000 5.0
```

Any function that processes a single column of data and returns a
single value can be accessed with `pivot()`

. In this example,
define a function named *mnmd()* that computes the difference
between the `mean`

and `median`

of a distribution
of data values, here represented by `x`

. For technical
reasons, need to include the `NA`

remove parameter,
`na.rm`

, in the function definition. Usually set to the value
of `na.rm`

to `TRUE`

to allow for missing data to
be dropped. If missing data values are to be dropped in internal
function calls, then the base R parameter `na.rm`

needs to be
set for each component function as well.

`<- function(x, na.rm=TRUE) mean(x, na.rm=na.rm) - median(x, na.rm=na.rm) mnmd `

Invoke `pivot()`

to compute the mean, the median, and
their difference for each group in the aggregation of the variable
*Years*.

`pivot(d, c(mean, median, mnmd), Years, by=Dept)`

```
## Dept Years_n Years_na Years_mean Years_md Years_mnmd
## 1 ACCT 5 0 5.600 5.0 0.600
## 2 ADMN 6 0 10.167 6.5 3.667
## 3 FINC 4 0 10.250 10.0 0.250
## 4 MKTG 6 0 9.833 9.0 0.833
## 5 SALE 14 1 10.286 10.0 0.286
## 6 <NA> 1 0 5.000 5.0 0.000
```

`pivot()`

vs `aggregate()`

The **lessR** `pivot()`

function relies upon
the **base R** function `aggregate()`

for
aggregation. By default, except for the `table`

computation,
`pivot()`

generates a long-form data frame pivot table (Excel
terminology), which can then be directly input into analysis and
visualization functions as a standard data frame. The levels across all
the `by`

grouping variables are listed in the rows. If there
are specified column grouping variables according to
`by_cols`

, `pivot()`

relies upon base R
`reshape()`

to form a 2-d table for direct viewing instead of
a data table to input into further analysis functions.

`pivot()`

provides additional features than
`aggregate()`

provides.

- For each
`value`

over which to aggregate, the sample size and number of missing values for each group is provided. - Multiple statistical functions can be selected for which to
`compute`

the aggregated value for each group. - Extends beyond aggregation to
`compute`

statistics over the entire data set instead of groups of data. - Missing data analysis by cell or by the
`value`

aggregated. - Aggregation not necessary, so can
`compute`

the specified statistic(s) for each`variable`

across the entire data set. - The aggregated computations can be displayed as a 2-d table, not just a long-form data frame.
`by`

variables of type`Date`

retain the same variable type in the summary table instead of each converted to a`factor`

(set`factors=FALSE`

to not convert`integer`

and`character`

variables to factors).- The list of parameters lists the
`data`

parameter first, which facilitates the use of the pipe operator, such as from base R as of Version 4.1.0 or the**magrittr**package. - Any non-numeric variables with unique values in the submitted
`data`

are listed with their corresponding data value, which identifies when drilling down into the data to study relevant rows.

Although the `pivot()`

function does considerably extend
the functionality of base R `aggregate()`

,
`pivot()`

does rely upon the base R function for most of its
computations.