---
title: "Getting started with spicy"
description: >
  Get started with spicy for descriptive statistics, variable inspection,
  frequency tables, cross-tabulations, association measures, categorical
  and continuous summary tables, regression coefficient tables, and
  codebooks in R. A tidyverse-friendly alternative to SPSS and Stata for
  survey and labelled data workflows.
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Getting started with spicy}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
build_rich_tables <- identical(Sys.getenv("IN_PKGDOWN"), "true")
```

```{r setup}
library(spicy)
```

spicy is an R package for descriptive statistics and data analysis, designed for data science and survey research workflows. It covers variable inspection, frequency tables, cross-tabulations with chi-squared tests and effect sizes, and publication-ready summary tables, offering functionality similar to Stata or SPSS but within a tidyverse-friendly R environment.

This vignette walks through the core workflow using the bundled [`sochealth`](../reference/sochealth.html) dataset, a simulated social-health survey with 1200 respondents and 24 variables.

## Inspect your data

`varlist()` (or its shortcut `vl()`) gives a compact overview of every variable in a data frame: name, label, representative values, class, number of distinct values, valid observations, and missing values. In RStudio or Positron, calling `varlist()` without arguments opens an interactive viewer - this is the most common usage in practice.
Here we use `tbl = TRUE` to produce static output for the vignette:

```{r varlist}
varlist(sochealth, tbl = TRUE)
```

You can also select specific columns with tidyselect syntax:

```{r varlist-select}
varlist(sochealth, starts_with("bmi"), income, weight, tbl = TRUE)
```

## Frequency tables

`freq()` produces frequency tables with counts, percentages, and (optionally) valid and cumulative percentages.

```{r freq}
freq(sochealth, education)
```

Weighted frequencies use the `weights` argument. With `rescale = TRUE`, the total weighted N matches the unweighted N:

```{r freq-weighted}
freq(sochealth, education, weights = weight, rescale = TRUE)
```

## Cross-tabulations

`cross_tab()` crosses two categorical variables. By default it shows counts, a chi-squared test, and Cramer's V:

```{r crosstab}
cross_tab(sochealth, smoking, education)
```

Add percentages with `percent`:

```{r crosstab-pct}
cross_tab(sochealth, smoking, education, percent = "col")
```

Group by a third variable with `by`:

```{r crosstab-by}
cross_tab(sochealth, smoking, education, by = sex)
```

When both variables are ordered factors, `cross_tab()` automatically selects an ordinal measure (Kendall's Tau-b) instead of Cramer's V:

```{r crosstab-ordinal}
cross_tab(sochealth, self_rated_health, education)
```

## Association measures

For a quick overview of all available association statistics, pass a contingency table to `assoc_measures()`:

```{r assoc-measures}
tbl <- xtabs(~ smoking + education, data = sochealth)
assoc_measures(tbl)
```

Individual functions such as `cramer_v()`, `gamma_gk()`, or `kendall_tau_b()` return a scalar by default.
Pass `detail = TRUE` for the confidence interval and p-value:

```{r cramer-detail}
cramer_v(tbl, detail = TRUE)
```

## Summary tables

`table_categorical()` covers grouped or one-way summary tables for categorical variables:

```{r table-categorical-tt, eval = build_rich_tables}
table_categorical(
  sochealth,
  select = c(smoking, physical_activity, dentist_12m),
  by = education,
  output = "tinytable"
)
```

`table_continuous()` summarizes continuous variables, either overall or by a categorical `by` variable, and can also add group-comparison tests:

```{r table-continuous}
table_continuous(
  sochealth,
  select = c(bmi, life_sat_health),
  by = education
)
```

`table_continuous_lm()` covers the same reporting territory when you want to stay in a linear-model framework, for example with robust or cluster-robust standard errors, case weights, or additive covariate adjustment:

```{r table-continuous-lm}
table_continuous_lm(
  sochealth,
  select = c(wellbeing_score, bmi),
  by = sex,
  vcov = "HC3"
)
```

`table_regression()` reports the full coefficient table for one or several fitted `lm()` or `glm()` models, with APA-aligned formatting, factor grouping with reference rows, robust variance, standardised coefficients, average marginal effects, hierarchical comparisons, and side-by-side multi-model layouts:

```{r table-regression}
fit <- lm(wellbeing_score ~ age + sex + smoking, data = sochealth)
table_regression(fit)
```

For detailed guidance, see the dedicated articles on `table_categorical()`, `table_continuous()`, `table_continuous_lm()`, `table_regression()`, and the final reporting overview for APA-style summary tables.

## Row-wise summaries

`mean_n()`, `sum_n()`, and `count_n()` compute row-wise statistics across selected columns, with automatic handling of missing values.

```{r mean-n}
sochealth |>
  dplyr::mutate(
    mean_sat = mean_n(select = starts_with("life_sat")),
    sum_sat = sum_n(select = starts_with("life_sat"), min_valid = 2),
    n_missing = count_n(select = starts_with("life_sat"), special = "NA")
  ) |>
  dplyr::select(starts_with("life_sat"), mean_sat, sum_sat, n_missing) |>
  head() |>
  as.data.frame()
```

## Learn more

- See `?varlist` to inspect variables, labels, values, and missing data.
- See `?freq` for one-way frequency tables (weights, sorting, custom missing values, labelled-data display modes).
- See `?cross_tab` for the full list of arguments (weights, simulation, association measures).
- See `?assoc_measures` for the complete list of association statistics; `?cramer_v` for the canonical entry point.
- See `?table_categorical` for grouped or one-way categorical tables.
- See `?table_continuous` for continuous summaries and group comparisons.
- See `?table_continuous_lm` for model-based mean-comparison tables with robust / cluster-robust / bootstrap / jackknife SE, case weights, or additive covariate adjustment.
- See `?table_regression` for `lm` / `glm` coefficient tables with APA-aligned formatting, robust variance, standardised coefficients, average marginal effects, hierarchical comparisons, and side-by-side multi-model layouts.
- See `?mean_n`, `?sum_n`, `?count_n` for row-wise summaries with optional minimum-valid-values rules.
- See `?code_book` to generate an interactive HTML codebook; `?label_from_names` to derive variable labels from `"code. label"`-style column names (e.g., LimeSurvey exports).
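
A side note on the `rescale = TRUE` option from the frequency-tables section, where the total weighted N matches the unweighted N. One common way to obtain that property is to multiply each weight by `n / sum(w)`. The base R sketch below illustrates the arithmetic on made-up data. The object names and values are hypothetical, and this shows the general technique under that assumption, not necessarily how `freq()` is implemented.

```{r rescale-sketch}
# Toy data (hypothetical): five respondents with unequal sampling weights
education <- factor(
  c("low", "low", "mid", "high", "high"),
  levels = c("low", "mid", "high")
)
w <- c(0.5, 1.5, 2.0, 1.0, 3.0)

# Rescale so that the weights sum to the number of observations
# (assumed technique: multiply each weight by n / sum(w))
w_rescaled <- w * length(w) / sum(w)
sum(w_rescaled)

# Weighted counts per category, before and after rescaling
raw_counts <- tapply(w, education, sum)
rescaled_counts <- tapply(w_rescaled, education, sum)

# Percentages are the same either way; only the totals differ
round(100 * prop.table(raw_counts), 1)
round(100 * prop.table(rescaled_counts), 1)
```

Because every weight is multiplied by the same constant, rescaling of this kind leaves the percentages unchanged and only affects the reported totals.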