This is how you fit regional_mix model in ecomix.

regional_mix(
  rcp_formula = NULL,
  species_formula = NULL,
  data,
  nRCP = 3,
  family = "bernoulli",
  offset = NULL,
  weights = NULL,
  control = list(),
  inits = "random2",
  titbits = TRUE,
  power = 1.6
)

regional_mix.multifit(
  rcp_formula = NULL,
  species_formula = NULL,
  data,
  nRCP = 3,
  family = "bernoulli",
  offset = NULL,
  weights = NULL,
  control = list(),
  inits = "random2",
  titbits = FALSE,
  power = 1.6,
  nstart = 10,
  mc.cores = 1
)

Arguments

rcp_formula

an object of class "formula" (or an object that can be coerced to that class). The response variable (left hand side of the formula) needs to be either 'presence', 'occurrence', 'abundance', 'biomass' or 'quantity' this will help specify the type of data to be modelled, if the response variable is disperate to the model family an error will be thrown. The dependent variables (the right hind side) of this formula specifies the dependence of the region of common profile (rcp) probabilities on covariates.

species_formula

an object of class "formula" (or an object that can be coerced to that class). The left hand side of this formula should be left empty (it is removed if it is not empty). The right hand side of this formula specifies the dependence of the species"'" data on covariates (typically different covariates to rcp_formula to avoid confusing confounding). An example formula is observations ~ gear_type + time_of_day, where gear_type describes the different sampling gears and time_of_day describes the time of the sample. #maybe could call this detection/bias

data

a List which contains named objects 'species_data': a data frame containing the species information. The frame is arranged so that each row is a site and each column is a species. Species names should be included as column names otherwise numbers from 1:S are assigned. And 'covariate_data' a data frame containing the covariate data for each site. Names of columns must match that given in rcp_formula and species_formula.

nRCP

The number of mixing components (groups) to fit.

family

The family of statistical family to use within the ecomix models. a choice between "bernoulli", "poisson", "negative.binomial", "tweedie" and "gaussian" distributions are possible and applicable to specific types of data.

offset

a numeric vector of length nrow( data) that is included into the model as an offset. It is included into the conditional part of the model where conditioning is performed on the unobserved RCP type. Note that offsets cannot be included as part of the rcp_formula or species_formula arguments ??? only through this argument.

weights

a numeric vector of length nrow( data) that is used as weights in the log-likelihood calculations. If NULL (default) then all weights are assumed to be identically 1.

control

a list of control parameters for optimisation and calculation. See details. From control control.

inits

a character string which defines the method used to initialise finite mixture model clustering. #Will have to synergise this function call across RCP and SpeciesMix. Looks like SpeciesMix uses a em.prefit to setup initialisations. regional_mix has a number of methods. This seems like a good place to setup the bivariate clustering step - cobra function.

titbits

either a boolean or a vector of characters. If TRUE (default for regional_mix(qv)), then some objects used in the estimation of the model"'"s parameters are returned in a list entitled "titbits" in the model object. Some functions, for example plot.regional_mix(qv) and predict.regional_mix(qv), will require some or all of these pieces of information. If titbits=FALSE (default for regional_mix.multifit(qv)), then an empty list is returned. If a character vector, then just those objects are returned. Possible values are:"Y" for the outcome matrix, "X" for the model matrix for the RCP model, "W" for the model matrix for the species-specific model, "offset" for the offset in the model, "wts" for the model weights, "form.RCP" for the formula for the RCPs, "form.spp" for the formula for the species-specific model, "control" for the control arguments used in model fitting, "family" for the conditional family of the species data, and "power" for the power parameters used (only used in Tweedie models). Care needs to be taken when using titbits=TRUE in regional_mix.multifit(qv) calls as titbits is created for EACH OF THE MODEL FITS. If the data is large or if nstart is large, then setting titbits=TRUE may give users problems with memory. It is more efficient, from a memory perspective, to refit the "best" model using regional_mix(qv) after identifying it with regional_mix.multifit(qv). See examples for illustration about how to do this.

power

a numeric vector (length either 1 or the number of species) defining the power parameter to use in the Tweedie models. If length(power)==1, then the same power parameter is used for all species. If length(power)==No_species, then each species gets its own power parameter. Power values must be between 1 and 2, for computational reasons they should be well away from the boundary. The default is 1.6 as this has proved to be a good ball-park value for the fisheries data that the developer has previously analysed.

nstart

for regional_mix.multifit only. The number of random starts to perform for re-fitting. Default is 10, which will need increasing for serious use.

mc.cores

for regional_mix.multifit only. The number of cores to spread the re-fitting over.

Value

regional_mix returns an object of class regional_mix and regional_mix.multifit returns a list of objects of class regional_mix. The regional_mix class has several methods: coef, plot, predict, residuals, summary, and vcov. The regional_mix object consists of a list with the following elements:

AIC Akaike an information criterion for the maximised model.

BIC Bayesian information criterion for the maximised model.

call the call to the function.

coefs a list of three elements, one each for the estimates for the species prevalence (alpha), the deviations from alpha for the first (nRCP-1) RCP (tau), and the (nRCP-1) sets of RCP regression coefficients (beta).

conv the convergence code from the maximisation procedure. See ?optim for an explanation (basically 0 is good and anything else is bad).

dist the character string identifying the family used for the model.

logCondDens an nObs by nRCP matrix specifying the probability of observing each sites"'" data, given each of the RCP types.

logl the maximised log likelihood.

mus an array of size nRCP x S x nRCP where each element of the first dimension is the fitted value for all the species in all the RCP types.

n the number of samples.

names the names of the species, and the names of the covariates for X and W.

nRCP the number of RCPs.

pis an n x nRCP matrix with each column giving the prior probabilities for the corresponding RCP type. Rows sum to one.

postProbs an n x nRCP matrix with each column giving the posterior probabilities for the corresponding RCP type. Rows sum to one (as each site is assumed to be from one of the RCP types).

p.w the number of covariates used in the species-specific model.

p.x the number of covariates used in the RCP model

S the number of species.

scores a list of three elements. Structure corresponds to coefs.

start.vals the values used to start the estimation procedure.

titbits (if requested using the titbit argument, see above) other pieces of information, useful to developers, that users should not typically need to concern themselves with. However, this information is used by methods for regional_mix objects.

Details

A typical formula for use in the rcp_formula argument will have the form (for example) cbind(spp1,spp2,spp3,spp4)~1+cov1+cov2*cov3. This signifies that there are 4 species to be used for RCP modelling and that the RCP types are dependent on cov1+cov2+cov3+cov2:cov3. See ?glm for a description of how the right hand side of the formula is expanded.

Likewise a typical formula for use in the species_formula argument will have the form (for example) ~1+fac1+cov1. This signifies that the catchabilities of each species depends upon the levels of the factor fac1 and the covariate cov1. See ?glm for a description of how the right hand side of the formula is expanded.

The computation strategy for the default method, which has been demonstrated to work for all data sets the developers have encountered thus far, is fully described in Foster et al (2013) and Foster et al (2017). We note however, that it is a good idea to standardise covariates prior to calling regional_mix. This is not formally required by the model, but it does drastically reduce the chance of numerical issues in the first iteration. If you choose to NOT standardise, then you should at least choose a scale that is reasonable (so that the numerical range is measured by units and not thousands of units). This may mean that the units may be, for example, kilometres (and not metres), or 100s of kilometres (and not metres/kilometres).

We do not, on purpose, provide residuals as a routine part of the model. Users should use the residuals.regional_mix(qv) function to obtain them. We do this as the type of residual needs to be specified (although we recommend type=="RQR" for routine use).

Control arguments for optimisation generally follow those in optim(qv), although a few differences occur (e.g. "loglOnly"). The elements of the control list are

maxit

The maximum number of iterations. Default is 500.

quiet

Should any reporting be performed? Default is FALSE, for reporting. For regional_mix.multifit(), this indicates if the progress should be printed.

trace

Non-negative integer. If positive, tracing information on the progress of the optimization is produced. Higher values may produce more tracing information.

nreport

The frequency of reports for optimisation. Default is 10 ??? a report for 10th iteration.

reltol

Relative convergence tolerance. The algorithm stops if it is unable to reduce the value by a factor of reltol * (abs(val) + reltol) at a step. Defaults to sqrt(.Machine$double.eps), typically about 1e-8.

optimise

Should optimisation for estimation occur? If TRUE (default) optimisation will occur. If FALSE no optimisation is performed.

loglOnly

Should the log-likelihood be caulcated? If TRUE (default) then log-likelihood is calculated and returned. If FALSE then the log-likelihood is not calculated for return.

derivOnly

Should the scores be evaluated at the (final) parameter values. If TRUE (default) then they are calculated. If FALSE then they are not calculated.

penalty

A numeric scalar. This is the concentration for the Dirichlet-inspired penalty for the prior probabilities. Values less than zero will be set to the default (0.1). Large values give more penalisation than small ones.

penalty.tau

A numeric scalar. This is the penalty for the tau parameters in the species model. They are assumed to come from a normal distribution with standard deviation given as this parameter (default is 10).

penalty.gamma

A numeric scalar. This is the penalty for the gamma parameters in the species model. They are assumed to come from a normal distribution with standard deviation given as this parameter (default is 10).

penalty.disp

a two element vector. These are combined to form the penalty for the dispersion parameters (if any). The dispersions are assumed to come from a log-normal distribution with log-mean penalty.disp[1] and log-standard-deviation penalty.disp[2]. Defaults to c(10,sqrt(10)), which gives shrinkage towards 1 (the mode of the penalty). Note that for Normal models, where the dispersion alone defines the variance, a strong penalty may be required to keep parameters estimable.

For calls to regimix.multifit(), titbits is set to FALSE??? so no excess memory is used. If users want this information, and there is good reason to want it, then a call to regimix() with starting values given as the best fit's estimates should be used.

Examples

if (FALSE) { rcp_form <- as.formula(paste0("cbind(",paste(colnames(simulated_data[,1:20]), collapse = ','),")~1+x1+x2+x3")) spp_form <- observations ~ 1 + w1 + w2 fm_regional_mix <- regional_mix(rcp_formula=rcp_form,species_formula=spp_form, data=data, family='bernoulli', nRCP=5) }