USER GUIDE  

Five things users should know before using BioVinci

Drag-and-drop

Drag-and-drop is the basis for operating in BioVinci. Draggable items are columns of your tabular data or even the whole table.  

Use existing templates

Sometimes, bringing data into a ready-to-process format can be time-consuming. To save you time, we provide a list of commonly used templates (examples). You can use these templates to get a sense of the data structures that are valid for an analysis.

All details are customizable

Forget about using another SVG/photo editor to fine-tune your plots. BioVinci allows you to customize your plots in detail, including ticks, scales, rotations, fonts, sizes, colors, styles, etc. You can visit the Edit functions section to see what the software can do.

Your data are safe

We always treat your data with the highest level of security and privacy. First, BioVinci does not change the original inputs (your XLSX or CSV files) but instead operates on a copy of the input files. Second, in the app version, these copies are stored in a default BioTData directory in your home folder. If you want to back up, archive, transfer, or copy all of your BioVinci data, you need to copy this entire folder. (Later, we hope to allow you to easily select individual projects.)

Speak out about your needs

Behind BioVinci is a team of agile computational gurus. If you need a new feature, please don’t hesitate to drop us a line at: info@bioturing.com 

SPECIAL NOTE FOR MACINTOSH USERS. To access certain features, you will need to perform a “right-click”. How to do this depends on your system: select “System Preferences…” in the Apple menu and then choose “Mouse” or “Trackpad”. Then make sure that “Secondary click” is enabled (for trackpad users, for example, this is under the “Point & Click” subsection and requires a two-finger click).


Basic plots

There are 8 types of basic plot in BioVinci. Each plot type has a different set of placeholders, which are on the left side (next to the data tree). To create a plot, drag one or more columns from the data tree to the placeholders of a plot type.

Each plot type also has a unique set of options, which can be accessed by right-clicking on the plot area. Macintosh users should make sure that “secondary click” is enabled (see the special note for Macintosh users above).

Density and Histogram

To illustrate how to create different types of histogram and density plots, we use the following example data. The table contains the concentrations of 6 markers (A to F) measured in 8 patients across three tissues (liver, pancreas, and kidney), each sample being either benign or malignant.

         

Sample    | Tissue   | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F
Patient A | Liver    | Benign    | 2.05   | 1.59   | 1.63   | 24.73  | 3.54   | 1.66
Patient A | Pancreas | Benign    | 2.29   | 0.92   | 1.95   | 21.21  | 8.55   | 0.91
Patient B | Liver    | Benign    | 0.84   | 1.08   | 0.13   | 20.06  | 8.68   | 1.34
Patient C | Kidney   | Benign    | 2.92   | 1.43   | 1.08   | 22.87  | 7.14   | 2.01
Patient C | Liver    | Benign    | 1.87   | 0.99   | 1.11   | 20.28  | 2.96   | 0.73
Patient C | Pancreas | Benign    | 1.38   | 0.82   | 1.46   | 21.88  | 1.32   | 1.04
Patient D | Kidney   | Benign    | 0.43   | 1.75   | 0.06   | 21.2   | 2.93   | 2.06
Patient D | Pancreas | Benign    | 1.06   | 0.79   | 0.23   | 24.88  | 2.66   | 0.78
Patient E | Kidney   | Malignant | 1.06   | 1.41   | 0.38   | 20.21  | 6.52   | 0.52
Patient E | Liver    | Malignant | 1.14   | 0.82   | 0.34   | 21.04  | 5.77   | 1.51
Patient F | Kidney   | Malignant | 1.02   | 1.28   | 0.13   | 21.15  | 1.87   | 3.16
Patient G | Kidney   | Malignant | 1.33   | 1.22   | 0.51   | 21.58  | 1.23   | 0.92
Patient G | Pancreas | Malignant | 0.58   | 0.7    | 0.41   | 24.39  | 9.37   | 1.07
Patient H | Kidney   | Malignant | 2.56   | 1.82   | 0.89   | 20.81  | 1.91   | 0.33
Patient H | Liver    | Malignant | 1.75   | 1.37   | 1.41   | 23.58  | 10.12  | 2.22
Patient H | Pancreas | Malignant | 0.3    | 0.44   | 1.23   | 21.98  | 10.83  | 0.08

Types

Single-group

A single-group histogram / density plot shows the distribution of only one variable.

To create this kind of plot, you can drag a numerical variable to the Value placeholder. In this example, we use column Conc A.

If you want to create a density plot for Conc A instead, you can customize the plot options using right-click (please see the special note for Macintosh users). We explain these options in the Options section below.

Multiple-group histogram

A multiple-group histogram / density plot shows the distribution of two or more variables.

Case 1

You can use a numerical variable as Value and a categorical variable as Color.  In this example, we used Conc A and Phenotype, respectively. The software created 2 histogram/density curves for the two groups defined in Phenotype.

Case 2

You can put multiple numerical variables in the Value placeholder to create a multiple-group histogram/density plot. In this example, we use Conc A and Conc B for Value.

Split mode

You can split your histogram into different groups by using a categorical variable as Split by. In this example, we use Conc A, Phenotype, and Tissue as Value, Color, and Split by, respectively. The software first split the data into three parts, which are Liver, Pancreas, and Kidney. Then it created the plot separately for each part.

Options

Components

This option controls which components the plot contains: a histogram, a density plot, or both.

Histogram (default)

Your histogram will display the frequency of different ranges in bars.

Density

Your plot will show a curve instead. This is the kernel density curve of your data. You can turn on both histogram and density plot options.

Position

This option determines how different groups appear on the histogram.

Overlap (default)

All the groups will overlap each other and appear in a lexicographical order.

Fill

All the groups are stacked and scaled so that they add up to 100 percent, in lexicographical order.

Stack

It stacks all groups similarly to the Fill position but does not scale the total value to 100 percent.

Annotation

You can add a vertical line to the plot to mark the mean or median with this option.

Normalization

This option controls how the software calculates the height of the bars in the histogram. Thus, you can only see changes if the Histogram (in Components) is activated.

Normal (default)

The height of a bar is equal to the number of values that fall in its range.

Percent

The height of a bar is equal to the proportion of the counts. The total of the bars’ heights is equal to 100.

Probability

The height of a bar is equal to the proportion of the counts. The value is scaled to 1. Thus, the total of the bars’ heights is equal to 1.

Density

The height of a bar is equal to the density of the value in its range. It is equal to the counts divided by the range’s size.

Probability density

The height of a bar is equal to the probability of the density. It is equal to the density divided by the number of values in the variable.
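The five modes can be compared side by side on a small example. The sketch below is an illustration written for this guide, not BioVinci's own code: the function name bar_heights and the two bin edges are hypothetical choices, and the values are the Conc A column of the example table.

```python
def bar_heights(values, edges, mode="normal"):
    """Return one bar height per bin; `edges` lists the bin boundaries in order."""
    n = len(values)
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            # Bins are half-open [low, high); the last bin also includes its upper edge.
            in_bin = edges[i] <= v < edges[i + 1]
            at_top = i == n_bins - 1 and v == edges[-1]
            if in_bin or at_top:
                counts[i] += 1
                break
    widths = [edges[i + 1] - edges[i] for i in range(n_bins)]
    if mode == "normal":
        return counts
    if mode == "percent":
        return [100 * c / n for c in counts]
    if mode == "probability":
        return [c / n for c in counts]
    if mode == "density":
        return [c / w for c, w in zip(counts, widths)]
    if mode == "probability density":
        return [c / w / n for c, w in zip(counts, widths)]
    raise ValueError(f"unknown mode: {mode}")


# Conc A column of the example data (16 measurements).
conc_a = [2.05, 2.29, 0.84, 2.92, 1.87, 1.38, 0.43, 1.06,
          1.06, 1.14, 1.02, 1.33, 0.58, 2.56, 1.75, 0.3]
edges = [0.0, 1.5, 3.0]  # hypothetical bins, chosen only for illustration

for mode in ("normal", "percent", "probability", "density", "probability density"):
    print(mode, bar_heights(conc_a, edges, mode))
```

With these definitions, the Percent heights add up to 100, the Probability heights add up to 1, and the Probability density heights multiplied by the bin widths add up to 1.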

Scale Y

This option applies a log transform to the Y axis. Linear, which is the normal scale, is set by default.

Bins

At the moment, users cannot change the number or the size of the bins of a histogram. Both follow Sturges' rule, with a maximum of 50 bins.
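As a rough guide to how many bins to expect, Sturges' rule is commonly written as k = ceil(log2(n)) + 1 for n data points. The sketch below assumes that common form together with the 50-bin cap stated above; the function name sturges_bins is hypothetical, and BioVinci's exact rounding is not documented here.

```python
import math


def sturges_bins(n, max_bins=50):
    """Bin count from Sturges' rule, ceil(log2(n)) + 1, capped at max_bins."""
    if n < 1:
        raise ValueError("need at least one value")
    return min(math.ceil(math.log2(n)) + 1, max_bins)


# The example table has 16 rows, so each Conc column has 16 values.
print(sturges_bins(16))
```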

Pie Chart

Pie charts present data in proportions, where the arc length of each slice of the chart is proportional to the quantity it represents. In this section, we illustrate how to construct pie charts using the familiar example data below.

Sample    | Tissue   | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F
Patient A | Liver    | Benign    | 2.05   | 1.59   | 1.63   | 24.73  | 3.54   | 1.66
Patient A | Pancreas | Benign    | 2.29   | 0.92   | 1.95   | 21.21  | 8.55   | 0.91
Patient B | Liver    | Benign    | 0.84   | 1.08   | 0.13   | 20.06  | 8.68   | 1.34
Patient C | Kidney   | Benign    | 2.92   | 1.43   | 1.08   | 22.87  | 7.14   | 2.01
Patient C | Liver    | Benign    | 1.87   | 0.99   | 1.11   | 20.28  | 2.96   | 0.73
Patient C | Pancreas | Benign    | 1.38   | 0.82   | 1.46   | 21.88  | 1.32   | 1.04
Patient D | Kidney   | Benign    | 0.43   | 1.75   | 0.06   | 21.2   | 2.93   | 2.06
Patient D | Pancreas | Benign    | 1.06   | 0.79   | 0.23   | 24.88  | 2.66   | 0.78
Patient E | Kidney   | Malignant | 1.06   | 1.41   | 0.38   | 20.21  | 6.52   | 0.52
Patient E | Liver    | Malignant | 1.14   | 0.82   | 0.34   | 21.04  | 5.77   | 1.51
Patient F | Kidney   | Malignant | 1.02   | 1.28   | 0.13   | 21.15  | 1.87   | 3.16
Patient G | Kidney   | Malignant | 1.33   | 1.22   | 0.51   | 21.58  | 1.23   | 0.92
Patient G | Pancreas | Malignant | 0.58   | 0.7    | 0.41   | 24.39  | 9.37   | 1.07
Patient H | Kidney   | Malignant | 2.56   | 1.82   | 0.89   | 20.81  | 1.91   | 0.33
Patient H | Liver    | Malignant | 1.75   | 1.37   | 1.41   | 23.58  | 10.12  | 2.22
Patient H | Pancreas | Malignant | 0.3    | 0.44   | 1.23   | 21.98  | 10.83  | 0.08

Types

Standard pie chart

This is the most common pie chart: a single circle with one or more colors. You can create this kind of plot in many ways.

Case 1 - Using a non-numerical variable:

To see the proportion of different tissues used in the experiments, we can drag the Tissue variable into the Value placeholder. The software will count the number of times each value appears in that variable to calculate its proportion.

Case 2 - Using multiple numerical variables (vectors)

You can use a numerical column as Value and a categorical column as Color to split this column into multiple numerical vectors. The software will add up the values based on the group assigned in Color.
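The two cases differ only in how each slice total is obtained: by counting occurrences (Case 1) or by summing values per group (Case 2). The sketch below illustrates this with the Tissue, Conc A, and Phenotype columns of the example table. The helper names are hypothetical and the code is not taken from BioVinci.

```python
from collections import Counter, defaultdict


def slice_shares(totals):
    """Convert slice totals into fractions of the whole pie."""
    whole = sum(totals.values())
    return {name: value / whole for name, value in totals.items()}


def sum_by_group(values, groups):
    """Case 2: add up a numerical column within each group of a categorical column."""
    totals = defaultdict(float)
    for value, group in zip(values, groups):
        totals[group] += value
    return dict(totals)


tissue = ["Liver", "Pancreas", "Liver", "Kidney", "Liver", "Pancreas",
          "Kidney", "Pancreas", "Kidney", "Liver", "Kidney", "Kidney",
          "Pancreas", "Kidney", "Liver", "Pancreas"]
phenotype = ["Benign"] * 8 + ["Malignant"] * 8
conc_a = [2.05, 2.29, 0.84, 2.92, 1.87, 1.38, 0.43, 1.06,
          1.06, 1.14, 1.02, 1.33, 0.58, 2.56, 1.75, 0.3]

# Case 1: count how many times each tissue appears.
tissue_counts = dict(Counter(tissue))
print(tissue_counts, slice_shares(tissue_counts))

# Case 2: total Conc A per phenotype.
conc_totals = sum_by_group(conc_a, phenotype)
print(conc_totals, slice_shares(conc_totals))
```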

Split mode

You can create multiple pie plots at once by using the Split by placeholder with a categorical variable (column). The software will split the data to multiple pie charts using the grouping information in that new categorical variable.

Options

Text info

You can choose among three options below to select what information should be presented on the plot.

Group (default)

This is the group’s name.

Value

This is the count of each categorical value (Case 1) or the total of each group (Case 2).

Percent

This is the percentage of each group’s values.

The picture below shows a pie chart with all three of these options activated.

Hover info

All options in this setting have the same meanings as those of the Text info. Once selected, the information will appear when you hover your mouse over the graph.

Sort

This option determines whether the categories should be sorted based on the percentage (descending).

Donut chart

If you activate this option, the software will create a hole in the middle of the pie chart. The diameter of this hole is exactly one third of the diameter of the pie.

Bar plot

A bar plot shows comparisons among discrete categories. One axis of the plot shows the specific categories being compared, and the other axis represents a measured value, which can be frequency, absolute value, percentage, mean, or median. We now explore different ways to create a bar plot in BioVinci using the familiar example data below.

Sample    | Tissue   | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F
Patient A | Liver    | Benign    | 2.05   | 1.59   | 1.63   | 24.73  | 3.54   | 1.66
Patient A | Pancreas | Benign    | 2.29   | 0.92   | 1.95   | 21.21  | 8.55   | 0.91
Patient B | Liver    | Benign    | 0.84   | 1.08   | 0.13   | 20.06  | 8.68   | 1.34
Patient C | Kidney   | Benign    | 2.92   | 1.43   | 1.08   | 22.87  | 7.14   | 2.01
Patient C | Liver    | Benign    | 1.87   | 0.99   | 1.11   | 20.28  | 2.96   | 0.73
Patient C | Pancreas | Benign    | 1.38   | 0.82   | 1.46   | 21.88  | 1.32   | 1.04
Patient D | Kidney   | Benign    | 0.43   | 1.75   | 0.06   | 21.2   | 2.93   | 2.06
Patient D | Pancreas | Benign    | 1.06   | 0.79   | 0.23   | 24.88  | 2.66   | 0.78
Patient E | Kidney   | Malignant | 1.06   | 1.41   | 0.38   | 20.21  | 6.52   | 0.52
Patient E | Liver    | Malignant | 1.14   | 0.82   | 0.34   | 21.04  | 5.77   | 1.51
Patient F | Kidney   | Malignant | 1.02   | 1.28   | 0.13   | 21.15  | 1.87   | 3.16
Patient G | Kidney   | Malignant | 1.33   | 1.22   | 0.51   | 21.58  | 1.23   | 0.92
Patient G | Pancreas | Malignant | 0.58   | 0.7    | 0.41   | 24.39  | 9.37   | 1.07
Patient H | Kidney   | Malignant | 2.56   | 1.82   | 0.89   | 20.81  | 1.91   | 0.33
Patient H | Liver    | Malignant | 1.75   | 1.37   | 1.41   | 23.58  | 10.12  | 2.22
Patient H | Pancreas | Malignant | 0.3    | 0.44   | 1.23   | 21.98  | 10.83  | 0.08

Types

Standard bar plot

Case 1: Histogram of Frequency

Let’s first explore how often each patient appears in the measurements of the experiment. By simply dragging the “Sample” variable into the X placeholder, we obtain the histogram below.


If you use only one categorical column as the X, the software will calculate the frequency, or how many times a value appears in that column, and use this data for the Y.

Case 2: Histogram from numerical variables


We now would like to compare the mean concentration of biomarkers A, B, C in all the experiments. By simply dragging 3 variables - Conc A, Conc B, and Conc C - to the Y placeholder, a histogram showing the mean concentration of A, B, and C is created. Error bars can be created using right-click options.

Case 3: Histogram from a numerical column and a categorical column

Another way to create a single-group bar plot is to use both X and Y. This is perhaps the most common case. Pick a numerical column for Y and a categorical column for X. The software will calculate the mean or median (depending on your options) for each group defined by X.

Grouped bar plot

The grouped bar plot has an additional group factor beside the one on the X axis. Sometimes people call it the two-way bar plot, which is helpful to show two-way comparisons. You can construct this kind of plot with either of the two methods below.

Case 1

Use a categorical column as X, a numerical column as Y, and another categorical column as Color.

Case 2

Another way is to use a categorical column as X and multiple numerical columns as Y. The software will classify the values of these numerical columns (in Y) into the categories defined in X, and build the bars for each category.

Split mode

You can create a split bar plot by using an additional categorical column for the Split by placeholder. The software will split the data first, then generate each bar plot separately.

Options

Summary method

This option determines how the software calculates the bar’s height. Users can choose mean, median, or sum (default).

Error bar

This option controls what the error bar represents: the standard deviation (SD) or the standard error of the mean (SEM). It has a visible effect only when there are replicates in each category on the X axis.

Above is a grouped bar plot with Mean and SD.
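To make the difference between SD and SEM concrete, the sketch below computes both for Conc A grouped by Phenotype. It assumes the sample standard deviation (n - 1 denominator) and SEM = SD / sqrt(n), which are the usual conventions; this guide does not state which formula BioVinci uses, and the function name summarize is hypothetical. A group with a single value has no SD, which is why error bars need replicates.

```python
import math
import statistics
from collections import defaultdict


def summarize(categories, values):
    """Mean, SD, and SEM of `values` for each category."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    summary = {}
    for name, vals in groups.items():
        if len(vals) > 1:
            sd = statistics.stdev(vals)      # sample SD (assumed convention)
            sem = sd / math.sqrt(len(vals))  # standard error of the mean
        else:
            sd = sem = None                  # no replicates, so no error bar
        summary[name] = {"mean": statistics.mean(vals), "sd": sd, "sem": sem}
    return summary


phenotype = ["Benign"] * 8 + ["Malignant"] * 8
conc_a = [2.05, 2.29, 0.84, 2.92, 1.87, 1.38, 0.43, 1.06,
          1.06, 1.14, 1.02, 1.33, 0.58, 2.56, 1.75, 0.3]

for name, stats in summarize(phenotype, conc_a).items():
    print(name, stats)
```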

Orientation

This option determines the orientation of the bars. If it is horizontal, the software swaps the X and Y axes.

Position

This option determines the position of bars in a Grouped bar plot.

Side by side (default)

Bars are placed next to each other normally as shown in the previous pictures in this section.

Stack

Bars are placed on top of each other.

Symmetric error bar

This option determines the shape of the error bar, whether it has two arms or only an upper arm.

Show labels

This option shows/hides the value on each bar.

Scale Y

Users choose whether the software should apply a log transform to the bars’ heights.

Line plot

According to Wikipedia, a line chart or line plot is a type of chart that displays information as a series of data points, called 'markers,' connected by straight line segments. A line plot is similar to a scatter plot except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments. A line chart is often used to visualize a trend in data over intervals of time – a time series – thus the line is often drawn chronologically. This section will show you how to create different types of line plots. All the pictures use the example data below.                                   

Time | Phenotype | Conc A | Conc B | Conc C
0    | Benign    | 0.24   | 0.85   | 1.54
0    | Benign    | 0.3    | 0.71   | 1.06
0    | Malignant | 1.39   | 0.86   | 0.89
0    | Malignant | 0.82   | 0.99   | 0.58
6    | Benign    | 0.73   | 1.46   | 1.39
6    | Benign    | 1.08   | 0.47   | 1.26
6    | Malignant | 0.83   | 1.04   | 1.34
6    | Malignant | 1.11   | 0.77   | 0.51
12   | Benign    | 1.08   | 1.17   | 2.48
12   | Benign    | 1.11   | 1.96   | 1.98
12   | Malignant | 1.75   | 1.24   | 1.45
12   | Malignant | 2.25   | 0.29   | 1.94
18   | Benign    | 0.85   | 0.63   | 1.42
18   | Benign    | 1.8    | 1.69   | 1.58
18   | Malignant | 1.94   | 2.57   | 1.01
18   | Malignant | 2.39   | 2.63   | 1.99
24   | Benign    | 2.17   | 0.83   | 3.06
24   | Benign    | 2.61   | 1.44   | 3.21
24   | Malignant | 2.09   | 1.05   | 1.42
24   | Malignant | 1.62   | 2.86   | 2.82

Types

Standard line plot

To see how Conc A changes over time, you can drag Time to X and Conc A to Y, resulting in the following graph.

Case 1

Users can put a numerical column in Y and a column (regardless of its type) in X. The software will draw a line that simply connects all the data points in increasing order of the X coordinate.

But this is clearly not the plot that you want. As your data contain replications for each time, you would like to see how the mean or median of Conc A changes over time. A more proper plot should look like below.

In this case, you should change the X column to categorical. BioVinci will then calculate the mean/median at each value of X and draw a line that goes through each of them. As X is now categorical, the ordering for connecting all the points of the plot will be inferred from their order in the column.  
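The summarizing step described above can be sketched as follows, using the Time and Conc A columns of the example data. This is only an illustration with a hypothetical function name, not BioVinci's implementation. The labels are kept in order of first appearance, as the text describes; sorting the labels as text would instead place "6" after "24".

```python
def mean_by_category(x_labels, y_values):
    """Average y for each X label, keeping labels in order of first appearance."""
    sums, counts = {}, {}
    for label, y in zip(x_labels, y_values):
        sums[label] = sums.get(label, 0.0) + y
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


# Time treated as a categorical column: each value is a label, not a number.
time = ["0"] * 4 + ["6"] * 4 + ["12"] * 4 + ["18"] * 4 + ["24"] * 4
conc_a = [0.24, 0.3, 1.39, 0.82,
          0.73, 1.08, 0.83, 1.11,
          1.08, 1.11, 1.75, 2.25,
          0.85, 1.8, 1.94, 2.39,
          2.17, 2.61, 2.09, 1.62]

points = mean_by_category(time, conc_a)
for label, mean in points.items():
    print(label, round(mean, 4))
```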

Case 2

If users provide just a numerical column to the Y placeholder, the software will use the indices as X.

Multiple line plot

Case 1

Users can pick multiple numerical columns for Y to create multiple lines for the line plot. The X axis will contain the indices.

Case 2

Users can choose a numerical column for Y, any kind of column for X, and a categorical column for Color.

The type of the X column is very important, as described in the Standard line plot section.

Split mode

To create a split line plot, use an additional categorical column for the Split by placeholder. The software will split the data first, then generate each line plot separately.

Options

Summary method

This option determines how the software calculates Y values if X is categorical. Users can switch between mean and median. The suffixes _se and _sd stand for standard error and standard deviation, respectively. Changing these options tells the software to add the error bar on the plot using the provided information.

Error plot

This option controls how to visualize the error bar, whether it should have one or two arms, with or without points. You may not see the error bar change even when you have switched this option because the Summary method is set to be mean by default, which provides no information for the error bar. You can only see this option work when the Summary method is switched to an option with a suffix of _sd or _se.

The figure above illustrates a Case 2 multiple line plot with mean_sd and lower_pointrange.

Add

You can use scattered points (“jitter”) or a box (“boxplot”) to make your line plot more informative.

With point

With these options (“line” and “line + point”), users can choose to show points at every value on X.

Smoothness

This option determines whether the line plot should use spline interpolations instead of straight lines.

Scale X

Users choose whether the software should apply one of several log transforms to the X values of the lines.

Scale Y

Users choose whether the software should apply one of several log transforms to the Y values of the lines.

Scatter plot

According to Wikipedia, a scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two or three variables for a set of data. This section shows how to create different types of scatter plots using the data below.

Time | Phenotype | Conc A | Conc B | Conc C
0    | Benign    | 0.24   | 0.85   | 1.54
0    | Benign    | 0.3    | 0.71   | 1.06
0    | Malignant | 1.39   | 0.86   | 0.89
0    | Malignant | 0.82   | 0.99   | 0.58
6    | Benign    | 0.73   | 1.46   | 1.39
6    | Benign    | 1.08   | 0.47   | 1.26
6    | Malignant | 0.83   | 1.04   | 1.34
6    | Malignant | 1.11   | 0.77   | 0.51
12   | Benign    | 1.08   | 1.17   | 2.48
12   | Benign    | 1.11   | 1.96   | 1.98
12   | Malignant | 1.75   | 1.24   | 1.45
12   | Malignant | 2.25   | 0.29   | 1.94
18   | Benign    | 0.85   | 0.63   | 1.42
18   | Benign    | 1.8    | 1.69   | 1.58
18   | Malignant | 1.94   | 2.57   | 1.01
18   | Malignant | 2.39   | 2.63   | 1.99
24   | Benign    | 2.17   | 0.83   | 3.06
24   | Benign    | 2.61   | 1.44   | 3.21
24   | Malignant | 2.09   | 1.05   | 1.42
24   | Malignant | 1.62   | 2.86   | 2.82

Types

Single group scatter plot

Case 1

This is the basic scatter plot. All you need is X and Y. Any kind of columns will work.

Case 2

This is the 3D scatter plot. It is similar to Case 1; you just need to add one more column to Z. However, all of the X, Y, and Z data must be numerical.

Multiple group scatter plot

Users can add other grouping factors to the scatter plot by using Color or Size. The two options require a categorical column and a numerical column, respectively. They work on both 2D and 3D scatter plots.

Split mode

To create a split scatter plot, you can drag an additional categorical column to the Split by placeholder. The software will split the data first, then generate each scatter plot separately. This function only works for a 2D scatter plot. (In this example, we split the data by the “Time” column. On the left side of the screen, under “Source”, we changed it to alpha A-Z data so that it could be dragged over to “Split by”.)

Tooltip

Tooltip controls the information that pops up when users hover the mouse over an item. By default, it displays the information of the columns that contribute to the plot. Users can make it more informative by dragging any column to this placeholder.

Options

Regression type

This option performs a regression on the plot based on the data points. If there is a group factor at Color, the software will generate a regression for each group. Regression is currently only available for 2D scatter plots. In the picture above, we applied linear regression on a scatter plot of Conc A and Conc B, with Phenotype as Color.
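For the linear case, fitting one line per Color group can be sketched as below. The sketch assumes ordinary least squares, which is the usual meaning of linear regression; the guide does not spell out BioVinci's fitting method, and the function names are hypothetical.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_group(xs, ys, groups):
    """Fit a separate line for each group label, as done for each Color group."""
    by_group = defaultdict(lambda: ([], []))
    for x, y, g in zip(xs, ys, groups):
        by_group[g][0].append(x)
        by_group[g][1].append(y)
    return {g: fit_line(gx, gy) for g, (gx, gy) in by_group.items()}


# First eight rows of the example data: Conc A as X, Conc B as Y, Phenotype as Color.
conc_a = [0.24, 0.3, 1.39, 0.82, 0.73, 1.08, 0.83, 1.11]
conc_b = [0.85, 0.71, 0.86, 0.99, 1.46, 0.47, 1.04, 0.77]
phenotype = ["Benign", "Benign", "Malignant", "Malignant"] * 2

for group, (slope, intercept) in fit_per_group(conc_a, conc_b, phenotype).items():
    print(group, round(slope, 3), round(intercept, 3))
```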

95% Confidence Ellipse

Users can also add a 95% confidence ellipse based on the data points, computed with the stat_ellipse function of ggplot2. If there is a group factor at Color, the software will generate an ellipse for each group. The 95% confidence ellipse is only available for 2D scatter plots. In the picture above, we applied the T-distribution option on a scatter plot of Conc A and Conc B, with Phenotype as Color.

95% CI for regression

This option determines whether the regression line is drawn with its two 95% confidence interval curves. It has an effect only when there is a regression line on the plot.

Scale X

Users choose whether the software should apply one of several log transforms to the X values of the points.

Scale Y

Users choose whether the software should apply one of several log transforms to the Y values of the points.

Violin plot

A violin plot is a common alternative to the box plot. Instead of showing the data quartiles with whiskers, it reveals the full distribution with a kernel density curve on each side. This feature gives the plot its violin-like shape and makes it much more informative. In this section, we will show you how to create different types of violin plots using the example data below:

Sample    | Tissue   | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F
Patient A | Liver    | Benign    | 2.05   | 1.59   | 1.63   | 24.73  | 3.54   | 1.66
Patient A | Pancreas | Benign    | 2.29   | 0.92   | 1.95   | 21.21  | 8.55   | 0.91
Patient B | Liver    | Benign    | 0.84   | 1.08   | 0.13   | 20.06  | 8.68   | 1.34
Patient C | Kidney   | Benign    | 2.92   | 1.43   | 1.08   | 22.87  | 7.14   | 2.01
Patient C | Liver    | Benign    | 1.87   | 0.99   | 1.11   | 20.28  | 2.96   | 0.73
Patient C | Pancreas | Benign    | 1.38   | 0.82   | 1.46   | 21.88  | 1.32   | 1.04
Patient D | Kidney   | Benign    | 0.43   | 1.75   | 0.06   | 21.2   | 2.93   | 2.06
Patient D | Pancreas | Benign    | 1.06   | 0.79   | 0.23   | 24.88  | 2.66   | 0.78
Patient E | Kidney   | Malignant | 1.06   | 1.41   | 0.38   | 20.21  | 6.52   | 0.52
Patient E | Liver    | Malignant | 1.14   | 0.82   | 0.34   | 21.04  | 5.77   | 1.51
Patient F | Kidney   | Malignant | 1.02   | 1.28   | 0.13   | 21.15  | 1.87   | 3.16
Patient G | Kidney   | Malignant | 1.33   | 1.22   | 0.51   | 21.58  | 1.23   | 0.92
Patient G | Pancreas | Malignant | 0.58   | 0.7    | 0.41   | 24.39  | 9.37   | 1.07
Patient H | Kidney   | Malignant | 2.56   | 1.82   | 0.89   | 20.81  | 1.91   | 0.33
Patient H | Liver    | Malignant | 1.75   | 1.37   | 1.41   | 23.58  | 10.12  | 2.22
Patient H | Pancreas | Malignant | 0.3    | 0.44   | 1.23   | 21.98  | 10.83  | 0.08

Types

Standard violin plot

This violin plot has a categorical X axis, a numerical Y axis, and no other grouping factor(s).

Case 1

        

This kind of violin plot requires one or more numerical columns for the Y placeholder. Each column in Y will generate one violin on the plot.

Case 2

You can also use a categorical column as X and a numerical column as Y. Each group defined in X will create a violin using the corresponding values in Y.

Grouped violin plot

This violin has an additional grouping factor, which determines the color of the violin plot.

Case 1

You can create this plot by using a categorical column for X and multiple numerical columns for Y.

Case 2

You can use a categorical column for X, a numerical column for Y, and a categorical column for Color. For each group defined by X, the software will create subgroups defined by Color.

Split mode

You can create a split violin plot by using an additional categorical column for the Split by placeholder. The software will split the data first, then generate each violin plot separately.

Options

Orientation

This option determines the orientation of the violin. It basically swaps the X and Y axes.

Points

This option allows users to show or hide all the data points or just outliers.

Point position

This option controls positioning the data points, whether they should lie right on the violin or next to it. You will not see any changes unless you activate the Points option.

Scale mode

Scale mode controls how to calculate the widths of the violins.

True density

The width of each violin depends on the number of data points.

Equal area

All the violins within the same subgroup (not group) will have the same area.

Show box

Activate this option to add a box to each violin.

Binary blend

Activate this option to fuse two violins into one. It can only work with a Grouped violin plot and with only two subgroups.

Scale Y

This option allows you to apply one of the log transforms for the Y values before generating the violin plot.

Box plot

A box plot is a method for graphically depicting groups of numerical data through their quartiles. It reveals more information about the distribution than the bar plot does. This section illustrates how to create different kinds of box plots using the following data.

Sample    | Tissue   | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F
Patient A | Liver    | Benign    | 2.05   | 1.59   | 1.63   | 24.73  | 3.54   | 1.66
Patient A | Pancreas | Benign    | 2.29   | 0.92   | 1.95   | 21.21  | 8.55   | 0.91
Patient B | Liver    | Benign    | 0.84   | 1.08   | 0.13   | 20.06  | 8.68   | 1.34
Patient C | Kidney   | Benign    | 2.92   | 1.43   | 1.08   | 22.87  | 7.14   | 2.01
Patient C | Liver    | Benign    | 1.87   | 0.99   | 1.11   | 20.28  | 2.96   | 0.73
Patient C | Pancreas | Benign    | 1.38   | 0.82   | 1.46   | 21.88  | 1.32   | 1.04
Patient D | Kidney   | Benign    | 0.43   | 1.75   | 0.06   | 21.2   | 2.93   | 2.06
Patient D | Pancreas | Benign    | 1.06   | 0.79   | 0.23   | 24.88  | 2.66   | 0.78
Patient E | Kidney   | Malignant | 1.06   | 1.41   | 0.38   | 20.21  | 6.52   | 0.52
Patient E | Liver    | Malignant | 1.14   | 0.82   | 0.34   | 21.04  | 5.77   | 1.51
Patient F | Kidney   | Malignant | 1.02   | 1.28   | 0.13   | 21.15  | 1.87   | 3.16
Patient G | Kidney   | Malignant | 1.33   | 1.22   | 0.51   | 21.58  | 1.23   | 0.92
Patient G | Pancreas | Malignant | 0.58   | 0.7    | 0.41   | 24.39  | 9.37   | 1.07
Patient H | Kidney   | Malignant | 2.56   | 1.82   | 0.89   | 20.81  | 1.91   | 0.33
Patient H | Liver    | Malignant | 1.75   | 1.37   | 1.41   | 23.58  | 10.12  | 2.22
Patient H | Pancreas | Malignant | 0.3    | 0.44   | 1.23   | 21.98  | 10.83  | 0.08

Types

Standard box plot

Case 1

You can create boxplots by putting one or more numerical columns into the Y placeholder. Each column in Y generates one boxplot.

Case 2

You can also use a categorical column as X and a numerical column as Y. Each group in X will create a box using the corresponding values in Y.
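The numbers behind a single box can be sketched as follows. Quartile definitions vary between programs; this example assumes Python's "inclusive" quantile method and the common 1.5 x IQR rule for flagging outliers. Neither is confirmed to match BioVinci, and the function name box_summary is hypothetical.

```python
import statistics


def box_summary(values):
    """Quartiles of `values` plus the points outside the 1.5 * IQR fences."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr   # assumed outlier rule
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Conc A values of the eight Benign rows in the example table.
benign_conc_a = [2.05, 2.29, 0.84, 2.92, 1.87, 1.38, 0.43, 1.06]
print(box_summary(benign_conc_a))
```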

Grouped box plot

This box plot has an additional grouping factor, which determines the color of the box.

Case 1

You can create this plot by using a categorical column for X and multiple numerical columns for Y.

Case 2

You can use a categorical column for X, a numerical column for Y, and a categorical column for Color. For each group defined by X, the software will create subgroups defined by the Color.

Split mode

You can create a split box plot by using an additional categorical column for the Split by placeholder. The software will split the data first, then generate each box plot separately (the Theme will automatically be changed to “Classic”).

Options

Orientation

This option determines the orientation of the boxes. It basically swaps the X and Y axes.

Points

You can choose to show or hide all the data points, or show just the outliers.

Point position

This option controls the position of the data points, whether they should lie right on the box or next to it. You will not see any changes unless you activate the Points option.

Mean and SD

This option adds annotations that show the mean and SD of each box. The picture above shows what these annotations look like.

Scale Y

This option allows applying one of the log transforms for the Y values before generating the box plot.

Venn diagram

A Venn diagram is a powerful visualization tool that describes the relationship among a finite number of sets and their intersections. The size of the circles in this diagram approximately represents the cardinality of those sets. This section shows you the way to create a Venn diagram. All the pictures come from the example data below.

                                   

Patient ID | Group              | Gender | Age group | Respond 1 | Respond 2 | Respond 3
1          | Alzheimer          | Male   | Adult     | 2         | 3         | 9
2          | Parkinson          | Male   | Adult     | 3         | 6         | 6
3          | Multiple sclerosis | Female | Teenager  | 3         | 7         | 6
4          | Alzheimer          | Male   | Elder     | 3         | 4         | 8
5          | Parkinson          | Male   | Adult     | 3         | 5         | 9
6          | Multiple sclerosis | Female | Adult     | 2         | 5         | 6
7          | Multiple sclerosis | Male   | Adult     | 5         | 4         | 8
8          | Multiple sclerosis | Female | Teenager  | 5         | 6         | 8
9          | Alzheimer          | Female | Elder     | 3         | 3         | 5
10         | Alzheimer          | Female | Elder     | 3         | 3         | 9
11         | Parkinson          | Male   | Elder     | 2         | 6         | 6

Inputs

You can input the data in two ways.

Type 1

You can provide a column of any kind to the Value placeholder and a categorical column to the Color placeholder. The software calculates intersections among all the groups and generates the Venn diagram.
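As an illustration of Type 1, the sketch below builds one set per group from a Value column and a Color column and then intersects every pair of sets. It uses Respond 2 as Value and Group as Color from the example table. The helper names are hypothetical and the code is not BioVinci's implementation.

```python
from itertools import combinations


def sets_by_group(values, groups):
    """Collect the distinct values of each group into a set."""
    result = {}
    for value, group in zip(values, groups):
        result.setdefault(group, set()).add(value)
    return result


def pairwise_intersections(sets):
    """Intersection of every pair of groups, keyed by the pair of names."""
    return {(a, b): sets[a] & sets[b] for a, b in combinations(sets, 2)}


group = ["Alzheimer", "Parkinson", "Multiple sclerosis", "Alzheimer",
         "Parkinson", "Multiple sclerosis", "Multiple sclerosis",
         "Multiple sclerosis", "Alzheimer", "Alzheimer", "Parkinson"]
respond_2 = [3, 6, 7, 4, 5, 5, 4, 6, 3, 3, 6]

sets = sets_by_group(respond_2, group)
overlaps = pairwise_intersections(sets)
for pair, common in overlaps.items():
    print(pair, sorted(common))
```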

Type 2

You can drag multiple columns of the same kind to the Value placeholder. In this case, each column will form a group in the Venn diagram.

Types

Venn diagrams can belong to one of the following types, depending on the intersection of the input sets.

Partially Overlapped Venn Diagrams

This is the Venn diagram shown in the previous pictures. Each group contains items that are in common with at least one other group.

Non-intersecting Diagrams

If the sets have no intersection, the software will generate separate circles. In the picture below, the two groups, male and female, have nothing in common.

Identical Sets

If two sets contain exactly the same items, two circles will overlap each other. In the picture below, we have two equal sets, male and female.

Subsets

If one set includes the other(s), the circle(s) of the subset(s) will lie completely inside that of the superset. In the picture below, Multiple Sclerosis eclipses two equal sets, Alzheimer and Parkinson.

Options

Show text

This option determines what information to show on the Venn diagram, including the group name (Set label), group size (Set size), and the intersection size (Intersection size). The picture below has all three options activated.

Multiset

By default, BioVinci first removes duplicate values from each set and then constructs the Venn diagram. Thus, the intersection will not contain any repeated values.

E.g., given two sets, A = {1, 2, 2, 2, 3} and B = {2, 2, 3, 3}, the software will create A’ = {1, 2, 3} and B’ = {2, 3}, and the intersection will be {2, 3}.

Users can turn off this behavior by turning on the Multiset option. Given the two multisets A = {1, 2, 2, 2, 3} and B = {2, 2, 3, 3}, the intersection will be {2, 2, 3}.
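The difference between the two modes can be reproduced outside BioVinci. The snippet below is a minimal Python sketch, not BioVinci's own code: it uses the sets A and B from the example, computes the default intersection with plain sets, and computes the multiset intersection with collections.Counter.

```python
from collections import Counter

A = [1, 2, 2, 2, 3]
B = [2, 2, 3, 3]

# Default behaviour: duplicates are dropped first, so a plain set
# intersection applies.
set_intersection = sorted(set(A) & set(B))

# Multiset option: each value is kept as many times as it occurs in
# BOTH inputs (the minimum of the two counts).
multiset_intersection = sorted((Counter(A) & Counter(B)).elements())

print(set_intersection)       # [2, 3]
print(multiset_intersection)  # [2, 2, 3]
```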

Ignore missing values

This option defines whether empty cells should be ignored or counted as a value.


Statistical functions

A. Basic statistics

For impatient people

1. Prepare your datasheet

Sample

Tissue

Phenotype

Conc A

Conc B

Conc C

Conc D

Conc E

Conc F

Patient A

Liver

Benign

2.05

1.59

1.63

24.73

3.54

1.66

Patient A

Pancreas

Benign

2.29

0.92

21.21

8.55

0.91

Patient B

Liver

Benign

0.84

1.08

20.06

8.68

1.34

Patient C

Kidney

Benign

1.43

1.08

22.87

7.14

2.01

Patient C

Liver

Benign

0.99

1.11

20.28

2.96

0.73

Patient C

Pancreas

Benign

1.38

0.82

1.46

21.88

1.32

1.04

Patient D

Kidney

Benign

0.43

1.75

0.06

21.2

2.93

2.06

Patient D

Pancreas

Benign

1.06

0.23

24.88

2.66

0.78

Patient E

Kidney

Malignant

1.06

1.41

0.38

20.21

6.52

0.52

Patient E

Liver

Malignant

1.14

0.82

0.34

21.04

5.77

1.51

Patient F

Kidney

Malignant

1.02

1.28

0.13

21.15

1.87

3.16

Patient G

Kidney

Malignant

1.33

1.22

0.51

1.23

0.92

Patient G

Pancreas

Malignant

0.58

0.41

24.39

9.37

1.07

Patient H

Kidney

Malignant

2.56

1.82

0.89

20.81

1.91

0.33

Patient H

Liver

Malignant

1.75

1.37

1.41

23.58

10.12

Patient H

Pancreas

Malignant

0.3

0.44

1.23

21.98

0.08

2. Import it

3. Choose the Analysis tab and opt for the Basic statistics function

4. Drag your data to the placeholder, then hit Run

5. Get the results

Inputs

The Basic statistics function can process any kind of input: categorical columns, numerical columns, or matrices. Drag one or more columns, or a matrix that you want to investigate, to the A column/matrix placeholder.

Parameters

Transpose is the only option available for basic statistics. It transposes your matrix before the analyses are conducted, so it only affects matrix input.

Results

The types of your inputs determine how the results will be presented. If you input a matrix, the function will be applied to each column of the matrix. 

Visualizations

Histogram (for a numerical column)

If you input a numerical column, BioVinci will display the distribution of your data values in a histogram.

Bar plot of frequency (for a categorical column)

If you input a categorical column, you will see the frequency of your categorical data values presented in a bar plot.

Mixed plot (for a matrix)

If your inputs are in a matrix, each column will form a separate subplot (as shown in the figure above) that can be either a histogram or a bar plot, corresponding to the type of that column. This plot only appears when the input is a matrix and has fewer than 30 columns.

Heatmap of missing values (for a matrix)

This heatmap displays the types of data value in each cell of your datasheet (blue for categorical data, red for numerical data, and black for missing values).

Tables

Descriptive Statistics Table

This table presents all the descriptive statistics for each input numerical column.

Table of frequency

This table shows the frequency of each categorical data value, including the missing value. Therefore, it only appears if the input has at least one categorical column.

For a matrix

When the input is a matrix, BioVinci will generate either a descriptive statistics table or a frequency table for each column, depending on the type of the column.
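The per-column logic described above can be sketched in Python. This is only an illustration of the idea (a frequency table for a categorical column, descriptive statistics for a numerical column). The function name summarize_column, the small table, and the choice of statistics are assumptions, and BioVinci may report a different set of statistics.

```python
import statistics
from collections import Counter

def summarize_column(values):
    """Frequency table for categorical data, descriptive statistics for numbers."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    if present and all(isinstance(v, (int, float)) for v in present):
        return {
            "type": "numerical",
            "n": len(present),
            "missing": missing,
            "mean": statistics.mean(present),
            "median": statistics.median(present),
            "sd": statistics.stdev(present) if len(present) > 1 else None,
            "min": min(present),
            "max": max(present),
        }
    counts = Counter(present)
    if missing:
        counts["(missing)"] = missing  # missing cells are counted as well
    return {"type": "categorical", "frequency": dict(counts)}

# Hypothetical table: None marks an empty cell.
table = {
    "Tissue": ["Liver", "Pancreas", "Liver", "Kidney", None],
    "Conc A": [2.05, 2.29, 0.84, None, 0.99],
}

# For a matrix, the summary is computed column by column.
summary = {name: summarize_column(col) for name, col in table.items()}
for name, result in summary.items():
    print(name, result)
```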

B. Normality test

In statistics, a normality test helps analysts evaluate whether a dataset is well modeled by a normal distribution. It is a prerequisite for many parametric tests that assume normally distributed data.

For impatient people

1. Here are the example data

Group   Value
A       0.28
A       0.02
A       0.28
A       0.85
A       0.61
A       0.98
A       0.72
A       1.3
A       0.29
A       0.92
B       0.32
B       0.69
B       0.51
B       0.9
B       0.75
B       0.69
B       0.57
B       0.8
B       0.45
B       0.64

2. Import your data to BioVinci

3. Choose the Analysis tab and opt for Normality test

4. Drag your data to the Value placeholder

5. Get the result

Inputs

Below are all the valid data structures for a normality test in BioVinci.

Type 1

Group   Value
A       0.28
A       0.02
A       0.28
A       0.85
A       0.61
A       0.98
A       0.72
A       1.3
A       0.29
A       0.92
B       0.32
B       0.69
B       0.51
B       0.9
B       0.75
B       0.69
B       0.57
B       0.8
B       0.45
B       0.64

This is the standard format with only one numerical column. The categorical column is optional unless you want to perform many normality tests at once.

Type 2

If you want to perform the normality test on an entire matrix, you must put all values into a single column.

                                   

ID   1      2      3      4
1    0.04   0.28   0.43   0.95
2    0.44   0.85   0.17   0.14
3    0.84   0.96   0.92   0.84
4    0.05   0.69   0.64   0.12
5    0.84   0.7    0.88   0.02

With this data structure, you must set the number of replications to 1 when you import your data.

You also have to include the ID column to run the normality test (though the column does not convey much meaningful information).

After that, you can run the normality test in the same way as the structure mentioned in Type 1.  

You can also use this approach for a dataset with multiple samples, as long as there are no replications in each sample.

Type 3

You have a dataset with multiple samples, and each of them has replications.

                                   

ID

Sample 1

Sample 2

Sample 3

1

0.87

0.7

0.47

0.4

0.11

0.05

2

0.03

0.84

0.91

0.26

0.48

0.23

3

0.79

0.32

0.82

0.25

0.96

0.4

4

0.52

0.08

0.75

0.47

0.71

0.43

5

0.48

0.62

0.91

0.47

0.29

0.07

With this data structure, you must provide the number of replications when you import the data (which is 2 in this example). You also have to include the ID column to run the normality test (though the column does not convey much meaningful information). After that, you can run the normality test in the same way as the structure mentioned in Type 1.

If you want to perform the normality test for each sample, drag the Factor 2 column to the Group placeholder. (This column is automatically generated by BioVinci after importing your data and contains your sample names.)

Parameters

There are no parameters for this function. However, you should know that the software performs 4 types of normality tests per run.

Pearson chi-square normality test

This test expects that all observations are independent and have equal chance to be selected from a fixed distribution/population. The sample size should be larger than 1000.

One-sample Kolmogorov-Smirnov test

This test assesses whether two underlying one-dimensional probability distributions differ. In this case, the first distribution comes from the user’s input (after standardization), and the second one is the normal distribution. It is a nonparametric test that is sensitive to differences in both the location and the shape of the empirical cumulative distribution functions of the two samples.

Jarque Bera Test

The test assesses whether the input data set has the skewness and kurtosis matching a normal distribution. It is suitable for samples that have more than 1000 observations.

Shapiro-Wilk normality test

This is a nonparametric test of normality. It is more powerful than the Kolmogorov-Smirnov test but much less accurate if there are ties in the data set. The bigger the sample size, the higher the chance for a false positive result. Thus, the software will skip this test if the sample size is larger than 5000.
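To make one of the four tests concrete, here is a minimal Python sketch of the Jarque-Bera statistic described above. It is an illustration under stated assumptions, not BioVinci's implementation. The function name jarque_bera is hypothetical, the moments use the population (divide by n) form, and the p value relies on the asymptotic chi-square distribution with 2 degrees of freedom, which is only a rough approximation for a sample as small as group A.

```python
import math

def jarque_bera(values):
    """Return the Jarque-Bera statistic and its asymptotic p value."""
    n = len(values)
    mean = sum(values) / n
    # Central moments (population form, dividing by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-jb / 2)
    return jb, p_value

# Group A from the example data.
group_a = [0.28, 0.02, 0.28, 0.85, 0.61, 0.98, 0.72, 1.3, 0.29, 0.92]
jb, p_value = jarque_bera(group_a)
print(f"JB = {jb:.4f}, p = {p_value:.4f}")
```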

Result

Take the following data for example.

Group   Value
A       0.28
A       0.02
A       0.28
A       0.85
A       0.61
A       0.98
A       0.72
A       1.3
A       0.29
A       0.92
B       0.32
B       0.69
B       0.51
B       0.9
B       0.75
B       0.69
B       0.57
B       0.8
B       0.45
B       0.64

Visualizations

Histogram

The histogram shows the distribution of the input data set and a simulated normal distribution that has the same mean and standard deviation. If Group is provided, the histogram will show the distribution of all the groups instead.

Cumulative distribution function

It shows the empirical cumulative distribution function of the input data set and the simulated normal distribution. If Group is provided, it will show the cumulative distribution of all the groups instead.

Tables

Statistics

The table shows the statistics from four different normality tests. The annotation depends on the p value: * for p < 0.05, ** for p < 0.01, *** for p < 0.001, and **** for p < 0.0001. Some journals may have a different requirement for p value annotation. Thus, users can customize the number of asterisks in the edit mode. If Group is provided, the software will generate the table for each group.
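The default annotation rule can be written as a small Python helper. This is a sketch of the thresholds listed above. The function name significance_stars and the thresholds argument are illustrative, not part of BioVinci.

```python
def significance_stars(p_value, thresholds=(0.05, 0.01, 0.001, 0.0001)):
    """Return one asterisk for every threshold that the p value falls below."""
    return "*" * sum(p_value < t for t in thresholds)

for p in (0.2, 0.03, 0.005, 0.0005, 0.00001):
    print(p, repr(significance_stars(p)))
```

A journal with different requirements corresponds to a different thresholds tuple, which mirrors the customization available in the edit mode.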

Cumulative distribution function

The table shows the information that helps construct the cumulative distribution function plot.

C. Correlation

For impatient people

1. This is the dataset

                                   

Conc A   Conc B   Conc C   Conc D   Conc E   Conc F
2.05     1.59     1.63     24.73    3.54     1.66
2.29     0.92     1.95     21.21    8.55     0.91
0.84     1.08     0.13     20.06    8.68     1.34
2.92     1.43     1.08     22.87    7.14     2.01
1.87     0.99     1.11     20.28    2.96     0.73
1.38     0.82     1.46     21.88    1.32     1.04
0.43     1.75     0.06     21.2     2.93     2.06
1.06     0.79     0.23     24.88    2.66     0.78
1.06     1.41     0.38     20.21    6.52     0.52
1.14     0.82     0.34     21.04    5.77     1.51
1.02     1.28     0.13     21.15    1.87     3.16
1.33     1.22     0.51     21.58    1.23     0.92
0.58     0.7      0.41     24.39    9.37     1.07
2.56     1.82     0.89     20.81    1.91     0.33
1.75     1.37     1.41     23.58    10.12    2.22
0.3      0.44     1.23     21.98    10.83    0.08

2. Import it

3. Choose the Analysis tab and opt for Correlation

4. Provide the inputs

5. Get the result

Inputs

This function only accepts the data structure as follows, where each sample is presented in a separate column.

Conc A   Conc B   Conc C   Conc D   Conc E   Conc F
2.05     1.59     1.63     24.73    3.54     1.66
2.29     0.92     1.95     21.21    8.55     0.91
0.84     1.08     0.13     20.06    8.68     1.34
2.92     1.43     1.08     22.87    7.14     2.01
1.87     0.99     1.11     20.28    2.96     0.73
1.38     0.82     1.46     21.88    1.32     1.04
0.43     1.75     0.06     21.2     2.93     2.06
1.06     0.79     0.23     24.88    2.66     0.78
1.06     1.41     0.38     20.21    6.52     0.52
1.14     0.82     0.34     21.04    5.77     1.51
1.02     1.28     0.13     21.15    1.87     3.16
1.33     1.22     0.51     21.58    1.23     0.92
0.58     0.7      0.41     24.39    9.37     1.07
2.56     1.82     0.89     20.81    1.91     0.33
1.75     1.37     1.41     23.58    10.12    2.22
0.3      0.44     1.23     21.98    10.83    0.08

To calculate the correlation between two variables, you can drag them to the Variable(s) and 2nd variable (optional) placeholders. In the figure below, the user calculates the correlation between Conc A and Conc B.

To calculate the correlation between more than two variables (pairwise), you can drag the whole matrix to the Variable(s) placeholder. In this case, BioVinci will automatically exclude all categorical columns prior to calculation.

Parameters

Method

This defines the method to calculate the correlation.

Pearson

The method assumes that both variables are normally distributed.

Kendall

The correlation depends on the difference between the number of concordant pairs and discordant pairs without any assumptions about the distribution.

Spearman

The correlation depends on the difference between the ranks of corresponding variables without any assumptions about the distribution.
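The three methods can be compared with a short, standard-library-only Python sketch. It is meant to illustrate the definitions, not to reproduce BioVinci's numbers: the function names are illustrative, ties receive average ranks in the Spearman calculation, and the Kendall version shown is the simple tau-a, whereas statistical packages often report the tie-corrected tau-b.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant pairs - discordant pairs) / all pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)

# First rows of Conc A and Conc B from the example dataset.
conc_a = [2.05, 2.29, 0.84, 2.92, 1.87, 1.38]
conc_b = [1.59, 0.92, 1.08, 1.43, 0.99, 0.82]
print(pearson(conc_a, conc_b), spearman(conc_a, conc_b), kendall_tau_a(conc_a, conc_b))
```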

Transpose

This determines whether the software should transpose the matrix first. Thus, it only works when the input is a matrix.

Result

All the pictures below illustrate the following data set. All parameters are set as default.

Conc A   Conc B   Conc C   Conc D   Conc E   Conc F
2.05     1.59     1.63     24.73    3.54     1.66
2.29     0.92     1.95     21.21    8.55     0.91
0.84     1.08     0.13     20.06    8.68     1.34
2.92     1.43     1.08     22.87    7.14     2.01
1.87     0.99     1.11     20.28    2.96     0.73
1.38     0.82     1.46     21.88    1.32     1.04
0.43     1.75     0.06     21.2     2.93     2.06
1.06     0.79     0.23     24.88    2.66     0.78
1.06     1.41     0.38     20.21    6.52     0.52
1.14     0.82     0.34     21.04    5.77     1.51
1.02     1.28     0.13     21.15    1.87     3.16
1.33     1.22     0.51     21.58    1.23     0.92
0.58     0.7      0.41     24.39    9.37     1.07
2.56     1.82     0.89     20.81    1.91     0.33
1.75     1.37     1.41     23.58    10.12    2.22
0.3      0.44     1.23     21.98    10.83    0.08

Visualizations

Scatter plot

This scatter plot uses two variables to show the data points and a diagonal line spanning from the lowest to the highest values of X and Y. Thus, it only appears when the input has only two numerical columns.

Correlation table

BioVinci will show the pairwise correlation in a heatmap, which only appears when the input is a matrix.

Covariance table

The heatmap shows the pairwise covariance between all columns. It only appears when the input is a matrix.

Tables

Statistics

This table of correlation and covariance only appears if the input has only two numerical columns.

Correlation table

This table provides the information for the Correlation heatmap.

Covariance table

This table provides the information for the Covariance heatmap. It does not show the covariance of a variable with itself (the diagonal line), because it is not meaningful and distorts the color scale of the heatmap.

D. Linear regression, Multiple linear regression, and Polynomial regression

For impatient people

1. This is the dataset         

Gene A   Gene B   Gene C   Gene D   Gene E
6.17     3.47     2.66     1.49     0.84
6.14     3.4      2.8      1.47     0.7
4.31     4        2.03     1.94     0.16
5.28     3.09     2.25     1.76     0.78
4.96     3.53     2.85     1.43     0.22
6.87     3.82     2.57     1.63     1
6.19     3.3      2.24     1.98     0.85
5.89     3.79     2.25     1.73     0.66
4.4      3.7      2.09     1.71     0.17
5.72     3.54     2.45     1.87     0.58
4.42     3.26     2.55     1.53     0.4
4.69     3.94     2.06     1.04     0.75

2. Import it

3. Choose the Analysis tab and opt for Regression

4. Provide the inputs

5. Get the result

Inputs

The function requires a numerical column for Y and at least one numerical column for X. So your data structure should have at least two numeric columns. We use the data below to illustrate the two types of regression.

Gene A   Gene B   Gene C   Gene D   Gene E
6.17     3.47     2.66     1.49     0.84
6.14     3.4      2.8      1.47     0.7
4.31     4        2.03     1.94     0.16
5.28     3.09     2.25     1.76     0.78
4.96     3.53     2.85     1.43     0.22
6.87     3.82     2.57     1.63     1
6.19     3.3      2.24     1.98     0.85
5.89     3.79     2.25     1.73     0.66
4.4      3.7      2.09     1.71     0.17
5.72     3.54     2.45     1.87     0.58
4.42     3.26     2.55     1.53     0.4
4.69     3.94     2.06     1.04     0.75

Case 1: Simple Linear Regression

To run this regression, drag your two variables to the X and Y placeholders.  

Case 2: Multiple Linear Regression

To run a regression with multiple X, you can put all the X into one single matrix using the Create New Variables function (please refer to B. Create New Variables in the Other functions part). After that, you just need to drag the newly created variable to the X placeholder. Alternatively, you can drag all the columns, one by one, to the X placeholder.

The picture below shows how to run the regression between a Y (Gene A) and multiple X (Gene B to E) in both ways.

Parameters

Polynomial

This is the degree of the polynomial. By default, this option is set to 1, which corresponds to linear regression. The software can handle degrees up to 5.

Include intercept

This option determines whether the model includes an intercept term. When it is turned off, the regression is forced through the origin (0, 0).

Transpose matrix

This option determines whether the software should transpose the X matrix first. Thus, it only works when the input for X is a matrix.
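The effect of the Include intercept option on a simple linear regression (polynomial degree 1) can be sketched with the ordinary least squares formulas. This is a minimal Python illustration, not BioVinci's implementation. The function name fit_line is hypothetical, and the data are the first rows of Gene B (X) and Gene A (Y) from the example.

```python
def fit_line(x, y, include_intercept=True):
    """Least squares fit of y = slope * x + intercept."""
    n = len(x)
    if include_intercept:
        mx, my = sum(x) / n, sum(y) / n
        slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
                 / sum((a - mx) ** 2 for a in x))
        intercept = my - slope * mx
    else:
        # Without an intercept the line is forced through the origin (0, 0).
        slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        intercept = 0.0
    return slope, intercept

gene_b = [3.47, 3.4, 4, 3.09, 3.53, 3.82]   # X
gene_a = [6.17, 6.14, 4.31, 5.28, 4.96, 6.87]  # Y

for flag in (True, False):
    slope, intercept = fit_line(gene_b, gene_a, include_intercept=flag)
    # Residuals: observed Y minus predicted Y, as shown in the Residuals plot.
    residuals = [y - (slope * x + intercept) for x, y in zip(gene_b, gene_a)]
    print(flag, round(slope, 4), round(intercept, 4), [round(r, 3) for r in residuals])
```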

Result

Visualizations

Basic

This is a scatter plot with the fitted regression curve, which only appears with simple linear regression (Case 1 mentioned above). The shape of the curve depends on the polynomial degree.

Residuals

This plot shows the differences between the observed Y and the predicted Y.

Coefficients

This bar plot shows the p values of coefficients.

Tables

Statistics

This table shows the statistics to evaluate a regression model.

Coefficient

This table shows the information of each coefficient, including the intercept (if it exists).

Function

This table shows the regression function.

Fitted table

This table shows the predicted (fitted) values and the differences with the observed values.

E. t-test, Paired t-test, Welch t-test, Wilcoxon Rank Sum test, Wilcoxon Signed Rank test, and F test

For impatient people

  1. This is the dataset

Group 1   Group 2
1.15      0.67
0.75      0.88
0.84      0.05
1.02      0.78
1.01      0.99
0.88      0.32
0.94      0.2
1.04      0.2
1.09      0.18
1         0.05

  2. Import it

  3. Choose the Analysis tab and opt for the function

  4. Provide inputs

  5. Get the result

Inputs

All the data structures below are valid for this test.

Type 1: The table of two numerical columns

The table has two numerical columns, which are the two samples that you want to compare.

Group 1   Group 2
1.15      0.67
0.75      0.88
0.84      0.05
1.02      0.78
1.01      0.99
0.88      0.32
0.94      0.2
1.04      0.2
1.09      0.18
1         0.05

Users can drag each column into each placeholder.

Type 2: The table of one numerical and one categorical column

You’ve got a table with two columns: a numerical column holding the data of both samples being compared, and a categorical column with two unique labels that assign each value to a group. The order of these labels does not matter.

Value   Group
0.79    Group 1
0.78    Group 1
0.97    Group 1
1.08    Group 1
0.9     Group 1
0.87    Group 2
0.78    Group 2
1.05    Group 2
0.8     Group 2
0.79    Group 2

In this case, users need to put the Value into the Column 1 placeholder and Group into the Column 2 placeholder.

Type 3: The table that has a replication for each observation

Sample ID

Group 1

Group 2

1

1.15

0.75

0.67

0.32

2

0.84

1.02

0.88

0.2

3

0.94

1.04

0.78

0.18

5

1.09

1

0.99

0.05

In this case, you must set the number of replications (which is 2 for this example) when uploading/adding data. After that, BioVinci will automatically create a transformed table, and you can use this to run the test in the same way as Type 2.

Parameters

Hypothesis test

Two sample t-test

This is the most common parametric test that compares two independent samples. The test assesses whether there is any difference between the means of two samples. It assumes all variables are distributed normally and have the same variance.

Welch Two Sample t-test

This test also assesses whether there is any difference between the means of two independent samples. It assumes all variables are distributed normally without necessarily having the same variance.

Paired t-test

This test assesses whether there is any difference between the means of two dependent samples. It assumes all variables are distributed normally. The test can only work when the two samples have the same number of observations.

Wilcoxon Rank Sum test

This test is a nonparametric alternative to the Two Sample t-test. It assesses the difference between two samples based on the sum of the ranks of all observations. Thus, it does not assume that the samples are normally distributed.

Wilcoxon Signed Rank test

This test is a nonparametric alternative to the Paired t-test. It assesses the difference between two samples based on the signed ranks of the differences within each pair of observations. Similarly, the test can only work when the two samples have the same number of observations.

F test of Equality of Variances

The test assesses the equality of the variance of two independent samples. It assumes that all samples are distributed normally.

Alternative hypothesis

Users can choose whether this is a two-tailed or one-tailed test.
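The difference between the Two sample t-test and the Welch Two Sample t-test lies in how the variance and the degrees of freedom are computed. The Python sketch below illustrates this with the Group 1 and Group 2 example data. It is not BioVinci's implementation, the function names are illustrative, and the p value is omitted because it requires the t distribution, which the standard library does not provide.

```python
import math
import statistics

def student_t(a, b):
    """Two sample t-test statistic with a pooled (equal) variance."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

def welch_t(a, b):
    """Welch t-test statistic; variances are not assumed to be equal."""
    na, nb = len(a), len(b)
    sa, sb = statistics.variance(a) / na, statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sa + sb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df

group_1 = [1.15, 0.75, 0.84, 1.02, 1.01, 0.88, 0.94, 1.04, 1.09, 1]
group_2 = [0.67, 0.88, 0.05, 0.78, 0.99, 0.32, 0.2, 0.2, 0.18, 0.05]

t_student, df_student = student_t(group_1, group_2)
t_welch, df_welch = welch_t(group_1, group_2)
print(t_student, df_student)
print(t_welch, df_welch)
```

Because both groups have the same size here, the two t statistics coincide and only the degrees of freedom differ. With unequal group sizes, the statistics differ as well.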

Result

All the tables and graphics below are from the following example data (which is a Type 1 table). All the tests are two-tailed.

Group 1   Group 2
1.15      0.67
0.75      0.88
0.84      0.05
1.02      0.78
1.01      0.99
0.88      0.32
0.94      0.2
1.04      0.2
1.09      0.18
1         0.05

Visualizations

Boxplot

This is the only visualization for this function. If the two samples are significantly different, the software will add an asterisk to the boxplot. The annotation depends on the p value: * for p < 0.05, ** for p < 0.01, *** for p < 0.001, and **** for p < 0.0001. Some journals may have a different requirement for p value annotation. Thus, users can customize the number of asterisks in the edit mode.

Tables

Statistics

The result table differs among different hypothesis tests.

Two Sample t-test

Welch Two Sample t-test

Paired t-test

Wilcoxon Rank Sum test

Wilcoxon Signed Rank test

F test of Equality of Variances

F. One-way ANOVA, Kruskal-Wallis test, Bartlett test, and Levene test

For impatient people

  1. Prepare your dataset

Value   Group
0.78    Group 1
0.17    Group 1
0.2     Group 1
0       Group 1
0.95    Group 2
0.95    Group 2
0.8     Group 2
0.82    Group 2
1.02    Group 3
0.54    Group 3
0.53    Group 3
0.96    Group 3

  2. Import it

  3. Choose the Analysis tab and opt for the function

  4. Drag your data into the placeholders

  5. Get the result

Inputs

All the data structures below are valid for this test.

Type 1: One numerical column and one categorical column

This table contains one numerical column holding values and one categorical column holding group labels of all the observations.

Value   Group
0.78    Group 1
0.17    Group 1
0.2     Group 1
0       Group 1
0.95    Group 2
0.95    Group 2
0.8     Group 2
0.82    Group 2
1.02    Group 3
0.54    Group 3
0.53    Group 3
0.96    Group 3

To run the test, users can put the numerical column and the categorical column to the Values and Factor placeholders, respectively.

Type 2: Each column represents a group

This table contains multiple columns, each of which holds the values of a group (or a sample). Please note that BioVinci requires your table to include an ID column in order to run the test, as shown below:

ID   Group 1   Group 2   Group 3
1    0.78      0.95      1.02
2    0.17      0.95      0.54
3    0.2       0.8       0.53
4    0         0.82      0.96

With this structure, users must set the number of replications to 1 when importing/adding the data. After that, BioVinci will automatically create the transformed data, and you can use it to run the test in the same way as Type 1.

Type 3: Each group/sample has multiple replications

This table has replications for each observation.

                                   

ID

Group 1

Group 2

Group 3

1

0.78

0.17

0.95

0.95

1.02

0.54

2

0.2

0

0.8

0.82

0.53

0.96

In this case, users must set the number of replications (which is 2 in this example) when importing/adding the data. After that, BioVinci will automatically create the transformed data, and you can use it to run the test in the same way as Type 1.
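The transformation from a table with replications to the Type 1 layout can be pictured with the following Python sketch. It is an assumption about the general idea, not BioVinci's actual code: the function name wide_to_long is hypothetical, the replicate values of a group are assumed to sit next to each other in a row, and the column names and row order of the table that BioVinci generates may differ.

```python
def wide_to_long(header, rows, replications):
    """Turn rows of [ID, values...] into (value, group) pairs."""
    groups = header[1:]  # skip the ID column
    long_rows = []
    for row in rows:
        values = row[1:]
        if len(values) != len(groups) * replications:
            raise ValueError("row length does not match the number of replications")
        for g, group in enumerate(groups):
            start = g * replications
            for value in values[start:start + replications]:
                long_rows.append((value, group))
    return long_rows

header = ["ID", "Group 1", "Group 2", "Group 3"]
rows = [
    [1, 0.78, 0.17, 0.95, 0.95, 1.02, 0.54],
    [2, 0.2, 0, 0.8, 0.82, 0.53, 0.96],
]

long_table = wide_to_long(header, rows, replications=2)
for value, group in long_table:
    print(value, group)
```

With these inputs, each group ends up with the same four values as in the Type 1 example table.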

Parameters

Hypothesis test

One-way ANOVA

This test assesses whether there is no difference among the means of all independent samples. It assumes all samples are normally distributed and have the same variances.

Kruskal Wallis

This test assesses whether there is no difference among the mean ranks of all independent samples. It is a non-parametric alternative to One-way ANOVA. It does not assume the normality of the samples.

Bartlett test

This test assesses whether all the samples have equal variances. It is essential for many statistical tests that assume homogeneity of variances. It is sensitive to departures from normality. So if your samples derive from non-normal distributions, a significant result from Bartlett’s test may simply reflect the non-normality rather than unequal variances.

Levene test

This test is an alternative to Bartlett’s test, but less sensitive to the departure from normality. You should use this test only when the samples derive from non-normal distributions.
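As an illustration of what One-way ANOVA computes, the Python sketch below derives the F statistic from the between-group and within-group sums of squares, using the three example groups. It is not BioVinci's implementation. The function name one_way_anova_f is hypothetical, and the p value is omitted because it requires the F distribution.

```python
import statistics

def one_way_anova_f(groups):
    """Return the F statistic and its two degrees of freedom."""
    all_values = [v for group in groups for v in group]
    grand_mean = statistics.mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

group_1 = [0.78, 0.17, 0.2, 0]
group_2 = [0.95, 0.95, 0.8, 0.82]
group_3 = [1.02, 0.54, 0.53, 0.96]

f_value, df1, df2 = one_way_anova_f([group_1, group_2, group_3])
print(f_value, df1, df2)
```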

Post-hoc test

At the moment, BioVinci can only perform a post-hoc test with One-way ANOVA. These are single-step multiple comparison procedures. They assess the pairwise differences between all samples to check whether two samples come from the same distribution/population.

HSD

This test is also called the Tukey’s range test or the Tukey’s honest significant difference. Users can perform it in conjunction with the one-way ANOVA. It assumes that all samples are independent, have equal variances, and come from normal distributions.

LSD

This test is also called the Fisher’s least significant difference test. It is basically a pairwise t-test on all samples; thus, it inherits all the assumptions of a Two-sample t-test. You should only use this test when the result of one-way ANOVA is significant.

Duncan

This test is also called the Duncan’s multiple range test. After sorting all the samples by means, it provides the information of ranges. Thus, it is more permissive than the HSD procedure, which provides information of all pairs of samples.

Result

To explain the result of this function, we use the example data below. Please remember that this table belongs to Type 2 (where each column represents a group), so you have to set the number of replications to 1 when you import your data.

ID   Group 1   Group 2   Group 3
1    0.78      0.95      1.02
2    0.17      0.95      0.54
3    0.2       0.8       0.53
4    0         0.82      0.96

Visualizations

Box plot

Box plots visualize each sample as a box with whiskers. Asterisks only appear when users choose a post-hoc test and at least one pair of samples is significantly different.

Bar plot

Similar to box plots, this is another common way to show the results of a multiple comparison.

Tables

Statistics

The table shows the statistical results that help you evaluate your comparison. It differs among the hypothesis tests.

One-way ANOVA

Kruskal Wallis

Bartlett test

Levene test

Post-hoc comparison

The table shows the pairwise comparison among samples. The number of asterisks depends on the p value of a particular pair of samples: * for p < 0.05, ** for p < 0.01, *** for p < 0.001, and **** for p < 0.0001. Some journals may have a different requirement for p value annotation. Users can customize the number of asterisks in the edit mode. The table below shows the results of an HSD post-hoc test.

G. Two-way ANOVA

For impatient people

1. This is the dataset                 

Treatment

Cancer

Control

Drug A

0.93

1.24

0.95

1.46

0.8

0.02

Drug B

2.15

1.6

1.77

1.58

1.31

1.06

placebo

1.49

1.88

1.97

0.91

0.08

0.94

2. Import it

3. Choose the Analysis tab and opt for Two-way ANOVA

4. Provide the inputs

5. Get the results

Inputs

Type 1: One numerical column & Two categorical columns

You’ve got a dataset structured as below: one column that contains all the numerical values, and two categorical columns to classify such values into groups.

                                   

Treatment   Phenotype   Value
Drug A      Cancer      0.93
Drug B      Cancer      2.15
placebo     Cancer      1.49
Drug A      Cancer      1.24
Drug B      Cancer      1.6
placebo     Cancer      1.88
Drug A      Cancer      0.95
Drug B      Cancer      1.77
placebo     Cancer      1.97
Drug A      Control     1.46
Drug B      Control     1.58
placebo     Control     0.91
Drug A      Control     0.8
Drug B      Control     1.31
placebo     Control     0.08
Drug A      Control     0.02
Drug B      Control     1.06
placebo     Control     0.94

With this data structure, you can easily run the function by dragging each column to the appropriate placeholder. More specifically, you should place the numerical column at Values and two categorical columns at the Factor 1 and Factor 2 placeholders.

Type 2: One categorical column & Two numerical columns

This is the most common data structure, although it is not canonical. The first column represents the first grouping factor, while the others represent the second.

                                   

Treatment   Cancer   Control
Drug A      0.93     1.46
Drug B      2.15     1.58
placebo     1.49     0.91
Drug A      1.24     0.8
Drug B      1.6      1.31
placebo     1.88     0.08
Drug A      0.95     0.02
Drug B      1.77     1.06
placebo     1.97     0.94

With this kind of data, you must set the number of replications to 1. After that, you can use the transformed table in the same way as a Type 1 table.

Type 3: Table with replications

This closely resembles the Type 2 table, but with replications. First, you need to tell the software the number of replications when you import your dataset. In this example, the number of replications is 3.

Treatment

Cancer

Control

Drug A

0.93

1.24

0.95

1.46

0.8

0.02

Drug B

2.15

1.6

1.77

1.58

1.31

1.06

placebo

1.49

1.88

1.97

0.91

0.08

0.94

After that, BioVinci will transform your table into a Type 1 table. Now you can simply run the function in the same way as Type 1 mentioned above.

Parameters

At the moment, only one parameter is available: the Two-way ANOVA (in the Hypothesis test drop-down list). Future versions may offer more two-way hypothesis tests.

Result

Visualizations

BioVinci visualizes the results as a grouped bar plot with error bars. Users can freely switch to a boxplot or a violin plot if appropriate.

Tables

BioVinci will display the comparison results in two tables corresponding to each factor.

Statistic of Factor 1

Statistic of Factor 2

H. Pearson's chi squared test, Fisher's exact test, and Barnard's test

For impatient people

1. This is the dataset

Gender   Left handed   Right handed
Male     44            8
Female   47            3

2. Import it

3. Choose the Analysis tab and opt for 2x2 contingency table

4. Drag your inputs into the placeholder and choose the appropriate test

5. Get the result

Inputs

All the structures below are valid for this test.

Type 1

This is the most common structure of a 2x2 contingency table.

Gender   Left handed   Right handed
Male     44            8
Female   47            3

Type 2

This table does not include the grouping column. In this case, the software will assign the two rows to Group 1 and Group 2, respectively.

Left handed   Right handed
44            8
47            3

Parameters

Hypothesis test

This drop-down list allows users to select the appropriate hypothesis test to assess the association between two categorical variables.

Pearson (Pearson's chi squared test)

This is the most widely used test of independence for a contingency table. It assesses how well the observed distribution fits the distribution that would be expected if the variables were independent.

It assumes that the sample data are collected by random sampling from a fixed distribution/population and all observations are independent of each other. This test should not be used when any expected cell count is less than 5.

Fisher (Fisher's exact test)

Fisher's exact test assesses whether there is an association between the two variables by comparing the proportions. It assumes that the marginal totals are fixed (conditioned). This test performs well for small sample sizes.

Barnard (Barnard's test)

Barnard's test is also an exact test that examines the association between two categorical variables. Unlike Fisher's exact test, it conditions on only one set of marginal totals and estimates the nuisance parameter. For a 2x2 contingency table, it is a more powerful alternative to Fisher's exact test.
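To show what the first two tests calculate, the following Python sketch computes Pearson's chi squared statistic (with expected counts) and a two-sided Fisher's exact p value for the example table. It rests on stated assumptions rather than on BioVinci's code: the function names are illustrative, the chi squared version applies no continuity correction, and the two-sided Fisher p value sums all tables that are no more probable than the observed one, which is a common convention. BioVinci's settings may differ. Barnard's test is left out because it requires a search over the nuisance parameter.

```python
import math

def chi_square_2x2(table):
    """Pearson's chi squared statistic, p value (1 df), and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    expected = [[row_totals[i] * col_totals[j] / n for j in range(2)] for i in range(2)]
    statistic = sum(
        (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2) for j in range(2)
    )
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, expected

def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p value with all marginal totals fixed."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)

    p_observed = prob(a)
    low, high = max(0, c1 - r2), min(r1, c1)
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_observed * (1 + 1e-7))

# Rows: Male, Female. Columns: Left handed, Right handed.
handedness = [[44, 8], [47, 3]]
statistic, p_chi, expected = chi_square_2x2(handedness)
p_fisher = fisher_exact_2x2(handedness)
print(statistic, p_chi, expected)
print(p_fisher)
```

The expected counts returned by chi_square_2x2 are the values to check against the rule that Pearson's test should not be used when an expected count is below 5.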

Transpose

If you have a Type 2 contingency table, you can choose to transpose it before the test.

Result

The visualizations for different hypothesis tests are quite similar. Here are the example data (a Type 1 contingency table) for all the visualizations below.

Gender   Left handed   Right handed
Male     44            8
Female   47            3

Visualizations

Bar chart

BioVinci will generate a stacked bar chart to display two proportions.

Heatmap

This is your input contingency table with a color scale.

Residuals

This heatmap shows the differences between the observed values and the expected values. It only appears in Pearson's chi squared test.

Nuisance parameter

This line chart shows how the p value changes with the nuisance parameter. This visualization only appears in Barnard's test.

Tables

Summary table

The summary table shows all the basic statistics, corresponding to the hypothesis test that you choose.

Pearson's chi squared test

Fisher's exact test

Barnard's test

Transformed data

You can find the transformed version of your contingency table on the left-hand side. This structure is compatible with all types of basic plots, so you can use it to construct plots of your choice.

Residuals

The table shows the residual values (which compose the Residuals heatmap). It only appears in Pearson's chi squared test.
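The sketch below shows, in plain Python, how expected counts and residuals can be derived from the example contingency table. It assumes the residual is simply observed minus expected, as described for the Residuals heatmap; BioVinci may apply additional standardization, so treat the numbers as illustrative.

```python
observed = [[44, 8],   # Male:   left handed, right handed
            [47, 3]]   # Female: left handed, right handed

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Raw residuals: observed minus expected.
residuals = [[o - e for o, e in zip(o_row, e_row)]
             for o_row, e_row in zip(observed, expected)]

# Pearson's chi-squared statistic (no continuity correction).
chi2 = sum((o - e) ** 2 / e
           for o_row, e_row in zip(observed, expected)
           for o, e in zip(o_row, e_row))

# Rule of thumb: the test is unreliable when an expected count is below 5.
small_cells = [e for row in expected for e in row if e < 5]

for name, row in zip(["Male", "Female"], residuals):
    print(name, [round(v, 3) for v in row])
print("chi-squared:", round(chi2, 3), "| cells with expected < 5:", len(small_cells))
```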

Nuisance table

This table presents the numbers used for the Nuisance parameter line chart. It only shows up in Barnard's test.


Machine learning functions

A. Random forest

Random forest is an ensemble learning method for classification, which constructs a multitude of decision trees. Classification results will be aggregated from all these decision trees.
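As a rough illustration of how results from many trees can be aggregated, the Python sketch below takes one predicted label per tree and returns the most common one (majority voting). The votes are hypothetical, and the helper name majority_vote is ours, not a BioVinci function.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for one iris flower.
votes = ["Iris-versicolor", "Iris-virginica", "Iris-versicolor",
         "Iris-versicolor", "Iris-virginica"]
predicted = majority_vote(votes)
print(predicted)
```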

This section illustrates how to construct the classification model (Random Forest).

For impatient people

1. This is the dataset

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

2. Import it

3. Choose the Analysis tab and opt for Random forest

4. Prepare features

5. Provide the inputs to the placeholders

6. Get the results

Inputs

We use the iris dataset to explain the inputs of the Random Forest functions.

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

Class

This placeholder should contain a categorical column, which classifies each observation by labels. For the iris dataset (in the example above), you should put the Species column into this placeholder.

Features

This placeholder should contain a matrix of numerical columns. Each variable (column) is a feature. For the iris dataset, you have to create a table that does not include the Species column (please refer to our detailed instructions on how to create variables). You can also drag multiple numerical columns to this placeholder instead, but the Transpose option will not work in this case.

Parameters

Number of trees

The number of decision trees that Random forests generate (Default is 100).

Seed state

This is the initial number for random algorithms in Random forests (Default is 1). Random forest is stochastic. It is important to set a random state so that your result is reproducible.

Results

We use the Iris dataset as an example for all the instructions below. The dataset contains 150 records with 5 attributes: Petal Length, Petal Width, Sepal Length, Sepal Width, and Class. You can download the dataset here:

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

Visualizations

Error rate

The figure presents the error rate of the Random Forest method for the Iris dataset. The horizontal axis represents the number of trees, the vertical axis represents the error rate (%), and the lines (in various colors) represent groups. When the number of trees changes, the error rate changes correspondingly.

Importance Matrix

The software shows the importance of the features for each class in a heatmap (as above).

Confusion Matrix

This is the Confusion matrix for the Iris dataset. Cell (i, j) of the matrix contains the number of samples known to be in group i but predicted to be in group j. In the figure above, 50 samples belonging to Iris-setosa are predicted to be in the Iris-setosa group, and 5 samples belonging to Iris-versicolor are predicted to be in Iris-virginica.

Importance Table

This heatmap shows the mean decrease in accuracy and the GINI score of each feature.

Tables

Importance table

The table displays the importance of the features in each class and in general. It provides information for the heatmap of Importance matrix and Importance table.

Outliers

This table shows the residuals of the predictions compared to the observed result. It has two columns: the measurement (residual) and the row index. The order of observations in this table is the same as in the input data.

Class margins

The table shows the margin of the true class in each observation.

Confusion Matrix

This table shows the classification accuracy of the prediction. The diagonal values are the number of correct classifications, while off diagonal values are incorrect ones. The last column is the error rate.
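The following Python sketch shows how a per-class error rate can be read off a confusion matrix. The setosa and versicolor rows follow the numbers quoted for the Confusion Matrix figure (assuming the remaining 45 versicolor samples are classified correctly); the virginica row is invented purely for illustration.

```python
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows are the known class, columns are the predicted class.
confusion = [
    [50, 0, 0],   # all setosa samples predicted correctly
    [0, 45, 5],   # 5 versicolor samples predicted as virginica
    [0, 4, 46],   # hypothetical row, not taken from BioVinci output
]

error_rates = {}
for i, (name, row) in enumerate(zip(classes, confusion)):
    correct = row[i]                       # diagonal value
    error_rates[name] = 1 - correct / sum(row)
    print(f"{name}: error rate = {error_rates[name]:.2f}")

# Overall error rate: all off-diagonal counts divided by the total.
total = sum(sum(row) for row in confusion)
correct_total = sum(confusion[i][i] for i in range(len(classes)))
overall_error = 1 - correct_total / total
print(f"overall error rate = {overall_error:.2f}")
```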

Error rate

The table shows the changes in error rates when the number of trees increases. The Feature column lists the labels for the error rates of all the classes. The General label is the overall error rate.

Time used

The table shows how many times a feature was used to create a branch in the decision tree.

B. k-means Clustering

k-means clustering is a popular method for cluster analysis in machine learning. k-means clustering aims to partition n samples into k clusters in which each sample belongs to the cluster with the nearest mean.

For impatient people

1. This is the dataset

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

2. Import it

3. Choose Analysis tab and select k-means clustering

4. Drag the input data to appropriate placeholders and set the suitable parameters

5. Get the results

Inputs

In this section, we use the iris dataset to explain the inputs of the k-means clustering function.

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

Two dimensions

Users need to drag two dimensions, which must be numeric columns, to the A column / matrix and the 2nd coordinate (optional) placeholders, respectively (as with Petal.Width and Sepal.Width in the example below).

More than two dimensions

If you want to perform k-means clustering for high dimensional data, you need a table where each column is a feature, and drag it to the first placeholder (A column/matrix).

For the iris dataset, users can create a new variable that excludes the Species column (please refer to our detailed instructions on how to create variables). In this case, you do not need to drag any data to the 2nd coordinate (optional) placeholder.

Parameters

Number of means

This field defines the number of clusters as well as the number of centroids to generate.

Seed state

This field defines the initial number for the random algorithms in k-means clustering (Default is 1). k-means clustering is stochastic. It is important to set a random state so that your result is reproducible.
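To show why the seed matters, here is a compact Python sketch of the standard k-means procedure (Lloyd's algorithm) in which the seed decides which points serve as the initial centers. It is a teaching example under our own assumptions (random initial centers picked from the data, made-up 2D points), not BioVinci's implementation.

```python
import random

def kmeans(points, k, seed=1, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = random.Random(seed)            # the seed fixes the initial centers
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])   # keep the center of an empty cluster
        if new_labels == labels and new_centers == centers:
            break                                # nothing changed: converged
        labels, centers = new_labels, new_centers
    return labels, centers

# Made-up points forming two visible groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
run1 = kmeans(points, k=2, seed=1)
run2 = kmeans(points, k=2, seed=1)
print("same result with the same seed:", run1 == run2)
```

Running the function twice with the same seed gives identical labels and centers, which is the reproducibility that the Seed state field is meant to provide.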

Results

Visualizations

Basic visuals

Two dimensions

This is a Voronoi diagram. The number of partitions (or colors) is equal to the number of clusters.

More than two dimensions

This line plot shows the changes of values in each feature. The number of colors indicates the number of clusters (which is 3 in this example).

Cluster's information

This heatmap shows the information of the clusters. Sum.of.squares is the sum of squared distances from each data point to the center of each cluster. Size is the number of values in each cluster.

Tables

Basic information

This table provides the information for the Cluster’s information heatmap.

Clusters

This table shows which cluster each value belongs to.

Centers

The table shows the coordinates of the center of each cluster.

C. Hierarchical clustering

For impatient people

1. This is the dataset

                                   

Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6   Sample 7   Sample 8   Sample 9   Sample 10
0.48       2.62       2.69       0.61       0.07       2.97       0.71       2.76       0.8        2.77
0.99       2.02       2.15       0.82       0.18       2.3        0.24       2.46       0          2.56
0.06       0.27       0.55       0.87       0.75       0.32       0.83       0.74       0.73       0.48
2.66       2.75       2.43       2.73       2.89       2.07       3          0.5        2.6        0.77
2.52       2.45       2.28       2.22       2.99       2.21       2.31       0.17       2.6        0.17
0.8        2.99       2.26       0.45       0.11       2.46       0.27       2.02       0.2        2.41
0.94       2.65       2.01       0.1        0.39       2.78       0.71       2.47       0.35       2.73
0.68       0.39       0.72       0.39       0.88       0.45       0.82       0.99       0.52       0.63
2.69       2.85       2.26       2.78       2.67       2.29       2.32       0.79       2.38       0.63
2.41       2.81       2.57       2.7        2.21       2.03       2.12       0.31       2.36       0.03
2.79       2.43       2.09       2.83       2.31       2.02       2.54       0.58       2.97       0.82
2.65       2.64       2.24       2.72       2.6        2.1        2.56       0.53       2.68       0.26
0.04       2.67       2.81       0.24       0.94       2.34       0.36       2.35       0.18       2.49
0.03       2.39       2.8        0.17       0.53       2.87       0.96       2.56       0.19       2.48
0.58       0.19       0.4        0.23       0.01       0.89       0.93       0.86       0.77       0.15
2.96       2.49       2.79       2.55       2.91       2.61       2.41       0.34       2.17       0.92
2.46       2.24       2.79       2.84       2.25       2.77       2.75       0.77       2.71       0.84
0.92       2.86       2.23       0.46       0.94       2.04       0.45       2.76       0.32       2.11
0.39       2.01       2.79       0.48       0.4        2.48       0.45       2.39       0.23       2.72
0.69       0.37       0.95       0.77       0.77       0.8        0.18       0.05       0.22       0.61

2. Import it

3. Choose Analysis tab and select Hierarchical clustering

4. Provide the inputs

5. Get the result

Inputs

You just need to make sure that all numeric columns in the table are appropriate for the calculation of distances. This function automatically excludes all categorical columns.

For example, you’ve got a table as below.

Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6   Sample 7   Sample 8   Sample 9   Sample 10
0.48       2.62       2.69       0.61       0.07       2.97       0.71       2.76       0.8        2.77
0.99       2.02       2.15       0.82       0.18       2.3        0.24       2.46       0          2.56
0.06       0.27       0.55       0.87       0.75       0.32       0.83       0.74       0.73       0.48
2.66       2.75       2.43       2.73       2.89       2.07       3          0.5        2.6        0.77
2.52       2.45       2.28       2.22       2.99       2.21       2.31       0.17       2.6        0.17
0.8        2.99       2.26       0.45       0.11       2.46       0.27       2.02       0.2        2.41
0.94       2.65       2.01       0.1        0.39       2.78       0.71       2.47       0.35       2.73
0.68       0.39       0.72       0.39       0.88       0.45       0.82       0.99       0.52       0.63
2.69       2.85       2.26       2.78       2.67       2.29       2.32       0.79       2.38       0.63
2.41       2.81       2.57       2.7        2.21       2.03       2.12       0.31       2.36       0.03
2.79       2.43       2.09       2.83       2.31       2.02       2.54       0.58       2.97       0.82
2.65       2.64       2.24       2.72       2.6        2.1        2.56       0.53       2.68       0.26
0.04       2.67       2.81       0.24       0.94       2.34       0.36       2.35       0.18       2.49
0.03       2.39       2.8        0.17       0.53       2.87       0.96       2.56       0.19       2.48
0.58       0.19       0.4        0.23       0.01       0.89       0.93       0.86       0.77       0.15
2.96       2.49       2.79       2.55       2.91       2.61       2.41       0.34       2.17       0.92
2.46       2.24       2.79       2.84       2.25       2.77       2.75       0.77       2.71       0.84
0.92       2.86       2.23       0.46       0.94       2.04       0.45       2.76       0.32       2.11
0.39       2.01       2.79       0.48       0.4        2.48       0.45       2.39       0.23       2.72
0.69       0.37       0.95       0.77       0.77       0.8        0.18       0.05       0.22       0.61

Then you can either drag the whole table or multiple columns to the Matrix placeholder. But the Transpose option is only available for the former method.

Parameters

Clustering method

You can choose the clustering method here, among Ward’s minimum variance, Complete linkage, Single linkage, UPGMA, and WPGMA, to determine how BioVinci will calculate the distance between clusters.

Distance method

In this drop-down list, you can choose the appropriate metric for calculating the distance between variables: Euclidean, Maximum, Manhattan, Canberra, or Binary distance.
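For readers who want to see what these metrics compute, the Python sketch below implements the five distances using their common textbook definitions (the same ones used by the dist function in R). Whether BioVinci handles edge cases such as zeros in exactly this way is an assumption. The example vectors are the first three rows of Sample 1 and Sample 2 from the table above.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def maximum(x, y):
    # Largest absolute difference across all coordinates.
    return max(abs(a - b) for a, b in zip(x, y))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    # Terms where both values are zero are skipped to avoid dividing 0 by 0.
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if a != 0 or b != 0)

def binary(x, y):
    # Non-zero means "on". Among positions where at least one vector is on,
    # return the proportion where exactly one of them is on.
    either = sum(1 for a, b in zip(x, y) if a != 0 or b != 0)
    only_one = sum(1 for a, b in zip(x, y) if (a != 0) != (b != 0))
    return only_one / either if either else 0.0

sample_1 = [0.48, 0.99, 0.06]
sample_2 = [2.62, 2.02, 0.27]
for metric in (euclidean, maximum, manhattan, canberra, binary):
    print(metric.__name__, round(metric(sample_1, sample_2), 4))
```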

Cluster by Column

With this option, users can choose to show a dendrogram on the plot or not.

Cluster by Row

Please note that BioVinci always performs clustering on the columns. Here you can choose whether it should apply clustering on the rows, too, either with or without the dendrogram.

Log transform

You can choose to apply log transform for all values. If you select scaling or centering, the software will apply log transform before the scaling or centering function.

Transpose matrix

This option will transpose the matrix, then run the function with the newly transposed table.

Results

Visualizations

Hierarchical clustering

This is a heatmap with dendrograms that shows the clustering results on both rows and columns.

Column Distance

This heatmap shows the pairwise distance between the columns.

Row distance

The heatmap shows the pairwise distance between the rows. It only appears when users turn on the Cluster by Row option.

Tables

Clustered Data

This table shows the data after clustering. The Row name indicates the original index.

Column Distance

This table provides information for the Column distance heatmap.

Row Distance

The table provides information for the Row distance heatmap.

D. Principal component analysis (PCA)

For impatient people

1. This is the dataset

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

2. Import it

3. Choose the Analysis tab and select Principal component analysis

4. Provide the inputs

5. Get the result

Inputs

Case 1: Each column is a feature

Your data look like this:

                  

ID        Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Class
Sample 1  1.38       2.52       2.25       1.44       0.33       A
Sample 2  0.09       2.31       0.21       1.08       1.74       B
Sample 3  0.93       2.16       2.49       2.22       1.68       B
Sample 4  0.27       2.91       1.35       2.34       1.59       C
Sample 5  1.68       0.09       0.03       0.06       1.98       A
Sample 6  0.87       0.09       0          1.83       1.05       A
Sample 7  2.88       2.97       0.9        1.8        2.25       C
Sample 8  2.04       2.85       2.34       1.23       0.15       B
Sample 9  1.32       0.33       1.38       2.64       2.67       C

You want to run a PCA where each row is a data point on the plot. In this case, you can drag the whole table to the Features placeholder. The function will automatically remove all categorical columns from this table. If your samples are classified (as shown above in the Class column), you can drag this categorical column to the Class (optional) placeholder. If you don’t, the PCA plot will show all data points in a single color.

You can also drag multiple columns to the Features placeholder instead. But the Transpose matrix option will not work in this case.

Case 2: Each column is an observation

If you are working with transcriptomic data, this kind of table is common: each row is a gene and each column is a sample. In this case, you want to run a PCA where each column is a data point in the plot.

                                   

Gene             Sample 1  Sample 2  Sample 3  Sample 4  Sample 5
ENSG00000157782  0.48      2.62      2.69      0.61      0.07
ENSG00000157783  0.99      2.02      2.15      0.82      0.18
ENSG00000157784  0.06      0.27      0.55      0.87      0.75
ENSG00000157785  2.66      2.75      2.43      2.73      2.89
ENSG00000157786  2.52      2.45      2.28      2.22      2.99
ENSG00000157787  0.8       2.99      2.26      0.45      0.11
ENSG00000157788  0.94      2.65      2.01      0.1       0.39
ENSG00000157789  0.68      0.39      0.72      0.39      0.88
ENSG00000157790  2.69      2.85      2.26      2.78      2.67

To run the function, you can drag the whole table to the Features placeholder. Then, check the Transpose matrix box.

In many cases, you also have a metadata table that gives more details on the samples. Please note that, in this metadata table, the order of the sample should be the same as in the data table.

                                   

ID        Gender  Age group  Phenotype
Sample 1  Male    Elder      Benign
Sample 2  Male    Elder      Malignant
Sample 3  Female  Adult      Benign
Sample 4  Male    Adult      Malignant
Sample 5  Female  Adult      Malignant

If you have a metadata table, import it by clicking the Add data button on the top left corner (please refer to our instructions on Adding data). Then you can drag a categorical column from that metadata table to the Class (optional) placeholder.
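Because the metadata rows must be in the same order as the sample columns of the data table, it can help to check the order before importing. The short Python sketch below compares the two lists; the helper name same_sample_order is our own and the lists are typed in by hand from the example tables above.

```python
def same_sample_order(data_header, metadata_ids, id_column="Gene"):
    """True when the sample columns appear in the same order as the metadata rows."""
    samples = [name for name in data_header if name != id_column]
    return samples == metadata_ids

data_header = ["Gene", "Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5"]
metadata_ids = ["Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5"]

print(same_sample_order(data_header, metadata_ids))
```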

Parameter

Filter missing values by

This option determines how BioVinci handles missing values. If the filter mode is features (default), features with missing values will be excluded. If the filter mode is samples, samples with missing values will be excluded.
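The difference between the two filter modes can be sketched in a few lines of Python. Here rows are samples, columns are features, and None stands for a missing value; the function name filter_missing and the small matrix are illustrative assumptions, not part of BioVinci.

```python
def filter_missing(matrix, mode="features"):
    """Rows are samples, columns are features; None marks a missing value."""
    if mode == "samples":
        # Drop every row (sample) that contains a missing value.
        return [row for row in matrix if None not in row]
    if mode == "features":
        # Drop every column (feature) that contains a missing value.
        keep = [j for j, col in enumerate(zip(*matrix)) if None not in col]
        return [[row[j] for j in keep] for row in matrix]
    raise ValueError("mode must be 'features' or 'samples'")

data = [
    [1.38, 2.52, None],
    [0.09, 2.31, 0.21],
    [0.93, 2.16, 2.49],
]
by_features = filter_missing(data, "features")  # keeps 3 samples, 2 features
by_samples = filter_missing(data, "samples")    # keeps 2 samples, 3 features
print(len(by_features), "x", len(by_features[0]))
print(len(by_samples), "x", len(by_samples[0]))
```

Filtering by features keeps every sample but loses a column, while filtering by samples keeps every column but loses a row, so the better choice depends on which of the two you can afford to drop.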

Log transform

With this option, you can choose to apply log transform to all values. If you select scaling or centering, the software will apply log transform before the scaling or centering function.

Transpose matrix

This option will transpose the matrix, then run the function with the newly transposed table.

Results

We use the Iris dataset for all the visualizations below.

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

Visualizations

3D scatter plot

This plot presents samples in 3D space in which the axes are the first three principal components.

Proportion of Variance

Tables

Basic Information

Principal components

E. t-distributed stochastic neighbor embedding

For impatient people

1. This is the dataset

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

2. Import it

3. Choose Analysis tab and select t-distributed stochastic neighbour

4. Provide the inputs

5. Get the result

Inputs

Case 1: Each column is a feature

You’ve got a data table as below, each column of which represents a feature, and you want to run a t-SNE where each row is a data point.

                                   

ID        Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Class
Sample 1  1.38       2.52       2.25       1.44       0.33       A
Sample 2  0.09       2.31       0.21       1.08       1.74       B
Sample 3  0.93       2.16       2.49       2.22       1.68       B
Sample 4  0.27       2.91       1.35       2.34       1.59       C
Sample 5  1.68       0.09       0.03       0.06       1.98       A
Sample 6  0.87       0.09       0          1.83       1.05       A
Sample 7  2.88       2.97       0.9        1.8        2.25       C
Sample 8  2.04       2.85       2.34       1.23       0.15       B
Sample 9  1.32       0.33       1.38       2.64       2.67       C

To run this function, you can drag the whole table to the Features placeholder. The function will automatically exclude all categorical columns from your inputs. If your samples are classified (as shown above in the Class column), you can drag this categorical column to the Class (optional) placeholder. If you don’t, the t-SNE plot will show all data points in a single color.

You can also drag multiple columns to the Features placeholder, instead of the whole table. But the Transpose matrix option will not work in this case.

Case 2: Each column is an observation

If you are working with transcriptomic data, you will find this table familiar: each row is a gene and each column is a sample. And you want to run a t-SNE where each column is a data point in the plot.

                                   

                                   

Gene             Sample 1  Sample 2  Sample 3  Sample 4  Sample 5
ENSG00000157782  0.48      2.62      2.69      0.61      0.07
ENSG00000157783  0.99      2.02      2.15      0.82      0.18
ENSG00000157784  0.06      0.27      0.55      0.87      0.75
ENSG00000157785  2.66      2.75      2.43      2.73      2.89
ENSG00000157786  2.52      2.45      2.28      2.22      2.99
ENSG00000157787  0.8       2.99      2.26      0.45      0.11
ENSG00000157788  0.94      2.65      2.01      0.1       0.39
ENSG00000157789  0.68      0.39      0.72      0.39      0.88
ENSG00000157790  2.69      2.85      2.26      2.78      2.67

To run the function, drag the whole table to the Features placeholder. Then, activate the Transpose matrix option.

In many cases, you may have a metadata table that gives more details about the sample as below. Please note that, in this metadata table, the order of the samples should be the same as in the data table.

                                   

ID        Gender  Age group  Phenotype
Sample 1  Male    Elder      Benign
Sample 2  Male    Elder      Malignant
Sample 3  Female  Adult      Benign
Sample 4  Male    Adult      Malignant
Sample 5  Female  Adult      Malignant

If you have a metadata table, import it by clicking the Add data button on the top left corner (please refer to our instructions on Adding data). Then you can drag a categorical column from that metadata table to the Class (optional) placeholder.

Parameters

Theta

This is the trade-off between speed and accuracy for Barnes-Hut t-SNE. ‘theta’ is the angular size of a distant node as measured from a point. If this size is below ‘theta’, the node is used as a summary of all points contained within it. This method is not very sensitive to changes in this parameter in the range of 0.2 - 0.8. Angles less than 0.2 quickly increase the computation time, and angles greater than 0.8 quickly increase the error.

Perplexity

The perplexity controls how attention is balanced between the local and global aspects of your data. It is related to the number of nearest neighbors used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is not extremely critical since t-SNE is quite insensitive to this parameter.

Seed state

This number is used to initialize the random number generator, so that your result is reproducible.

Learning rate

The learning rate for t-SNE is usually from 10 to 1000. If the learning rate is too high, the data may look like a ‘ball’, with any point approximately equidistant from its nearest neighbors. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers. If the cost function gets stuck in a bad local minimum, increasing the learning rate may help.

Number of iterations

This is the maximum number of iterations for the optimization. It should be at least 250. By default, BioVinci sets this number at 1000.

Filter missing values by

This option determines how BioVinci handles missing values. If the filter mode is features (default), features with missing values will be excluded. If the filter mode is samples, samples with missing values will be excluded.

Log transform

Here you can choose to apply log transform for all values. If you select scaling or centering, the software will apply log transform before the scaling or centering function.

Transpose matrix

This option will transpose the matrix, then run the function with the newly transposed table.

Results

Visualizations

Basic visual

Iteration costs

Tables

t-SNE Coordinates

Iteration costs

F. Sparse canonical correlation analysis

For impatient people

1. Here are the datasets (2 CSV files)

https://drive.google.com/file/d/1121kl4PX5zjzBkkVehMulBmoZhiNXZm4/view?usp=sharing

https://drive.google.com/file/d/1ilwxRiaiF7hcvzEStTJMWg3zR_FDEvsg/view?usp=sharing

2. Import the first file

3. Import the second file (using the Add data button)

4. Choose Analysis tab and select Sparse canonical correlation analysis

5. Provide the inputs

6. Get the result

Inputs

You need to provide two data tables for the function to find the correlation between matrices. You can upload an Excel file that contains two sheets, or upload each CSV file manually (import the first one, then use the Add data function to upload the other).

Parameters

Number of canonical vectors

This determines the number of canonical vector pairs obtained from two matrices.

Number of permutations

Number of permutations to be run to select the best parameters.

Log transform

Here you can choose to apply log transform for all values. If you select scaling or centering, the software will apply log transform before the scaling or centering function.

Result

Visualizations

BioVinci can only visualize the results in the Correlation plot, which is a scatter plot with a diagonal line to help you visually estimate the correlation. It also shows an annotation for the correlation and the covariance. The number of plots depends on the number of canonical vectors.

Tables

BioVinci uses only the table of non-zero coefficients to list out the results. The number of tables is determined by the number of canonical vectors. Each table shows the coefficients of each row in the two matrices in the corresponding vector.


Editing functions

A. Edit mode

To use most of the editing functions, click on the Edit plot button right above your plot.

B. Points and lines

Once you’ve landed on the Edit plot mode, just click on any points and a setting panel will pop up to customize the appearance of the data points on your plot.

Fill

This setting adjusts the color or the pattern of your data points; it also allows changing their shape.

Stroke

With this setting, you can change the line color, line width, and line style.

Individual or Group

By default, the Group box is always checked each time you customize something, which means your changes will be applied for the whole group of data on your plot.

To customize a single data point without affecting the others, click on it and tick the Individual box before changing its color and pattern.

C. Boxes, bars, and violins

To customize the bars, boxes, and violins, users need to click on the object while in the Edit plot mode. The panel below will pop up.

You can adjust the color or pattern of the objects in Fill and the borderline in Stroke. To customize a single object without affecting the others in the same group, click on it and tick the Individual box before changing its color and pattern.

D. Background

Users can click on the background while in the Edit plot mode to change the background color and opacity. This setting only controls the plot background, not the whole plotting area. In the picture below, just the plot background is gray, while the other parts of the picture are transparent.

E. Grid lines

Hide/show grid lines

Users can hide or show the gridlines by using the Grid line tick box on the top corner in Edit plot. This option affects both X and Y gridlines.

Edit grid lines

You can click on any grid lines in the plot to customize their width, style, and color. To hide just the horizontal grid lines or vertical grid lines on the plot, click on them, and uncheck the Show grid box.

Edit zero lines

To customize the zero line, click on it when you’re in the Edit plot mode. You can freely edit its width, style, color, and even choose to hide it.

F. Legend

Relocate the legend

You can click and drag to move the legend around the plot area. This function is available in both normal mode and edit mode.

Hide/show the legend

Users can hide or show the legend by using the Legend tick box in Edit plot.

Edit legend

Users can edit the content of the legend by using double-clicks while in the edit mode. It only has visual effects and won’t affect the hover information.

G. Color scale

The color scale is a special setting for heatmaps in the Edit plot mode. Just click on the color scale on the heatmap, then choose among the sample scales on the left side and reverse it if you wish.

BioVinci also allows you to create your own color scale. Simply click each color, or enter the color code (HEX) and hit the plus button, one by one. If you’re not satisfied with your scale, click the Reset button (the rotating arrows) and start over.

The animation below shows how to customize the color scale both ways.

I. Text

By using the term Text, we mean all kinds of text on the plot, including the plot title, axis labels, legends, and annotations.

Relocate

You just need to turn on the Edit plot mode, then click and drag the text to reposition it.

Edit text

To open this text editing panel, click Edit plot and double-click on any text you wish to change. Besides changing the content, you can also customize the font family, font style, size, and color.

Add text

To add text, you can use the Add text button in the edit mode.

J. Axis

Tick values

In the Edit plot mode, users can click on any axis tick values to customize their appearance, the axis range and the number of ticks on the axis.

As shown above, you can adjust the angle by dragging the angle controller or entering the angle by yourself. You can also hide/show the tick values.

To adjust the number of ticks and the axis range, users can enter a suitable number to the text boxes on the right corner.

Axis lines

In the edit mode, users can click on the axis to customize its width, style and color.

K. Size

Plot size

All the size options are located on the bottom right corner. Make your plot ready for publication by using the common sizes that we offer. Otherwise, you can customize the size at your own choice.

One-column

This setting scales your plot into a square, which has the side length of 8.8 centimeters, or 3.46 inches.

Two-columns

This setting scales your plot into a square, which has the side length of 18.3 centimeters, or 7.2 inches.

Full screen

This setting scales your plot to fit the screen so you don’t have to scroll to view all the details.

Manual

The Custom icon represents the manual mode, where users can freely decide the width, height, and units of measurement (among centimeter, millimeter, inch, and pixel).

If you choose to size your plot in pixels, note that pixel density differs among screens (see PPI, pixels per inch, and PPCM, pixels per centimeter). Therefore, a plot measured in pixels may appear at a different physical size on different machines.
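The relationship between the units can be checked with a little arithmetic. The Python sketch below converts the one-column and two-column widths to inches and shows how the physical width of a plot sized in pixels depends on the screen's PPI. The 600-pixel width and the PPI values are arbitrary examples, not BioVinci defaults.

```python
CM_PER_INCH = 2.54

def cm_to_inches(cm):
    return cm / CM_PER_INCH

def pixels_to_cm(pixels, ppi):
    # Physical length of a run of pixels on a screen with the given density.
    return pixels / ppi * CM_PER_INCH

print("one-column:", round(cm_to_inches(8.8), 2), "in")
print("two-columns:", round(cm_to_inches(18.3), 2), "in")

# The same 600-pixel-wide plot on screens with different pixel densities.
for ppi in (96, 144, 220):
    print(ppi, "PPI ->", round(pixels_to_cm(600, ppi), 1), "cm wide")
```

Sizing in centimeters, millimeters, or inches avoids this dependence on the screen, which is why those units are the safer choice for publication figures.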

Margins

Users can adjust the margins by dragging the margin lines while in the Edit plot mode. This option is useful when you have long tick values that often go off the plotting area. The animation below illustrates how you can customize the margins.

L. Templates

To be ready for publication, a plot must meet many strict requirements: fewer colors, bold titles, a sans-serif font family, heavy lines, etc. But it is time-consuming to edit every single detail.

To tackle this issue, BioVinci offers users 3 publication-ready templates that can be applied for any plot type.

Publication-ready

Black and white

This template meets the most common standards for publication. It uses patterns (stripes, dots, etc.) or shapes (round, triangle, star, plus, etc.) instead of colors. All the fonts belong to a sans-serif font family. The titles and axes are bold with clear tick labels. Its configuration is also flexible for different settings, even within the same plot type.

Gray scale

This template has almost the same configuration as the Black and white, but it uses shades of gray instead. Each gray intensity is at least 20% different from each other so that the viewers can easily identify the groups.

Color

This template has almost the same configuration as the Black and white but it uses a unique set of colors, which is commonly used in many online articles.

Classic

This is the default template. It may not be suitable for publication, but it works well for many other purposes. With interactive objects and zooming functions, you can share and present the results to your partners and collaborators.

Other functions

A. Export data

Once you’re satisfied with the plot, hit the Export button. This allows you to create a portable PNG image. The Desktop Edition also offers vector-based graphics formats, including PDF, SVG, and EPS.

You can also export the data tables from analyses as CSV files; this export option is called Statistic.

B. Create variable

Some statistical and machine learning functions require you to prepare an appropriate matrix, which just includes some of the columns from your initial datasheet. In such cases, this function will help.

Common cases

Here are the instructions for some common cases in data analysis. The instructions use the example data below.

Sample     Tissue    Phenotype  Conc A  Conc B  Conc C  Conc D  Conc E  Conc F
Patient A  Liver     Benign     2.05    1.59    1.63    24.73   3.54    1.66
Patient A  Pancreas  Benign     2.29    0.92    1.95    21.21   8.55    0.91
Patient B  Liver     Benign     0.84    1.08    0.13    20.06   8.68    1.34
Patient C  Kidney    Benign     2.92    1.43    1.08    22.87   7.14    2.01
Patient C  Liver     Benign     1.87    0.99    1.11    20.28   2.96    0.73
Patient C  Pancreas  Benign     1.38    0.82    1.46    21.88   1.32    1.04
Patient D  Kidney    Benign     0.43    1.75    0.06    21.2    2.93    2.06
Patient D  Pancreas  Benign     1.06    0.79    0.23    24.88   2.66    0.78
Patient E  Kidney    Malignant  1.06    1.41    0.38    20.21   6.52    0.52
Patient E  Liver     Malignant  1.14    0.82    0.34    21.04   5.77    1.51
Patient F  Kidney    Malignant  1.02    1.28    0.13    21.15   1.87    3.16
Patient G  Kidney    Malignant  1.33    1.22    0.51    21.58   1.23    0.92
Patient G  Pancreas  Malignant  0.58    0.7     0.41    24.39   9.37    1.07
Patient H  Kidney    Malignant  2.56    1.82    0.89    20.81   1.91    0.33
Patient H  Liver     Malignant  1.75    1.37    1.41    23.58   10.12   2.22
Patient H  Pancreas  Malignant  0.3     0.44    1.23    21.98   10.83   0.08

Remove columns

The picture below shows how to remove the first 3 columns: Sample, Tissue, and Phenotype.

Users need to click the Create variable button on the top corner, then choose the Exclude option and list the column(s) that shouldn’t appear in the new matrix.

Listing columns one by one takes time if you want to remove many of them. Thus, BioVinci allows excluding a range of columns at once. You can define that range with the From and To fields (see how we exclude the first 3 columns in the picture below).

If you want to remove many columns that are not adjacent to each other, you can switch to the Include option and list the columns that you want to keep.

In many cases, the Exclude text columns option will be very helpful because it removes all categorical columns in your range. You can see how we keep Conc A, Conc C, Conc D, and Conc E columns in the picture below, combining Exclude and Exclude text columns options.
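The combined effect of Exclude and Exclude text columns can be mimicked with a small Python sketch. It assumes that a column counts as a text column when any of its values is a string, and that Conc B and Conc F are the names listed under Exclude; the exact settings in the picture may differ. The helper select_columns is hypothetical and only the first two rows of the example data are used.

```python
def select_columns(header, rows, exclude=(), exclude_text=False):
    """Return (new_header, new_rows) after dropping the unwanted columns."""
    keep = []
    for j, name in enumerate(header):
        if name in exclude:
            continue                     # listed explicitly under Exclude
        is_text = any(isinstance(row[j], str) for row in rows)
        if exclude_text and is_text:
            continue                     # categorical (text) column
        keep.append(j)
    return [header[j] for j in keep], [[row[j] for j in keep] for row in rows]

header = ["Sample", "Tissue", "Phenotype",
          "Conc A", "Conc B", "Conc C", "Conc D", "Conc E", "Conc F"]
rows = [
    ["Patient A", "Liver", "Benign", 2.05, 1.59, 1.63, 24.73, 3.54, 1.66],
    ["Patient A", "Pancreas", "Benign", 2.29, 0.92, 1.95, 21.21, 8.55, 0.91],
]

new_header, new_rows = select_columns(
    header, rows, exclude=("Conc B", "Conc F"), exclude_text=True)
print(new_header)
print(new_rows[0])
```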

Options

Variable name

This field is required. It defines the name that shows up in the data tree after you create a matrix. The picture below shows where the variable named “All numbers” appears.

Exclude / Include (Select columns)

Name box

List the columns in this field. With Include, the new matrix contains only these columns; with Exclude, they are removed from the new matrix.

If column names are duplicated, the software can still tell which columns you want to select. However, we do not recommend duplicated names because some functions may not work properly.

From / To

This option is an alternative to the Name box above, so you cannot use both at the same time. From / To defines the range of columns to keep or exclude, depending on whether you select the Include mode or the Exclude mode.
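The options above amount to a small set of rules, sketched here in Python. The function name and its parameters are hypothetical, and the sketch assumes that From and To are 1-based positions with both ends included, which the guide does not state explicitly. It also assumes unique column names.

```python
# Illustrative sketch only, not BioVinci's actual implementation.

def select_columns(columns, mode, names=None, start=None, end=None):
    """Select columns by a name list OR by a From/To range, never both.

    mode is "include" (keep the chosen columns) or "exclude" (drop them).
    start and end are assumed to be 1-based and inclusive.
    """
    if mode not in ("include", "exclude"):
        raise ValueError("mode must be 'include' or 'exclude'")

    by_name = names is not None
    by_range = start is not None or end is not None
    if by_name == by_range:
        raise ValueError("give either a name list or a From/To range, but not both")

    if by_name:
        wanted = set(names)
        chosen = [c for c in columns if c in wanted]
    else:
        first = 1 if start is None else start
        last = len(columns) if end is None else end
        chosen = columns[first - 1:last]

    if mode == "include":
        return list(chosen)
    chosen_set = set(chosen)
    return [c for c in columns if c not in chosen_set]


columns = ["Sample", "Tissue", "Phenotype", "Conc A", "Conc B"]
print(select_columns(columns, "exclude", start=1, end=3))
print(select_columns(columns, "include", names=["Tissue", "Conc A"]))
```

The first call drops the first three columns by range; the second keeps two non-adjacent columns by name.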

Select Rows

By default, all rows are selected. To change this, enter the starting and ending rows on either side of a colon (:).

E.g., type “1:12” to select Rows 1 to 12.

If the rows you want to select are not adjacent, enter several ranges separated by commas (,).

E.g., type “1:12, 14:18, 20:20” to select Rows 1 to 20 except Rows 13 and 19.
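To check what a row specification covers before typing it in, the syntax can be expanded with a few lines of Python. This is only a sketch of the notation described above (ranges written as start:end, both ends included, separated by commas); the function name is hypothetical, and BioVinci's own validation rules may differ.

```python
# Illustrative sketch only, not BioVinci's actual implementation.

def parse_row_ranges(spec):
    """Expand a string such as "1:12, 14:18" into a sorted list of row numbers."""
    rows = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in range {part!r}")
        start, end = int(start_text), int(end_text)
        if start < 1 or end < start:
            raise ValueError(f"invalid range {part!r}")
        rows.update(range(start, end + 1))  # both ends are inclusive
    return sorted(rows)


selected = parse_row_ranges("1:12, 14:18, 20:20")
print(len(selected))                       # 18
print(13 in selected, 19 in selected)      # False False
```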

Additional functions

Select from data review

Besides using the options, users can also select directly on the data review. Each selection updates the settings on the right accordingly.

Select by drawing an area

This is useful for small datasets because users can click and drag to select a range directly on the data table. This also changes the settings accordingly.

C. Manage plot

You can save a plot in a workset and load it anytime in BioVinci. The list of all your saved plots is on the right-hand side, below the home button. We call it the Gallery panel throughout this section.

Load a plot

To load a plot, simply click its thumbnail in the Gallery panel. If you are working on an unsaved plot, a warning message will pop up.

Save a plot

To save a plot, you can use the Save plot button. A thumbnail of that plot will appear in the Gallery panel.

Delete a plot

To delete a plot, you have to load it first. Then you can use the Delete button in the top corner.

Update a plot

Update features

The update function appears only when you are editing a saved plot. In that case, its thumbnail in the Gallery panel shows a Save icon. Click this icon to update your plot.

Update data

When you change your data, the thumbnails of the plots affected by the change will show a warning icon. You can choose whether to update these plots. If you do not, they will no longer match your current data.

Share a plot

You can share a plot with your collaborators by email or social media, even if they do not have BioVinci on their machines. Reviewers can only access the interactive figure. Your raw data table and parameters are inaccessible without your authorization.

To share the plot privately with your collaborators, enter their email addresses and your message in the text box. They will receive a static image of the plot together with your message. BioVinci will also send them a URL that links to the landing page, where they can see the interactive figure and leave comments using their Facebook accounts. The picture below shows an example of a landing page.

To share on social media, click the icon(s), and you will be directed to the corresponding platform, where you can add a few words about the plot. The picture below shows an example of a plot shared via Facebook.

D. Manage workset

Data storage

If you are using the Desktop Edition, all data are stored in the folder called BioTData. You can change this location in the software settings.

Create a workset

To create a workset, click Import your dataset on the main window/page.

Manual

You can use the sheet editor to enter or paste your data manually.

Upload a file

BioVinci supports CSV, TSV, XLS, and XLSX formats. If your Excel file has more than one sheet, BioVinci will create several data trees in your new workset.

Number of replications

This field determines how your data are transformed after importing. If your data contain merged column labels that indicate replicates in your experiment, you have to enter the number of replications so that BioVinci can transform your matrix into one suitable for running statistical functions.

 

Above is an example of a dataset with 3 replications. In this case, you need to input “3” into the Number of replications field upon importing your data. BioVinci will transform your data into a matrix as shown on the left side of the figure below.

If your data do not have merged labels, you can ignore this option. By default, the number of replications is set to 0, which means no data transformation is applied. You can find more information about this on our blog.

Delete a workset

You can delete a workset from the main window/page. Please consider carefully because this action is irreversible.

E. Suggested statistics

BioVinci can suggest appropriate statistical function(s) based on the data and the plot type that you are using. These suggestions are located in the top right corner. Hover the mouse over the thumbnail to see the suggested functions, and click to quickly apply them to the current input data.

F. Add data

Sometimes you need to use more than one table at a time; for example, a metadata table for the PCA and t-SNE functions, or a second data table for the Sparse canonical correlation. The Add data button on the menu bar at the top of the software supports this.

The window for adding data is similar to the importing window. You can input your data manually or upload the entire file. All options such as First row as headers and Number of replications work exactly the same as described in the Manage Workset section.

You can switch among all the datasets that you have added. Just click the gray button in the top left corner to show a drop-down list of all your sheets.

G. Edit data/variables

Change column types

There are 3 types of columns in BioVinci: numerical, categorical, and unique ID. You can change the type of a column by clicking on its type icon to open a drop-down list.

Numerical

When you change a column type to numerical, BioVinci will try to convert the content to numbers. Where the conversion fails, it stores a null value instead.
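The conversion rule can be pictured with a short Python sketch. It is not BioVinci's code: the function name is hypothetical, Python's None stands in for the null value, and the exact set of strings BioVinci accepts as numbers is not documented here, so Python's float() is used as an approximation.

```python
# Illustrative sketch only, not BioVinci's actual implementation.

def to_numerical(values):
    """Convert each value to a number; store None where conversion fails."""
    converted = []
    for value in values:
        try:
            converted.append(float(value))
        except (TypeError, ValueError):
            converted.append(None)  # stands in for the null value
    return converted


print(to_numerical(["1", "0.5", "-2.1", "n/a", ""]))
# [1.0, 0.5, -2.1, None, None]
```

Checking a column for entries such as "n/a" or stray text before changing its type avoids unexpected null values.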

Categorical

If you change a column type to categorical, the software will leave the content as is. Numbers, e.g., 1, 0.5, or -2.1, will become strings, e.g., “1”, “0.5”, or “-2.1”.

Categorical columns are very important because they classify the values of numerical columns into groups. Check the labels in categorical columns carefully, because the classification is case-sensitive. For instance, BioVinci treats “Benign” and “benign” as two different labels, so the resulting plot will show them as 2 separate groups.
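The consequence of case-sensitive labels is easy to reproduce outside BioVinci. The Python sketch below groups values by label exactly as written and then lists labels that differ only by letter case, which is one way to check a column before importing it. Both helper names are hypothetical, and the values are sample numbers chosen for illustration.

```python
# Illustrative sketch only, not BioVinci's actual implementation.
from collections import defaultdict


def group_values(labels, values):
    """Group numerical values by their label, comparing labels case-sensitively."""
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)
    return dict(groups)


def case_variants(labels):
    """Return sets of labels that differ only by letter case."""
    by_folded = defaultdict(set)
    for label in labels:
        by_folded[label.casefold()].add(label)
    return [sorted(variants) for variants in by_folded.values() if len(variants) > 1]


labels = ["Benign", "benign", "Malignant", "Benign"]
values = [2.92, 1.87, 1.06, 0.43]

groups = group_values(labels, values)
print(sorted(groups))          # ['Benign', 'Malignant', 'benign']
print(case_variants(labels))   # [['Benign', 'benign']]
```

Three groups appear where only two were intended; renaming "benign" to "Benign" in the data fixes this.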

Unique ID

If a column type is unique ID, the software will leave the content as is, as with categorical columns. This type is automatically assigned to a column whose values are all different from each other (the comparison is also case-sensitive).

All plot types and functions ignore Unique ID columns, except for the tooltips in scatter plots. This prevents users from accidentally selecting such columns to label the observations, which would create too many groups and harm performance.
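The rule for assigning the unique ID type can be approximated in Python as follows. The function name is hypothetical, and the sketch covers only the stated condition (all values different, compared case-sensitively). Whether BioVinci applies further conditions, for example to numerical columns, is not stated in this guide.

```python
# Illustrative sketch only, not BioVinci's actual implementation.

def looks_like_unique_id(values):
    """True when the column is non-empty and no value occurs twice.

    The comparison is case-sensitive, so "S1" and "s1" count as different.
    """
    return len(values) > 0 and len(set(values)) == len(values)


print(looks_like_unique_id(["S1", "S2", "s1"]))                        # True
print(looks_like_unique_id(["Patient C", "Patient C", "Patient D"]))   # False
```

A column such as Sample in the example table repeats its values, so it stays categorical and can still be used for grouping.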

Change column names

Users can double-click the column names on the data tree to edit them. However, the plot will not automatically respond to this change. To update the column name on the plot, you need to drag the renamed column to the placeholder again.

You can also use the Edit data function to edit column name(s) (please refer to the next section, Edit data).

Edit data

To edit your data, click the Edit data button in the upper left corner of the software. This function lets you rename columns, change the data, and add or remove columns or rows.

Data without replications (number of replications = 0)

If you upload your data with the number of replications set to 0, the editing interface will let you edit your data directly. The first row shows the column names and cannot be empty. Whenever you delete a column name, the software generates a new one automatically.

You can right-click and choose among the options available: insert row/column or remove row/column.

Data with replications (number of replications > 0)

If the number of replications is greater than 0, the editing interface will show two matrices: your original one and the transformed one.

Data with replications are processed differently from data without replications. The transformed matrix on the left shows how BioVinci reads your inputs. You can edit the original data on the right in the same way as data without replications and see the transformed data change correspondingly. Please note that the transformed data are not editable, except for the column names.

H. Switching between plot types

You can quickly change the plot type by using the plot menu bar at the bottom. If there is no input available (all the placeholders are empty), clicking any icon in that menu bar only changes the placeholder panel.

If the placeholders contain input(s), the software will activate the plot types that are valid for the current input(s). The validity of a plot type depends on the types of the input columns (numerical or categorical). In the picture below, the icons are colored red, blue, and gray for the current, valid, and invalid plot type(s), respectively.

For example, if you have a grouped bar chart, which has two categorical columns and one numerical column, you can quickly switch the current plot type to bar chart, violin, scatter, or line chart with this function.


List of dependencies

This is the list of all the R packages that BioVinci uses for the calculations related to statistics and machine learning. We also include the names of the functions and the versions of the packages.

Package         Version    Function
plotly          4.7.1      All basic plots
stats           3.4.2      All statistical functions; all machine learning functions
ggplot2         2.2.1      Kernel density curve; 95% confidence ellipse
randomForest    4.6-12     Random Forest
Rtsne           0.13       t-SNE
agricolae       1.2-8      Post-hoc test in multiple comparison
heatmaply       0.14.2     Hierarchical clustering
tseries         0.10-42    Jarque-Bera normality test
nortest         1.0-4      Pearson chi-squared normality test
deldir          0.1-14     Voronoi diagram
car             2.1-5      Levene test
Barnard         1.8        Barnard’s test
PMA             1.0.9      Sparse canonical correlation analysis