Drag-and-drop is the basis for operating in BioVinci. Draggable items are columns of your tabular data or even the whole table.

Sometimes, bringing data into a ready-to-process format can be time-consuming. To save users’ time, we provide a list of commonly used templates (examples). You can use these templates to get a sense of the data structures that are valid for an analysis.

Forget about using another SVG/photo editor to fine-tune your plots. BioVinci allows you to customize your plots in detail, including ticks, scales, rotations, fonts, sizes, colors, styles, etc. You can visit the Edit functions section to see what the software can do.

We always treat your data with the highest level of security and privacy. First, BioVinci does not change the original inputs (your XLSX or CSV files) but instead operates on a copy of the input files. Then, in the app version, these copied data are stored in a default BioTData directory in your home folder. If you want to back up, archive, transfer, or copy all of your BioVinci data, you’ll need to copy this entire folder. (Later, we hope to allow you to easily select individual projects.)

Behind BioVinci is a team of agile computational gurus. If you need a new feature, please don’t hesitate to drop us a line at: info@bioturing.com

SPECIAL NOTE FOR MACINTOSH USERS. In order to access certain features, you will need to perform a “right-click”. How to do this may depend on the system you are using, so please select “System Preferences…” in the Apple menu and then choose “Mouse” or “Trackpad”. Then make sure that “Secondary click” is enabled (for trackpad users, for example, this is under the “Point & Click” subsection and requires a two-finger click).

There are 8 basic plot types in BioVinci. Each plot type has a different set of placeholders, which are on the left side (next to the data tree). To create a plot, users drag column(s) from the data tree to the placeholders of a plot type.

Each plot type also has a unique set of options, which can be accessed by right-clicking on the plot area (Macintosh users, please see the special note above).

To illustrate how to create different types of histogram and density plots, we use the following example data. The table contains the concentrations of 6 markers (A to F) from 8 patients in three benign/malignant tissues (liver, pancreas, and kidney).

Sample | Tissue | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F |

Patient A | Liver | Benign | 2.05 | 1.59 | 1.63 | 24.73 | 3.54 | 1.66 |

Patient A | Pancreas | Benign | 2.29 | 0.92 | 1.95 | 21.21 | 8.55 | 0.91 |

Patient B | Liver | Benign | 0.84 | 1.08 | 0.13 | 20.06 | 8.68 | 1.34 |

Patient C | Kidney | Benign | 2.92 | 1.43 | 1.08 | 22.87 | 7.14 | 2.01 |

Patient C | Liver | Benign | 1.87 | 0.99 | 1.11 | 20.28 | 2.96 | 0.73 |

Patient C | Pancreas | Benign | 1.38 | 0.82 | 1.46 | 21.88 | 1.32 | 1.04 |

Patient D | Kidney | Benign | 0.43 | 1.75 | 0.06 | 21.2 | 2.93 | 2.06 |

Patient D | Pancreas | Benign | 1.06 | 0.79 | 0.23 | 24.88 | 2.66 | 0.78 |

Patient E | Kidney | Malignant | 1.06 | 1.41 | 0.38 | 20.21 | 6.52 | 0.52 |

Patient E | Liver | Malignant | 1.14 | 0.82 | 0.34 | 21.04 | 5.77 | 1.51 |

Patient F | Kidney | Malignant | 1.02 | 1.28 | 0.13 | 21.15 | 1.87 | 3.16 |

Patient G | Kidney | Malignant | 1.33 | 1.22 | 0.51 | 21.58 | 1.23 | 0.92 |

Patient G | Pancreas | Malignant | 0.58 | 0.7 | 0.41 | 24.39 | 9.37 | 1.07 |

Patient H | Kidney | Malignant | 2.56 | 1.82 | 0.89 | 20.81 | 1.91 | 0.33 |

Patient H | Liver | Malignant | 1.75 | 1.37 | 1.41 | 23.58 | 10.12 | 2.22 |

Patient H | Pancreas | Malignant | 0.3 | 0.44 | 1.23 | 21.98 | 10.83 | 0.08 |

A single-group histogram / density plot shows the distribution of only one variable.

To create this kind of plot, you can drag a numerical variable to the Value placeholder. In this example, we use column Conc A.

If you want to create a density plot for Conc A instead, you can customize the plot options using right-click (please see the special note for Macintosh users). We will explain these options in the next section.

A multiple-group histogram / density plot shows the distribution of two or more variables.

You can use a numerical variable as Value and a categorical variable as Color. In this example, we used Conc A and Phenotype, respectively. The software created 2 histogram/density curves for the two groups defined in Phenotype.

You can put multiple numerical variables in the Value placeholder to create a multiple-group histogram/density plot. In this example, we use Conc A and Conc B for Value.

You can split your histogram into different groups by using a categorical variable as Split by. In this example, we use Conc A, Phenotype, and Tissue as Value, Color, and Split by, respectively. The software first split the data into three parts, which are Liver, Pancreas, and Kidney. Then it created the plot separately for each part.

This option controls the component of the plot, whether it should be a histogram, or a density plot, or both.

Your histogram will display the frequency of different ranges in bars.

Your plot will show a curve instead. This is the kernel density curve of your data. You can turn on both histogram and density plot options.

This option lets users decide how different groups appear on the histogram.

All the groups will overlap each other and appear in a lexicographical order.

All the groups will stack up to 100 percent in lexicographical order.

It stacks all groups similarly to the Fill position but does not scale the total value to 100 percent.

You can add a vertical line to the plot to mark the mean or median with this option.

This option controls how the software calculates the height of the bars in the histogram. Thus, you can only see changes if the Histogram (in Components) is activated.

The height of a bar is equal to the counts of the value in its range.

The height of a bar is equal to the proportion of the counts. The total of the bars’ heights is equal to 100.

The height of a bar is equal to the proportion of the counts. The value is scaled to 1. Thus, the total of the bars’ heights is equal to 1.

The height of a bar is equal to the density of the value in its range. It is equal to the counts divided by the range’s size.

The height of a bar is equal to the probability of the density. It is equal to the density divided by the number of values in the variable.
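The five height calculations above can be compared side by side with a small sketch. This is not BioVinci's code: the function name `bar_heights`, the dictionary labels, and the bin edges are assumptions made for illustration, and the values are the Conc A column of the example table.

```python
def bar_heights(values, edges):
    """Return histogram bar heights under the five calculations described above."""
    n = len(values)
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            is_last = i == n_bins - 1
            # Bins are half-open [lo, hi); the last bin also includes its upper edge.
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    widths = [edges[i + 1] - edges[i] for i in range(n_bins)]
    density = [c / w for c, w in zip(counts, widths)]
    return {
        "count": counts,                                   # raw counts per range
        "percent": [100 * c / n for c in counts],          # heights sum to 100
        "probability": [c / n for c in counts],            # heights sum to 1
        "density": density,                                # counts / range size
        "probability density": [d / n for d in density],   # density / number of values
    }


# Conc A column of the example table (16 values).
conc_a = [2.05, 2.29, 0.84, 2.92, 1.87, 1.38, 0.43, 1.06,
          1.06, 1.14, 1.02, 1.33, 0.58, 2.56, 1.75, 0.30]

# Two hypothetical bins of width 1.5, chosen only to keep the output short.
heights = bar_heights(conc_a, [0.0, 1.5, 3.0])
for mode, bars in heights.items():
    print(mode, bars)
```

With bins of equal width, all five modes give bars of the same relative shape; only the scale of the Y axis changes.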

This option allows application of log transform on the Y axis. Linear, which is the normal scale, is set by default.

At the moment, users cannot change the number of bins (for the histogram) or the size of the bins. The number and size of bins follow Sturges’ rule, with a maximum of 50 bins.
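As a rough guide to how many bins to expect, Sturges' rule is commonly stated as k = ceil(log2(n)) + 1 for n data values. The sketch below applies that formula with the 50-bin cap mentioned above. The function names are hypothetical, and the exact rounding BioVinci uses is not stated here, so the ceiling is an assumption.

```python
import math


def sturges_bins(n, max_bins=50):
    """Number of bins by Sturges' rule, capped at max_bins."""
    if n < 1:
        raise ValueError("need at least one value")
    return min(math.ceil(math.log2(n)) + 1, max_bins)


def bin_width(values, max_bins=50):
    """Width of equal-sized bins spanning the data range."""
    k = sturges_bins(len(values), max_bins)
    return (max(values) - min(values)) / k


# The example table has 16 rows, so a column such as Conc A has 16 values.
print(sturges_bins(16))
```

Because the bin count grows with the logarithm of the sample size, the cap of 50 is only reached for extremely large columns.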

Pie charts present data as proportions, where the arc length of each slice of the chart is proportional to the quantity it represents. In this section, we will illustrate how to construct pie charts using the familiar example data below.

Sample | Tissue | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F |

Patient A | Liver | Benign | 2.05 | 1.59 | 1.63 | 24.73 | 3.54 | 1.66 |

Patient A | Pancreas | Benign | 2.29 | 0.92 | 1.95 | 21.21 | 8.55 | 0.91 |

Patient B | Liver | Benign | 0.84 | 1.08 | 0.13 | 20.06 | 8.68 | 1.34 |

Patient C | Kidney | Benign | 2.92 | 1.43 | 1.08 | 22.87 | 7.14 | 2.01 |

Patient C | Liver | Benign | 1.87 | 0.99 | 1.11 | 20.28 | 2.96 | 0.73 |

Patient C | Pancreas | Benign | 1.38 | 0.82 | 1.46 | 21.88 | 1.32 | 1.04 |

Patient D | Kidney | Benign | 0.43 | 1.75 | 0.06 | 21.2 | 2.93 | 2.06 |

Patient D | Pancreas | Benign | 1.06 | 0.79 | 0.23 | 24.88 | 2.66 | 0.78 |

Patient E | Kidney | Malignant | 1.06 | 1.41 | 0.38 | 20.21 | 6.52 | 0.52 |

Patient E | Liver | Malignant | 1.14 | 0.82 | 0.34 | 21.04 | 5.77 | 1.51 |

Patient F | Kidney | Malignant | 1.02 | 1.28 | 0.13 | 21.15 | 1.87 | 3.16 |

Patient G | Kidney | Malignant | 1.33 | 1.22 | 0.51 | 21.58 | 1.23 | 0.92 |

Patient G | Pancreas | Malignant | 0.58 | 0.7 | 0.41 | 24.39 | 9.37 | 1.07 |

Patient H | Kidney | Malignant | 2.56 | 1.82 | 0.89 | 20.81 | 1.91 | 0.33 |

Patient H | Liver | Malignant | 1.75 | 1.37 | 1.41 | 23.58 | 10.12 | 2.22 |

Patient H | Pancreas | Malignant | 0.3 | 0.44 | 1.23 | 21.98 | 10.83 | 0.08 |

This is the most common pie chart: a single circle with one or more colors. You can create this kind of plot in many ways.

To see the proportion of different tissues used in the experiments, we can drag the Tissue variable into the Value placeholder. The software will count the number of times each value appears in that variable to calculate its proportion.

You can use a numerical column as Value and a categorical column as Color to split this column into multiple numerical vectors. The software will add up the values based on the group assigned in Color.
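The two ways of building a pie chart described above differ only in how each slice's quantity is obtained: by counting (a categorical Value) or by summing (a numerical Value grouped by Color). The following sketch reproduces both calculations for the example table; the variable names are illustrative and this is not BioVinci's code.

```python
from collections import Counter, defaultdict

# Tissue, Phenotype, and Conc A columns of the example table.
tissue = ["Liver", "Pancreas", "Liver", "Kidney", "Liver", "Pancreas",
          "Kidney", "Pancreas", "Kidney", "Liver", "Kidney", "Kidney",
          "Pancreas", "Kidney", "Liver", "Pancreas"]
phenotype = ["Benign"] * 8 + ["Malignant"] * 8
conc_a = [2.05, 2.29, 0.84, 2.92, 1.87, 1.38, 0.43, 1.06,
          1.06, 1.14, 1.02, 1.33, 0.58, 2.56, 1.75, 0.30]

# Categorical Value: count how often each tissue appears.
counts = Counter(tissue)
total = sum(counts.values())
count_share = {name: 100 * c / total for name, c in counts.items()}

# Numerical Value with a categorical Color: sum Conc A within each phenotype.
sums = defaultdict(float)
for group, value in zip(phenotype, conc_a):
    sums[group] += value
grand_total = sum(sums.values())
sum_share = {group: 100 * s / grand_total for group, s in sums.items()}

print(count_share)
print(sum_share)
```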

You can create multiple pie plots at once by using the Split by placeholder with a categorical variable (column). The software will split the data to multiple pie charts using the grouping information in that new categorical variable.

You can choose among three options below to select what information should be presented on the plot.

This is the group’s name.

This is the count of each categorical value (Case 1) or the total of each group (Case 2).

This is the percentage of each group’s values.

The picture below shows a pie chart with all these three options activated.

All options in this setting have the same meanings as those of the Text info. Once selected, the information will appear when you hover your mouse over the graph.

This option determines whether the categories should be sorted based on the percentage (descending).

If you activate this option, the software will create a hole in the middle of the pie chart. The diameter of this hole is exactly one third of that of the pie.

A bar plot shows comparisons among discrete categories. One axis of the plot shows the specific categories being compared, and the other axis represents a measured value, which can be frequency, absolute value, percentage, mean, or median. We now explore different ways to create a bar plot in BioVinci using the familiar example data below.

Sample | Tissue | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F |

Patient A | Liver | Benign | 2.05 | 1.59 | 1.63 | 24.73 | 3.54 | 1.66 |

Patient A | Pancreas | Benign | 2.29 | 0.92 | 1.95 | 21.21 | 8.55 | 0.91 |

Patient B | Liver | Benign | 0.84 | 1.08 | 0.13 | 20.06 | 8.68 | 1.34 |

Patient C | Kidney | Benign | 2.92 | 1.43 | 1.08 | 22.87 | 7.14 | 2.01 |

Patient C | Liver | Benign | 1.87 | 0.99 | 1.11 | 20.28 | 2.96 | 0.73 |

Patient C | Pancreas | Benign | 1.38 | 0.82 | 1.46 | 21.88 | 1.32 | 1.04 |

Patient D | Kidney | Benign | 0.43 | 1.75 | 0.06 | 21.2 | 2.93 | 2.06 |

Patient D | Pancreas | Benign | 1.06 | 0.79 | 0.23 | 24.88 | 2.66 | 0.78 |

Patient E | Kidney | Malignant | 1.06 | 1.41 | 0.38 | 20.21 | 6.52 | 0.52 |

Patient E | Liver | Malignant | 1.14 | 0.82 | 0.34 | 21.04 | 5.77 | 1.51 |

Patient F | Kidney | Malignant | 1.02 | 1.28 | 0.13 | 21.15 | 1.87 | 3.16 |

Patient G | Kidney | Malignant | 1.33 | 1.22 | 0.51 | 21.58 | 1.23 | 0.92 |

Patient G | Pancreas | Malignant | 0.58 | 0.7 | 0.41 | 24.39 | 9.37 | 1.07 |

Patient H | Kidney | Malignant | 2.56 | 1.82 | 0.89 | 20.81 | 1.91 | 0.33 |

Patient H | Liver | Malignant | 1.75 | 1.37 | 1.41 | 23.58 | 10.12 | 2.22 |

Patient H | Pancreas | Malignant | 0.3 | 0.44 | 1.23 | 21.98 | 10.83 | 0.08 |

Let’s first explore how each patient is involved in different measurements of the experiment. By simply dragging the “Sample” variable into the X placeholder, we obtain the bar plot below.

If you use only one categorical column as the X, the software will calculate the frequency, or how many times a value appears in that column, and use this data for the Y.

We would now like to compare the mean concentrations of biomarkers A, B, and C across all the experiments. By simply dragging three variables (Conc A, Conc B, and Conc C) to the Y placeholder, we create a bar plot showing the mean concentrations of A, B, and C. Error bars can be added using the right-click options.

Another way to create a single-group bar plot is to use both X and Y. This is perhaps the most common case. Users pick a numerical column for Y and a categorical column for X. The software will calculate the mean or median (depending on the users’ options) for each group defined by X.

The grouped bar plot has an additional group factor besides the one on the X axis. It is sometimes called a two-way bar plot and is helpful for showing two-way comparisons. You can construct this kind of plot with either of the two methods below.

Use a categorical column as X, a numerical column as Y, and another categorical column as Color.

Another way is to use a categorical column as X and multiple numerical columns as Y. The software will classify the values of these numerical columns (in Y) into the categories defined in X and build the bars for each category.

You can create a split bar plot by using an additional categorical column for the Split by placeholder. The software will split the data first, then generate each bar plot separately.

This option determines how the software calculates the bar’s height. Users can choose mean, median, or sum (default).

This option controls what the error bar represents: the standard deviation (SD) or the standard error of the mean (SEM). This option only has a visible effect when there are replicates in each category on the X axis.

Above is a grouped bar plot with Mean and SD.
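The difference between the two error bar types can be checked by hand. The sketch below computes the mean, SD, and SEM of Conc A for each Phenotype group in the example table. It assumes the sample standard deviation (n - 1 denominator), which may or may not match what BioVinci uses, and the variable names are illustrative.

```python
import math
import statistics
from collections import defaultdict

# Phenotype and Conc A columns of the example table.
phenotype = ["Benign"] * 8 + ["Malignant"] * 8
conc_a = [2.05, 2.29, 0.84, 2.92, 1.87, 1.38, 0.43, 1.06,
          1.06, 1.14, 1.02, 1.33, 0.58, 2.56, 1.75, 0.30]

# Collect the replicates that fall into each category on the X axis.
groups = defaultdict(list)
for group, value in zip(phenotype, conc_a):
    groups[group].append(value)

summary = {}
for group, values in groups.items():
    sd = statistics.stdev(values)  # sample SD; the n - 1 denominator is an assumption
    summary[group] = {
        "mean": statistics.mean(values),   # bar height when the summary is the mean
        "sd": sd,                          # error bar length for the SD option
        "sem": sd / math.sqrt(len(values)),  # error bar length for the SEM option
    }

for group, stats in summary.items():
    print(group, stats)
```

Since the SEM is the SD divided by the square root of the number of replicates, SEM error bars are always shorter than SD error bars when a category has more than one replicate.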

This option determines the orientation of the bars. If it is horizontal, the software basically swaps between X and Y axes.

This option determines the position of bars in a Grouped bar plot.

Bars are placed next to each other normally as shown in the previous pictures in this section.

Bars are placed on top of each other.

This option determines the shape of the error bar, whether it has two arms or only an upper arm.

This option shows/hides the value on each bar.

Users choose whether the software should apply log transform for the bars’ heights.

According to Wikipedia, a line chart or line plot is a type of chart that displays information as a series of data points, called 'markers,' connected by straight line segments. A line plot is similar to a scatter plot except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments. A line chart is often used to visualize a trend in data over intervals of time – a time series – thus the line is often drawn chronologically. This section will show you how to create different types of line plots. All the pictures use the example data below.

Time | Phenotype | Conc A | Conc B | Conc C |

0 | Benign | 0.24 | 0.85 | 1.54 |

0 | Benign | 0.3 | 0.71 | 1.06 |

0 | Malignant | 1.39 | 0.86 | 0.89 |

0 | Malignant | 0.82 | 0.99 | 0.58 |

6 | Benign | 0.73 | 1.46 | 1.39 |

6 | Benign | 1.08 | 0.47 | 1.26 |

6 | Malignant | 0.83 | 1.04 | 1.34 |

6 | Malignant | 1.11 | 0.77 | 0.51 |

12 | Benign | 1.08 | 1.17 | 2.48 |

12 | Benign | 1.11 | 1.96 | 1.98 |

12 | Malignant | 1.75 | 1.24 | 1.45 |

12 | Malignant | 2.25 | 0.29 | 1.94 |

18 | Benign | 0.85 | 0.63 | 1.42 |

18 | Benign | 1.8 | 1.69 | 1.58 |

18 | Malignant | 1.94 | 2.57 | 1.01 |

18 | Malignant | 2.39 | 2.63 | 1.99 |

24 | Benign | 2.17 | 0.83 | 3.06 |

24 | Benign | 2.61 | 1.44 | 3.21 |

24 | Malignant | 2.09 | 1.05 | 1.42 |

24 | Malignant | 1.62 | 2.86 | 2.82 |

To see how Conc A changes over time, users may drag Time to X and Conc A to Y, resulting in the following graph.

Users can put a numerical column in Y and a column (regardless of its type) in X. The software will draw a line that simply connects all the data points in increasing order of the X coordinate.

But this is clearly not the plot that you want. As your data contain replicates for each time point, you would like to see how the mean or median of Conc A changes over time. A more appropriate plot looks like the one below.

In this case, you should change the X column to categorical. BioVinci will then calculate the mean/median at each value of X and draw a line that goes through each of them. As X is now categorical, the ordering for connecting all the points of the plot will be inferred from their order in the column.
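The summary step for a categorical X can be sketched as follows: values of Conc A are grouped by Time, and one mean (or median) is computed per group. The groups keep the order in which they first appear in the column, as described above. The variable names are illustrative and this is not BioVinci's code.

```python
import statistics

# Time (treated as categorical, hence strings) and Conc A columns of the example table.
time = ["0"] * 4 + ["6"] * 4 + ["12"] * 4 + ["18"] * 4 + ["24"] * 4
conc_a = [0.24, 0.30, 1.39, 0.82,
          0.73, 1.08, 0.83, 1.11,
          1.08, 1.11, 1.75, 2.25,
          0.85, 1.80, 1.94, 2.39,
          2.17, 2.61, 2.09, 1.62]

# A dict preserves insertion order, so groups follow their order in the column.
by_time = {}
for t, value in zip(time, conc_a):
    by_time.setdefault(t, []).append(value)

means = {t: statistics.mean(values) for t, values in by_time.items()}
medians = {t: statistics.median(values) for t, values in by_time.items()}

print(means)
print(medians)
```

Keeping the column order matters here: sorting the categorical labels alphabetically would place "12" before "6" and scramble the time course.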

If users provide just a numerical column to the Y placeholder, the software will use the indices as X.

Users can pick multiple numerical columns for Y to create multiple lines for the line plot. The X axis will contain the indices.

Users can choose a numerical column for Y, any kind of column for X, and a categorical column for Color.

The type of the X column is very important, as described in the Standard line plot.

To create a split line plot, use an additional categorical column for the Split by placeholder. The software will split the data first, then generate each line plot separately.

This option determines how the software calculates Y values if X is categorical. Users can switch between mean and median. The suffixes _se and _sd stand for standard error and standard deviation, respectively. Changing these options tells the software to add the error bar on the plot using the provided information.

This option controls how to visualize the error bar, whether it should have one or two arms, with or without points. You may not see the error bar change even when you have switched this option because the Summary method is set to be mean by default, which provides no information for the error bar. You can only see this option work when the Summary method is switched to an option with a suffix of _sd or _se.

The figure above illustrates a Case 2 multiple line plot with mean_sd and lower_pointrange.

You can use scattered points (“jitter”) or a box (“boxplot”) to make your line plot more informative.

With these options (“line” and “line + point”), users can choose to show points at every value on X.

This option determines whether the line plot should use spline interpolations instead of straight lines.

Users choose whether the software should apply one of several log transforms for the lines.

According to Wikipedia, a scatter plot (also called a scatterplot, scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two or three variables for a set of data. This section shows how to create different types of scatter plots using the data below.

Time | Phenotype | Conc A | Conc B | Conc C |

0 | Benign | 0.24 | 0.85 | 1.54 |

0 | Benign | 0.3 | 0.71 | 1.06 |

0 | Malignant | 1.39 | 0.86 | 0.89 |

0 | Malignant | 0.82 | 0.99 | 0.58 |

6 | Benign | 0.73 | 1.46 | 1.39 |

6 | Benign | 1.08 | 0.47 | 1.26 |

6 | Malignant | 0.83 | 1.04 | 1.34 |

6 | Malignant | 1.11 | 0.77 | 0.51 |

12 | Benign | 1.08 | 1.17 | 2.48 |

12 | Benign | 1.11 | 1.96 | 1.98 |

12 | Malignant | 1.75 | 1.24 | 1.45 |

12 | Malignant | 2.25 | 0.29 | 1.94 |

18 | Benign | 0.85 | 0.63 | 1.42 |

18 | Benign | 1.8 | 1.69 | 1.58 |

18 | Malignant | 1.94 | 2.57 | 1.01 |

18 | Malignant | 2.39 | 2.63 | 1.99 |

24 | Benign | 2.17 | 0.83 | 3.06 |

24 | Benign | 2.61 | 1.44 | 3.21 |

24 | Malignant | 2.09 | 1.05 | 1.42 |

24 | Malignant | 1.62 | 2.86 | 2.82 |

This is the basic scatter plot. All you need is X and Y. Any kind of columns will work.

This is the 3D scatter plot. Similar to Case 1, you just need to add one more column to Z. However, the X, Y, and Z data all have to be numerical.

Users can add other grouping factors to the scatter plot by using Color or Size. The two options require a categorical column and a numerical column, respectively. They work on both 2D and 3D scatter plots.

To create a split scatter plot, you can drag an additional categorical column to the Split by placeholder. The software will split the data first, then generate each scatter plot separately. This function only works for a 2D scatter plot. (In this example, we split the data with the “Time” column. On the left side of the screen, under “Source”, we changed it to alpha A-Z data so that it could be dragged over to “Split by”.)

Tooltip controls the information that pops up when users hover the mouse over an item. By default, it displays the information of the columns that contribute to the plot. Users can make it more informative by dragging any column to this placeholder.

This option performs a regression on the plot based on the data points. If there is a group factor at Color, the software will generate a regression for each group. Regression is currently only available for 2D scatter plots. In the picture above, we applied linear regression on a scatter plot of Conc A and Conc B, with Phenotype as Color.
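To make the per-group behavior concrete, the sketch below fits an ordinary least-squares line of Conc B on Conc A separately for each Phenotype in the example table. It is an illustration under stated assumptions, not BioVinci's implementation: the helper `fit_line` is hypothetical, and ordinary least squares is assumed as the linear regression method.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# (Phenotype, Conc A, Conc B) rows of the example table.
rows = [
    ("Benign", 0.24, 0.85), ("Benign", 0.30, 0.71),
    ("Malignant", 1.39, 0.86), ("Malignant", 0.82, 0.99),
    ("Benign", 0.73, 1.46), ("Benign", 1.08, 0.47),
    ("Malignant", 0.83, 1.04), ("Malignant", 1.11, 0.77),
    ("Benign", 1.08, 1.17), ("Benign", 1.11, 1.96),
    ("Malignant", 1.75, 1.24), ("Malignant", 2.25, 0.29),
    ("Benign", 0.85, 0.63), ("Benign", 1.80, 1.69),
    ("Malignant", 1.94, 2.57), ("Malignant", 2.39, 2.63),
    ("Benign", 2.17, 0.83), ("Benign", 2.61, 1.44),
    ("Malignant", 2.09, 1.05), ("Malignant", 1.62, 2.86),
]

# One regression per group defined by Color (here, Phenotype).
points = defaultdict(lambda: ([], []))
for group, a, b in rows:
    points[group][0].append(a)
    points[group][1].append(b)

fits = {group: fit_line(xs, ys) for group, (xs, ys) in points.items()}
for group, (slope, intercept) in fits.items():
    print(f"{group}: Conc B = {slope:.3f} * Conc A + {intercept:.3f}")
```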

Users can also add a 95% confidence ellipse based on the data points using the stat_ellipse function in ggplot2. If there is a group factor at Color, the software will generate an ellipse for each group. The 95% confidence ellipse is only available for 2D scatter plots. In the picture above, we applied the t-distribution on a scatter plot of Conc A and Conc B, with Phenotype as Color.

This option determines whether the regression line should have two 95% confidence interval splines. This option only has a visible effect when there is a regression line on the plot.

Users choose whether the software should apply one of several log transforms for the points.

A violin plot is a common alternative to the box plot. Instead of showing the data quartiles with a whisker, it reveals a full distribution with two kernel density curves on both sides. This unique feature forms the violin-like shape and makes this plot much more informative. In this section, we will show you how to create different types of violin plots using the example data below:

Sample | Tissue | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F |

Patient A | Liver | Benign | 2.05 | 1.59 | 1.63 | 24.73 | 3.54 | 1.66 |

Patient A | Pancreas | Benign | 2.29 | 0.92 | 1.95 | 21.21 | 8.55 | 0.91 |

Patient B | Liver | Benign | 0.84 | 1.08 | 0.13 | 20.06 | 8.68 | 1.34 |

Patient C | Kidney | Benign | 2.92 | 1.43 | 1.08 | 22.87 | 7.14 | 2.01 |

Patient C | Liver | Benign | 1.87 | 0.99 | 1.11 | 20.28 | 2.96 | 0.73 |

Patient C | Pancreas | Benign | 1.38 | 0.82 | 1.46 | 21.88 | 1.32 | 1.04 |

Patient D | Kidney | Benign | 0.43 | 1.75 | 0.06 | 21.2 | 2.93 | 2.06 |

Patient D | Pancreas | Benign | 1.06 | 0.79 | 0.23 | 24.88 | 2.66 | 0.78 |

Patient E | Kidney | Malignant | 1.06 | 1.41 | 0.38 | 20.21 | 6.52 | 0.52 |

Patient E | Liver | Malignant | 1.14 | 0.82 | 0.34 | 21.04 | 5.77 | 1.51 |

Patient F | Kidney | Malignant | 1.02 | 1.28 | 0.13 | 21.15 | 1.87 | 3.16 |

Patient G | Kidney | Malignant | 1.33 | 1.22 | 0.51 | 21.58 | 1.23 | 0.92 |

Patient G | Pancreas | Malignant | 0.58 | 0.7 | 0.41 | 24.39 | 9.37 | 1.07 |

Patient H | Kidney | Malignant | 2.56 | 1.82 | 0.89 | 20.81 | 1.91 | 0.33 |

Patient H | Liver | Malignant | 1.75 | 1.37 | 1.41 | 23.58 | 10.12 | 2.22 |

Patient H | Pancreas | Malignant | 0.3 | 0.44 | 1.23 | 21.98 | 10.83 | 0.08 |

This violin plot has a categorical X axis, a numerical Y axis, and no other grouping factor(s).

This kind of violin plot requires one or more numerical columns for the Y placeholder. Each column in Y will generate one violin on the plot.

You can also use a categorical column as X and a numerical column as Y. Each group defined in X will create a violin using the corresponding values in Y.

This violin has an additional grouping factor, which determines the color of the violin plot.

You can create this plot by using a categorical column for X and multiple numerical columns for Y.

You can use a categorical column for X, a numerical column for Y, and a categorical column for Color. For each group defined by X, the software will create subgroups defined by Color.

You can create a split violin plot by using an additional categorical column for the Split by placeholder. The software will split the data first, then generate each violin plot separately.

This option determines the orientation of the violin. It basically swaps the X and Y axes.

This option allows users to show or hide all the data points or just outliers.

This option controls positioning the data points, whether they should lie right on the violin or next to it. You will not see any changes unless you activate the Points option.

Scale mode controls how to calculate the widths of the violins.

The width of each violin depends on the number of data points.

All the violins within the same subgroup (not group) will have the same area.

Activate this option to add a box to each violin.

Activate this option to fuse two violins into one. It can only work with a Grouped violin plot and with only two subgroups.

This option allows you to apply one of the log transforms for the Y values before generating the violin plot.

A box plot is a method for graphically depicting groups of numerical data through their quartiles. It reveals more information about the distribution than the bar plot by showing the quartiles. This section illustrates how to create different kinds of box plots using the following data.

Sample | Tissue | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F |

Patient A | Liver | Benign | 2.05 | 1.59 | 1.63 | 24.73 | 3.54 | 1.66 |

Patient A | Pancreas | Benign | 2.29 | 0.92 | 1.95 | 21.21 | 8.55 | 0.91 |

Patient B | Liver | Benign | 0.84 | 1.08 | 0.13 | 20.06 | 8.68 | 1.34 |

Patient C | Kidney | Benign | 2.92 | 1.43 | 1.08 | 22.87 | 7.14 | 2.01 |

Patient C | Liver | Benign | 1.87 | 0.99 | 1.11 | 20.28 | 2.96 | 0.73 |

Patient C | Pancreas | Benign | 1.38 | 0.82 | 1.46 | 21.88 | 1.32 | 1.04 |

Patient D | Kidney | Benign | 0.43 | 1.75 | 0.06 | 21.2 | 2.93 | 2.06 |

Patient D | Pancreas | Benign | 1.06 | 0.79 | 0.23 | 24.88 | 2.66 | 0.78 |

Patient E | Kidney | Malignant | 1.06 | 1.41 | 0.38 | 20.21 | 6.52 | 0.52 |

Patient E | Liver | Malignant | 1.14 | 0.82 | 0.34 | 21.04 | 5.77 | 1.51 |

Patient F | Kidney | Malignant | 1.02 | 1.28 | 0.13 | 21.15 | 1.87 | 3.16 |

Patient G | Kidney | Malignant | 1.33 | 1.22 | 0.51 | 21.58 | 1.23 | 0.92 |

Patient G | Pancreas | Malignant | 0.58 | 0.7 | 0.41 | 24.39 | 9.37 | 1.07 |

Patient H | Kidney | Malignant | 2.56 | 1.82 | 0.89 | 20.81 | 1.91 | 0.33 |

Patient H | Liver | Malignant | 1.75 | 1.37 | 1.41 | 23.58 | 10.12 | 2.22 |

Patient H | Pancreas | Malignant | 0.3 | 0.44 | 1.23 | 21.98 | 10.83 | 0.08 |

You can create boxplots by putting one or more numerical columns into the Y placeholder. Each column in Y generates one boxplot.

You can also use a categorical column as X and a numerical column as Y. Each group in X will create a box using the corresponding values in Y.

This box plot has an additional grouping factor, which determines the color of the box.

You can create this plot by using a categorical column for X and multiple numerical columns for Y.

You can use a categorical column for X, a numerical column for Y, and a categorical column for Color. For each group defined by X, the software will create subgroups defined by the Color.

You can create a split box plot by using an additional categorical column for the Split by placeholder. The software will split the data first, then generate each box plot separately (the Theme will automatically be changed to “Classic”).

This option determines the orientation of the boxes. It basically swaps the X and Y axes.

Here you can choose to show or hide all the data points, or just the outliers.

This option controls the position of the data points, whether they should lie right on the box or next to it. You will not see any changes unless you activate the Points option.

This option adds annotations that show the mean and SD of each box. The picture above shows what these annotations look like in action.

This option allows applying one of the log transforms for the Y values before generating the box plot.

A Venn diagram is a powerful visualization tool that describes the relationship among a finite number of sets and their intersections. The size of the circles in this diagram approximately represents the cardinality of those sets. This section shows you the way to create a Venn diagram. All the pictures come from the example data below.

Patient ID | Group | Gender | Age group | Respond 1 | Respond 2 | Respond 3 |

1 | Alzheimer | Male | Adult | 2 | 3 | 9 |

2 | Parkinson | Male | Adult | 3 | 6 | 6 |

3 | Multiple sclerosis | Female | Teenager | 3 | 7 | 6 |

4 | Alzheimer | Male | Elder | 3 | 4 | 8 |

5 | Parkinson | Male | Adult | 3 | 5 | 9 |

6 | Multiple sclerosis | Female | Adult | 2 | 5 | 6 |

7 | Multiple sclerosis | Male | Adult | 5 | 4 | 8 |

8 | Multiple sclerosis | Female | Teenager | 5 | 6 | 8 |

9 | Alzheimer | Female | Elder | 3 | 3 | 5 |

10 | Alzheimer | Female | Elder | 3 | 3 | 9 |

11 | Parkinson | Male | Elder | 2 | 6 | 6 |

You can input the data in two ways.

You can provide a column of any kind to the Value placeholder and a categorical column to the Color placeholder. The software calculates intersections among all the groups and generates the Venn diagram.

You can drag multiple columns of the same kind to the Value placeholder. In this case, each column will form a group in the Venn diagram.

Venn diagrams can belong to one of the following types, depending on the intersection of the input sets.

This is the Venn diagram shown in the previous pictures. Each group contains items that are in common with at least one other group.

If the sets have no intersection, the software will generate separate circles. In the picture below, the two groups, male and female, have nothing in common.

If two sets contain exactly the same items, two circles will overlap each other. In the picture below, we have two equal sets, male and female.

If one set includes the other(s), the circle(s) of the subset(s) will lie completely inside that of the superset. In the picture below, Multiple sclerosis contains two equal sets, Alzheimer and Parkinson.

This command determines what information to show on the Venn diagram, including the group name (Set label), group size (Set size), and the intersection size (Intersection size). The picture below has all three options activated.

By default, BioVinci will create for each set a subset that has no two items with the same value, then construct the Venn diagram. Thus the intersection will not contain any repetitive values.

For example, given two sets A = {1, 2, 2, 2, 3} and B = {2, 2, 3, 3}, the software will create A’ = {1, 2, 3} and B’ = {2, 3}, and the intersection will be {2, 3}.

Users can turn off this behavior by turning on the Multiset option. Given two multisets, A = {1, 2, 2, 2, 3} and B = {2, 2, 3, 3}, the intersection will be {2, 2, 3}.
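The two intersection behaviors can be reproduced with Python's built-in `set` and `collections.Counter`, using the same A and B as in the example. This is only a sketch of the arithmetic, not BioVinci's code: a multiset intersection keeps each value as many times as it appears in both inputs.

```python
from collections import Counter

a = [1, 2, 2, 2, 3]
b = [2, 2, 3, 3]

# Default behavior: duplicates are dropped before intersecting.
set_intersection = sorted(set(a) & set(b))

# Multiset behavior: each value is kept min(count in A, count in B) times.
multiset_intersection = sorted((Counter(a) & Counter(b)).elements())

print(set_intersection)
print(multiset_intersection)
```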

This option defines whether empty cells should be considered a value.

1. Prepare your datasheet

Sample | Tissue | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F |

Patient A | Liver | Benign | 2.05 | 1.59 | 1.63 | 24.73 | 3.54 | 1.66 |

Patient A | Pancreas | Benign | 2.29 | 0.92 | 21.21 | 8.55 | 0.91 | |

Patient B | Liver | Benign | 0.84 | 1.08 | 20.06 | 8.68 | 1.34 | |

Patient C | Kidney | Benign | 1.43 | 1.08 | 22.87 | 7.14 | 2.01 | |

Patient C | Liver | Benign | 0.99 | 1.11 | 20.28 | 2.96 | 0.73 | |

Patient C | Pancreas | Benign | 1.38 | 0.82 | 1.46 | 21.88 | 1.32 | 1.04 |

Patient D | Kidney | Benign | 0.43 | 1.75 | 0.06 | 21.2 | 2.93 | 2.06 |

Patient D | Pancreas | Benign | 1.06 | 0.23 | 24.88 | 2.66 | 0.78 | |

Patient E | Kidney | Malignant | 1.06 | 1.41 | 0.38 | 20.21 | 6.52 | 0.52 |

Patient E | Liver | Malignant | 1.14 | 0.82 | 0.34 | 21.04 | 5.77 | 1.51 |

Patient F | Kidney | Malignant | 1.02 | 1.28 | 0.13 | 21.15 | 1.87 | 3.16 |

Patient G | Kidney | Malignant | 1.33 | 1.22 | 0.51 | 1.23 | 0.92 | |

Patient G | Pancreas | Malignant | 0.58 | 0.41 | 24.39 | 9.37 | 1.07 | |

Patient H | Kidney | Malignant | 2.56 | 1.82 | 0.89 | 20.81 | 1.91 | 0.33 |

Patient H | Liver | Malignant | 1.75 | 1.37 | 1.41 | 23.58 | 10.12 | |

Patient H | Pancreas | Malignant | 0.3 | 0.44 | 1.23 | 21.98 | 0.08 |

2. Import it

3. Choose the Analysis tab and select the Basic statistics function

4. Drag your data to the placeholder, then hit Run

5. Get the results

The Basic statistics function can process any kind of input: categorical columns, numerical columns, or matrices. Drag one or more columns, or a matrix that you want to investigate, to the A column/matrix placeholder.

Transpose is the only option available for Basic statistics. It transposes your matrix before the analysis is run, so it only affects matrix input.

The types of your inputs determine how the results will be presented. If you input a matrix, the function will be applied to each column of the matrix.

If you input a numerical column, BioVinci will display the distribution of your data values in a histogram.

If you input a categorical column, you will see the frequency of your categorical data values presented in a bar plot.

If your input is a matrix, each column will form a separate subplot (as shown in the figure above) that can be either a histogram or a bar plot, depending on the type of that column. This plot only appears when the input is a matrix with fewer than 30 columns.

This heatmap displays the types of data value in each cell of your datasheet (blue for categorical data, red for numerical data, and black for missing values).

This table presents all the descriptive statistics for each input numerical column.

This table shows the frequency of each categorical data value, including the missing value. Therefore, it only appears if the input has at least one categorical column.

When the input is a matrix, BioVinci will generate either a descriptive statistics table or a frequency table for each column, depending on the type of the column.
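The per-column logic described above (descriptive statistics for numerical columns, a frequency table for categorical ones) can be sketched as follows. This is an illustration only: the `summarize` helper, the choice of statistics, and the use of `None` for an empty cell are assumptions of this sketch, not BioVinci's actual output format.

```python
import statistics
from collections import Counter

def summarize(column):
    """Summarize one column; None stands for an empty cell (an assumption of this sketch)."""
    present = [v for v in column if v is not None]
    missing = len(column) - len(present)
    if present and all(isinstance(v, (int, float)) for v in present):
        return {
            "type": "numerical",
            "count": len(present),
            "missing": missing,
            "mean": statistics.mean(present),
            "median": statistics.median(present),
            "stdev": statistics.stdev(present) if len(present) > 1 else None,
            "min": min(present),
            "max": max(present),
        }
    # Anything that is not purely numerical is treated as categorical.
    frequency = dict(Counter(present))
    if missing:
        frequency["(missing)"] = missing
    return {"type": "categorical", "frequency": frequency}

# Tissue column of the example datasheet.
tissue = ["Liver", "Pancreas", "Liver", "Kidney", "Liver", "Pancreas", "Kidney", "Pancreas",
          "Kidney", "Liver", "Kidney", "Kidney", "Pancreas", "Kidney", "Liver", "Pancreas"]
values = [2.05, 2.29, None, 0.84]  # illustrative values with one empty cell

print(summarize(tissue))
print(summarize(values))
```

Applying `summarize` to every column of a matrix gives one result per column, which mirrors how the function treats matrix input.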

In statistics, a normality test helps analysts evaluate whether a dataset is well modeled by a normal distribution. It is a prerequisite for many parametric tests that assume normally distributed data.

1. Here are the example data

Group | Value |

A | 0.28 |

A | 0.02 |

A | 0.28 |

A | 0.85 |

A | 0.61 |

A | 0.98 |

A | 0.72 |

A | 1.3 |

A | 0.29 |

A | 0.92 |

B | 0.32 |

B | 0.69 |

B | 0.51 |

B | 0.9 |

B | 0.75 |

B | 0.69 |

B | 0.57 |

B | 0.8 |

B | 0.45 |

B | 0.64 |

2. Import your data to BioVinci

3. Choose the Analysis tab and opt for Normality test

4. Drag your data to the Value placeholder

5. Get the result

Below are all the valid data structures for a normality test in BioVinci.

Group | Value |

A | 0.28 |

A | 0.02 |

A | 0.28 |

A | 0.85 |

A | 0.61 |

A | 0.98 |

A | 0.72 |

A | 1.3 |

A | 0.29 |

A | 0.92 |

B | 0.32 |

B | 0.69 |

B | 0.51 |

B | 0.9 |

B | 0.75 |

B | 0.69 |

B | 0.57 |

B | 0.8 |

B | 0.45 |

B | 0.64 |

This is the standard format with only one numerical column. The categorical column is optional; you only need it if you want to run a separate normality test for each group.

If you want to perform the normality test on an entire matrix, you must put all values into a single column.

ID | 1 | 2 | 3 | 4 |

1 | 0.04 | 0.28 | 0.43 | 0.95 |

2 | 0.44 | 0.85 | 0.17 | 0.14 |

3 | 0.84 | 0.96 | 0.92 | 0.84 |

4 | 0.05 | 0.69 | 0.64 | 0.12 |

5 | 0.84 | 0.7 | 0.88 | 0.02 |

With this data structure, you must set the number of replications to 1 when you import your data.

You also have to include the ID column to run the normality test (though the column does not convey much meaningful information).

After that, you can run the normality test in the same way as the structure mentioned in Type 1.

You can use the same approach for a dataset with multiple samples, as long as the samples have no replications.

You have a dataset with multiple samples, and each of them has replications.

ID | Sample 1 | Sample 2 | Sample 3 | |||

1 | 0.87 | 0.7 | 0.47 | 0.4 | 0.11 | 0.05 |

2 | 0.03 | 0.84 | 0.91 | 0.26 | 0.48 | 0.23 |

3 | 0.79 | 0.32 | 0.82 | 0.25 | 0.96 | 0.4 |

4 | 0.52 | 0.08 | 0.75 | 0.47 | 0.71 | 0.43 |

5 | 0.48 | 0.62 | 0.91 | 0.47 | 0.29 | 0.07 |

With this data structure, you must provide the number of replications when you import the data (which is 2 in this example). You also have to include the ID column to run the normality test (though the column does not convey much meaningful information). After that, you can run the normality test in the same way as the structure mentioned in Type 1.

If you want to perform the normality test for each sample, drag the Factor 2 column to the Group placeholder. (This column is automatically generated by BioVinci after importing your data and contains your sample names.)

There are no parameters for this function. However, you should know that the software performs 4 types of normality tests per run.

This test expects that all observations are independent and have an equal chance of being selected from a fixed distribution/population. The sample size should be larger than 1000.

This test assesses whether two underlying one-dimensional probability distributions differ. In this case, the first distribution comes from the user’s input (after standardization), and the second one is the normal distribution. It is a nonparametric test and is sensitive to differences in both the location and the shape of the empirical cumulative distribution functions of the two samples.
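The comparison described here (standardized input against the normal distribution) corresponds to the Kolmogorov-Smirnov statistic: the largest gap between the empirical cumulative distribution function and the normal one. The sketch below computes only that statistic, not a p value. Standardizing with the sample mean and sample standard deviation is an assumption of this sketch; the software may do this differently.

```python
import math
import statistics

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ks_statistic_vs_normal(values):
    """Largest distance between the empirical CDF of the standardized data and the normal CDF."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    z = sorted((v - mean) / sd for v in values)
    n = len(z)
    d = 0.0
    for i, zi in enumerate(z):
        cdf = normal_cdf(zi)
        # The empirical CDF jumps from i/n to (i+1)/n at zi; check both sides of the step.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

group_a = [0.28, 0.02, 0.28, 0.85, 0.61, 0.98, 0.72, 1.3, 0.29, 0.92]
print(ks_statistic_vs_normal(group_a))
```

Because the data are standardized first, shifting or rescaling the input does not change the statistic.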

The test assesses whether the input data set has the skewness and kurtosis matching a normal distribution. It is suitable for samples that have more than 1000 observations.
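For reference, skewness and kurtosis can be computed from the central moments of the data. The sketch below uses the simple moment-based definitions and reports excess kurtosis (0 for a normal distribution); it does not reproduce the test statistic or p value, and the exact estimators used by the software are not documented here.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis (both are 0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment (variance)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis

group_b = [0.32, 0.69, 0.51, 0.9, 0.75, 0.69, 0.57, 0.8, 0.45, 0.64]
print(skewness_and_excess_kurtosis(group_b))
```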

This is a nonparametric test of normality. It is more powerful than the Kolmogorov-Smirnov test but much less accurate if there are ties in the data set. The bigger the sample size, the higher the chance for a false positive result. Thus, the software will skip this test if the sample size is larger than 5000.

Take the following data for example.

Group | Value |

A | 0.28 |

A | 0.02 |

A | 0.28 |

A | 0.85 |

A | 0.61 |

A | 0.98 |

A | 0.72 |

A | 1.3 |

A | 0.29 |

A | 0.92 |

B | 0.32 |

B | 0.69 |

B | 0.51 |

B | 0.9 |

B | 0.75 |

B | 0.69 |

B | 0.57 |

B | 0.8 |

B | 0.45 |

B | 0.64 |

The histogram shows the distribution of the input data set and a simulated normal distribution that has the same mean and standard deviation. If Group is provided, the histogram will show the distribution of all the groups instead.

It shows the empirical cumulative distribution function of the input data set and the simulated normal distribution. If Group is provided, it will show the cumulative distribution of all the groups instead.

The table shows the statistics from four different normality tests. The annotation depends on the p value: * for p < 0.05, ** for p < 0.01, *** for p < 0.001, and **** for p < 0.0001. Some journals may have a different requirement for p value annotation. Thus, users can customize the number of asterisks in the edit mode. If Group is provided, the software will generate the table for each group.
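The default annotation rule above can be written as a small lookup. The function name `p_value_stars` is hypothetical and only illustrates the thresholds listed in this section.

```python
def p_value_stars(p):
    """Return the default asterisk annotation for a p value."""
    # Check the strictest threshold first so the largest number of stars wins.
    for threshold, stars in ((0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*")):
        if p < threshold:
            return stars
    return ""  # not significant at the 0.05 level

for p in (0.2, 0.03, 0.005, 0.0005, 0.00005):
    print(p, p_value_stars(p))
```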

The table shows the information that helps construct the cumulative distribution function plot.

1. This is the dataset

Conc A | Conc B | Conc C | Conc D | Conc E | Conc F |

2.05 | 1.59 | 1.63 | 24.73 | 3.54 | 1.66 |

2.29 | 0.92 | 1.95 | 21.21 | 8.55 | 0.91 |

0.84 | 1.08 | 0.13 | 20.06 | 8.68 | 1.34 |

2.92 | 1.43 | 1.08 | 22.87 | 7.14 | 2.01 |

1.87 | 0.99 | 1.11 | 20.28 | 2.96 | 0.73 |

1.38 | 0.82 | 1.46 | 21.88 | 1.32 | 1.04 |

0.43 | 1.75 | 0.06 | 21.2 | 2.93 | 2.06 |

1.06 | 0.79 | 0.23 | 24.88 | 2.66 | 0.78 |

1.06 | 1.41 | 0.38 | 20.21 | 6.52 | 0.52 |

1.14 | 0.82 | 0.34 | 21.04 | 5.77 | 1.51 |

1.02 | 1.28 | 0.13 | 21.15 | 1.87 | 3.16 |

1.33 | 1.22 | 0.51 | 21.58 | 1.23 | 0.92 |

0.58 | 0.7 | 0.41 | 24.39 | 9.37 | 1.07 |

2.56 | 1.82 | 0.89 | 20.81 | 1.91 | 0.33 |

1.75 | 1.37 | 1.41 | 23.58 | 10.12 | 2.22 |

0.3 | 0.44 | 1.23 | 21.98 | 10.83 | 0.08 |

2. Import it

3. Choose the Analysis tab and opt for Correlation

4. Provide the inputs

5. Get the result

This function only accepts the data structure as follows, where each sample is presented in a separate column.

Conc A | Conc B | Conc C | Conc D | Conc E | Conc F |

2.05 | 1.59 | 1.63 | 24.73 | 3.54 | 1.66 |

2.29 | 0.92 | 1.95 | 21.21 | 8.55 | 0.91 |

0.84 | 1.08 | 0.13 | 20.06 | 8.68 | 1.34 |

2.92 | 1.43 | 1.08 | 22.87 | 7.14 | 2.01 |

1.87 | 0.99 | 1.11 | 20.28 | 2.96 | 0.73 |

1.38 | 0.82 | 1.46 | 21.88 | 1.32 | 1.04 |

0.43 | 1.75 | 0.06 | 21.2 | 2.93 | 2.06 |

1.06 | 0.79 | 0.23 | 24.88 | 2.66 | 0.78 |

1.06 | 1.41 | 0.38 | 20.21 | 6.52 | 0.52 |

1.14 | 0.82 | 0.34 | 21.04 | 5.77 | 1.51 |

1.02 | 1.28 | 0.13 | 21.15 | 1.87 | 3.16 |

1.33 | 1.22 | 0.51 | 21.58 | 1.23 | 0.92 |

0.58 | 0.7 | 0.41 | 24.39 | 9.37 | 1.07 |

2.56 | 1.82 | 0.89 | 20.81 | 1.91 | 0.33 |

1.75 | 1.37 | 1.41 | 23.58 | 10.12 | 2.22 |

0.3 | 0.44 | 1.23 | 21.98 | 10.83 | 0.08 |

To calculate the correlation between two variables, drag them to the Variable(s) and 2nd variable (optional) placeholders. In the figure below, the user calculates the correlation between Conc A and Conc B.

To calculate the correlation between more than two variables (pairwise), you can drag the whole matrix to the Variable(s) placeholder. In this case, BioVinci will automatically exclude all categorical columns prior to calculation.

This defines the method to calculate the correlation.

The method assumes that both variables are normally distributed.

The correlation depends on the difference between the number of concordant pairs and discordant pairs without any assumptions about the distribution.

The correlation depends on the difference between the ranks of corresponding variables without any assumptions about the distribution.
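The three descriptions above correspond to the product-moment, concordance-based, and rank-based families of correlation coefficients. The sketch below shows one common formulation of each for two columns. It is an illustration under stated assumptions: ties receive average ranks in the rank-based version, and the concordance-based version is the simple tau-a form, which does not correct for ties. The software's exact variants are not documented here.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def rank_correlation(x, y):
    """Rank-based correlation: the product-moment correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def concordance_correlation(x, y):
    """(concordant pairs - discordant pairs) / total pairs, without tie correction."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    n = len(x)
    return score / (n * (n - 1) / 2)

conc_a = [2.05, 2.29, 0.84, 2.92, 1.87, 1.38, 0.43, 1.06,
          1.06, 1.14, 1.02, 1.33, 0.58, 2.56, 1.75, 0.3]
conc_b = [1.59, 0.92, 1.08, 1.43, 0.99, 0.82, 1.75, 0.79,
          1.41, 0.82, 1.28, 1.22, 0.7, 1.82, 1.37, 0.44]
print(pearson(conc_a, conc_b), rank_correlation(conc_a, conc_b),
      concordance_correlation(conc_a, conc_b))
```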

This determines whether the software should transpose the matrix first. Thus, it only works when the input is a matrix.

All the pictures below illustrate the following data set. All parameters are set as default.

Conc A | Conc B | Conc C | Conc D | Conc E | Conc F |

2.05 | 1.59 | 1.63 | 24.73 | 3.54 | 1.66 |

2.29 | 0.92 | 1.95 | 21.21 | 8.55 | 0.91 |

0.84 | 1.08 | 0.13 | 20.06 | 8.68 | 1.34 |

2.92 | 1.43 | 1.08 | 22.87 | 7.14 | 2.01 |

1.87 | 0.99 | 1.11 | 20.28 | 2.96 | 0.73 |

1.38 | 0.82 | 1.46 | 21.88 | 1.32 | 1.04 |

0.43 | 1.75 | 0.06 | 21.2 | 2.93 | 2.06 |

1.06 | 0.79 | 0.23 | 24.88 | 2.66 | 0.78 |

1.06 | 1.41 | 0.38 | 20.21 | 6.52 | 0.52 |

1.14 | 0.82 | 0.34 | 21.04 | 5.77 | 1.51 |

1.02 | 1.28 | 0.13 | 21.15 | 1.87 | 3.16 |

1.33 | 1.22 | 0.51 | 21.58 | 1.23 | 0.92 |

0.58 | 0.7 | 0.41 | 24.39 | 9.37 | 1.07 |

2.56 | 1.82 | 0.89 | 20.81 | 1.91 | 0.33 |

1.75 | 1.37 | 1.41 | 23.58 | 10.12 | 2.22 |

0.3 | 0.44 | 1.23 | 21.98 | 10.83 | 0.08 |

This scatter plot uses two variables to show the data points and a diagonal line spanning from the lowest to the highest values of X and Y. Thus, it only appears when the input has only two numerical columns.

BioVinci will show the pairwise correlation in a heatmap, which only appears when the input is a matrix.

The heatmap shows the pairwise covariance between all columns. It only appears when the input is a matrix.

This table of correlation and covariance only appears if the input has only two numerical columns.

This table provides the information for the Correlation heatmap.

This table provides the information for the Covariance heatmap. It does not show the covariance of a variable with itself (the diagonal), because that value is simply the variable's variance and would distort the color scale of the heatmap.

1. This is the dataset

Gene A | Gene B | Gene C | Gene D | Gene E |

6.17 | 3.47 | 2.66 | 1.49 | 0.84 |

6.14 | 3.4 | 2.8 | 1.47 | 0.7 |

4.31 | 4 | 2.03 | 1.94 | 0.16 |

5.28 | 3.09 | 2.25 | 1.76 | 0.78 |

4.96 | 3.53 | 2.85 | 1.43 | 0.22 |

6.87 | 3.82 | 2.57 | 1.63 | 1 |

6.19 | 3.3 | 2.24 | 1.98 | 0.85 |

5.89 | 3.79 | 2.25 | 1.73 | 0.66 |

4.4 | 3.7 | 2.09 | 1.71 | 0.17 |

5.72 | 3.54 | 2.45 | 1.87 | 0.58 |

4.42 | 3.26 | 2.55 | 1.53 | 0.4 |

4.69 | 3.94 | 2.06 | 1.04 | 0.75 |

2. Import it

3. Choose the Analysis tab and opt for Regression

4. Provide the inputs

5. Get the result

The function requires a numerical column for Y and at least one numerical column for X. So your data structure should have at least two numeric columns. We use the data below to illustrate the two types of regression.

Gene A | Gene B | Gene C | Gene D | Gene E |

6.17 | 3.47 | 2.66 | 1.49 | 0.84 |

6.14 | 3.4 | 2.8 | 1.47 | 0.7 |

4.31 | 4 | 2.03 | 1.94 | 0.16 |

5.28 | 3.09 | 2.25 | 1.76 | 0.78 |

4.96 | 3.53 | 2.85 | 1.43 | 0.22 |

6.87 | 3.82 | 2.57 | 1.63 | 1 |

6.19 | 3.3 | 2.24 | 1.98 | 0.85 |

5.89 | 3.79 | 2.25 | 1.73 | 0.66 |

4.4 | 3.7 | 2.09 | 1.71 | 0.17 |

5.72 | 3.54 | 2.45 | 1.87 | 0.58 |

4.42 | 3.26 | 2.55 | 1.53 | 0.4 |

4.69 | 3.94 | 2.06 | 1.04 | 0.75 |

To run this regression, drag your two variables to the X and Y placeholders.

To run a regression with multiple X variables, you can put all of them into a single matrix using the Create New Variables function (please refer to B. Create New Variables in the Other functions part). After that, drag the newly created variable to the X placeholder. Alternatively, drag the columns one by one to the X placeholder.

The picture below shows how to run the regression between a Y (Gene A) and multiple X (Gene B to E) in both ways.

This is the degree of the polynomial. By default, this option is set to 1, which gives a linear regression. The software can handle degrees up to 5.

This option determines whether the regression should go through the (0, 0) origin.
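To see what forcing the fit through the origin changes, the sketch below fits a straight line (polynomial degree 1) by ordinary least squares with and without an intercept. The `fit_line` helper is hypothetical and covers only the single-X, degree-1 case; it is not the software's implementation.

```python
def fit_line(x, y, through_origin=False):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    if through_origin:
        # Without an intercept the slope is sum(x*y) / sum(x*x).
        slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        return 0.0, slope
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Gene B as X and Gene A as Y, taken from the example dataset.
gene_b = [3.47, 3.4, 4, 3.09, 3.53, 3.82, 3.3, 3.79, 3.7, 3.54, 3.26, 3.94]
gene_a = [6.17, 6.14, 4.31, 5.28, 4.96, 6.87, 6.19, 5.89, 4.4, 5.72, 4.42, 4.69]

print(fit_line(gene_b, gene_a))
print(fit_line(gene_b, gene_a, through_origin=True))
```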

This option determines whether the software should transpose the X input first. Thus, it only works when the input for X is a matrix.

This is a scatter plot with the regression curve, which only appears with simple linear regression (Case 1 mentioned above). The shape of the regression curve depends on the degree of the polynomial.

This plot shows the differences between the observed Y and the predicted Y.

This bar plot shows the p values of coefficients.

This table shows the statistics to evaluate a regression model.

This table shows the information for each coefficient, including the intercept (if it exists).

This table shows the regression function.

This table shows the predicted (fitted) values and the differences with the observed values.

- This is the dataset

Group 1 | Group 2 |

1.15 | 0.67 |

0.75 | 0.88 |

0.84 | 0.05 |

1.02 | 0.78 |

1.01 | 0.99 |

0.88 | 0.32 |

0.94 | 0.2 |

1.04 | 0.2 |

1.09 | 0.18 |

1 | 0.05 |

- Import it

- Choose the Analysis tab and opt for the function

- Provide inputs

- Get the result

All the data structures below are valid for this test.

The table has two numerical columns, which are the two samples that you want to compare.

Group 1 | Group 2 |

1.15 | 0.67 |

0.75 | 0.88 |

0.84 | 0.05 |

1.02 | 0.78 |

1.01 | 0.99 |

0.88 | 0.32 |

0.94 | 0.2 |

1.04 | 0.2 |

1.09 | 0.18 |

1 | 0.05 |

Users can drag each column into each placeholder.

You’ve got a table with two columns: a numerical column holding the data of both samples being compared, and a categorical column with two unique labels that assign each value to a group. The order of these labels does not matter.

Value | Group |

0.79 | Group 1 |

0.78 | Group 1 |

0.97 | Group 1 |

1.08 | Group 1 |

0.9 | Group 1 |

0.87 | Group 2 |

0.78 | Group 2 |

1.05 | Group 2 |

0.8 | Group 2 |

0.79 | Group 2 |

In this case, users need to put the Value into the Column 1 placeholder and Group into the Column 2 placeholder.

Sample ID | Group 1 | Group 2 | ||

1 | 1.15 | 0.75 | 0.67 | 0.32 |

2 | 0.84 | 1.02 | 0.88 | 0.2 |

3 | 0.94 | 1.04 | 0.78 | 0.18 |

5 | 1.09 | 1 | 0.99 | 0.05 |

In this case, you must set the number of replications (which is 2 for this example) when uploading/adding data. After that, BioVinci will automatically create a transformed table, and you can use this to run the test in the same way as Type 2.

This is the most common parametric test that compares two independent samples. The test assesses whether there is any difference between the means of two samples. It assumes all variables are distributed normally and have the same variance.

This test also assesses whether there is any difference between the means of two independent samples. It assumes all variables are distributed normally without necessarily having the same variance.
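The two tests above differ only in how they estimate the standard error of the difference in means. The sketch below computes the t statistic and the degrees of freedom for both variants; it is an illustration using the textbook formulas, and it does not compute p values.

```python
import math
import statistics

def equal_variance_t(a, b):
    """t statistic and degrees of freedom assuming equal variances (pooled variance)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

def unequal_variance_t(a, b):
    """t statistic and approximate degrees of freedom without assuming equal variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a) / na, statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

group_1 = [1.15, 0.75, 0.84, 1.02, 1.01, 0.88, 0.94, 1.04, 1.09, 1]
group_2 = [0.67, 0.88, 0.05, 0.78, 0.99, 0.32, 0.2, 0.2, 0.18, 0.05]

print(equal_variance_t(group_1, group_2))
print(unequal_variance_t(group_1, group_2))
```

When both samples have the same size, the two t statistics coincide and only the degrees of freedom differ.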

This test assesses whether there is any difference between the means of two dependent samples. It assumes all variables are distributed normally. The test can only work when the two samples have the same number of observations.

This test is a nonparametric alternative to the Two Sample t-test. It assesses the difference between two samples based on the sum of ranks of all observations. Thus, it does not make any assumptions about the distribution of the samples.

This test is a nonparametric alternative to the Paired t-test. It assesses the difference between two samples based on the ranks of the differences within each pair of observations. Similarly, the test can only work when the two samples have the same number of observations.

The test assesses the equality of the variance of two independent samples. It assumes that all samples are distributed normally.

Users can choose whether this is a two-tailed or one-tailed test.

All the tables and graphics below are from the following example data (which is a Type 1 table). All the tests are two-tailed.

Group 1 | Group 2 |

1.15 | 0.67 |

0.75 | 0.88 |

0.84 | 0.05 |

1.02 | 0.78 |

1.01 | 0.99 |

0.88 | 0.32 |

0.94 | 0.2 |

1.04 | 0.2 |

1.09 | 0.18 |

1 | 0.05 |

This is the only visualization for this function. If the two samples are significantly different, the software will add an asterisk to the boxplot. The annotation depends on the p value: * for p < 0.05, ** for p < 0.01, *** for p < 0.001, and **** for p < 0.0001. Some journals may have a different requirement for p value annotation. Thus, users can customize the number of asterisks in the edit mode.

The result table differs among different hypothesis tests.

- Prepare your dataset

Value | Group |

0.78 | Group 1 |

0.17 | Group 1 |

0.2 | Group 1 |

0 | Group 1 |

0.95 | Group 2 |

0.95 | Group 2 |

0.8 | Group 2 |

0.82 | Group 2 |

1.02 | Group 3 |

0.54 | Group 3 |

0.53 | Group 3 |

0.96 | Group 3 |

- Import it

- Choose the Analysis tab and opt for the function

- Drag your data into the placeholders

- Get the result

All the data structures below are valid for this test.

This table contains one numerical column holding values and one categorical column holding group labels of all the observations.

Value | Group |

0.78 | Group 1 |

0.17 | Group 1 |

0.2 | Group 1 |

0 | Group 1 |

0.95 | Group 2 |

0.95 | Group 2 |

0.8 | Group 2 |

0.82 | Group 2 |

1.02 | Group 3 |

0.54 | Group 3 |

0.53 | Group 3 |

0.96 | Group 3 |

To run the test, users can put the numerical column and the categorical column to the Values and Factor placeholders, respectively.

This table contains multiple columns, each of which holds the values of a group (or a sample). Please note that BioVinci requires your table to include an ID column in order to run the test, as shown below:

ID | Group 1 | Group 2 | Group 3 |

1 | 0.78 | 0.95 | 1.02 |

2 | 0.17 | 0.95 | 0.54 |

3 | 0.2 | 0.8 | 0.53 |

4 | 0 | 0.82 | 0.96 |

With this structure, users must set the number of replications to 1 when importing/adding the data. After that, BioVinci will automatically create the transformed data, and you can use it to run the test in the same way as Type 1.

This table has replications for each observation.

ID | Group 1 | Group 2 | Group 3 | |||

1 | 0.78 | 0.17 | 0.95 | 0.95 | 1.02 | 0.54 |

2 | 0.2 | 0 | 0.8 | 0.82 | 0.53 | 0.96 |

In this case, users must set the number of replications (which is 2 in this example) when importing/adding the data. After that, BioVinci will automatically create the transformed data, and you can use it to run the test in the same way as Type 1.
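The transformation from a table with replications to the value/group layout can be pictured as follows. This is a sketch of the idea only: it assumes that consecutive columns belong to the same group, which is consistent with the example tables in this section, and the exact layout of the transformed table in the software may differ. The `wide_to_long` helper is hypothetical.

```python
def wide_to_long(group_names, rows, replications):
    """Turn rows of replicate columns into (value, group) pairs.

    rows holds only the value cells (no ID column); every block of
    `replications` consecutive columns belongs to one group.
    """
    long_rows = []
    for row in rows:
        for column, value in enumerate(row):
            long_rows.append((value, group_names[column // replications]))
    return long_rows

groups = ["Group 1", "Group 2", "Group 3"]
rows = [
    [0.78, 0.17, 0.95, 0.95, 1.02, 0.54],
    [0.2, 0, 0.8, 0.82, 0.53, 0.96],
]
for value, group in wide_to_long(groups, rows, replications=2):
    print(value, group)
```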

This test assesses whether the means of all independent samples are equal. It assumes all samples are normally distributed and have the same variance.
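The F statistic behind this test compares the variation between group means with the variation within groups. The sketch below uses the textbook formula on the example groups; it is an illustration only and does not compute the p value.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

group_1 = [0.78, 0.17, 0.2, 0]
group_2 = [0.95, 0.95, 0.8, 0.82]
group_3 = [1.02, 0.54, 0.53, 0.96]
print(one_way_anova_f([group_1, group_2, group_3]))
```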

This test assesses whether the mean ranks of all independent samples are equal. It is a non-parametric alternative to One-way ANOVA. It does not assume the normality of the samples.

This test assesses whether all the samples have equal variances. It is essential for many statistical tests that assume homogeneity of variances. It is sensitive to departures from normality, so if your samples derive from non-normal distributions, a significant result from Bartlett’s test may simply reflect the non-normality.

This test is an alternative to Bartlett’s test, but less sensitive to departures from normality. You should use this test only when the samples derive from non-normal distributions.

At the moment, BioVinci can only perform a post-hoc test with One-way ANOVA. These are single-step multiple comparison procedures that assess the pairwise differences between all samples to check whether two samples come from the same distribution/population.

This test is also called the Tukey’s range test or the Tukey’s honest significant difference. Users can perform it in conjunction with the one-way ANOVA. It assumes that all samples are independent, have equal variances, and come from normal distributions.

This test is also called the Fisher’s least significant difference test. It is basically a pairwise t-test on all samples; thus, it inherits all the assumptions of a Two-sample t-test. You should only use this test when the result of one-way ANOVA is significant.

This test is also called the Duncan’s multiple range test. After sorting all the samples by means, it provides the information of ranges. Thus, it is more permissive than the HSD procedure, which provides information of all pairs of samples.

To explain the result of this function, we use the example data below. Please remember that this table belongs to Case 2 (where each column represents a group), so you have to set the number of replications to 1 when you import your data.

ID | Group 1 | Group 2 | Group 3 |

1 | 0.78 | 0.95 | 1.02 |

2 | 0.17 | 0.95 | 0.54 |

3 | 0.2 | 0.8 | 0.53 |

4 | 0 | 0.82 | 0.96 |

Box plots visualize each sample as a box with whiskers. The plot only shows asterisks when you choose a post hoc test and at least one pair of samples is significantly different.

Similar to box plots, this is another common way to show the results of a multiple comparison.

The table shows the statistical results that help you evaluate your comparison. It differs among different hypothesis tests.

The table shows the pairwise comparison among samples. The number of asterisks depends on the p value of a particular pair of samples: * for p < 0.05, ** for p < 0.01, *** for p < 0.001, and **** for p < 0.0001. Some journals may have a different requirement for p value annotation. Users can customize the number of asterisks in the edit mode. This table below shows the results of an HSD post-hoc test.

1. This is the dataset

Treatment | Cancer | Control | ||||

Drug A | 0.93 | 1.24 | 0.95 | 1.46 | 0.8 | 0.02 |

Drug B | 2.15 | 1.6 | 1.77 | 1.58 | 1.31 | 1.06 |

placebo | 1.49 | 1.88 | 1.97 | 0.91 | 0.08 | 0.94 |

2. Import it

3. Choose the Analysis tab and opt for Two-way ANOVA

4. Provide the inputs

5. Get the results

You’ve got a dataset structured as below: one column that contains all the numerical values, and two categorical columns to classify such values into groups.

Treatment | Phenotype | Value |

Drug A | Cancer | 0.93 |

Drug B | Cancer | 2.15 |

placebo | Cancer | 1.49 |

Drug A | Cancer | 1.24 |

Drug B | Cancer | 1.6 |

placebo | Cancer | 1.88 |

Drug A | Cancer | 0.95 |

Drug B | Cancer | 1.77 |

placebo | Cancer | 1.97 |

Drug A | Control | 1.46 |

Drug B | Control | 1.58 |

placebo | Control | 0.91 |

Drug A | Control | 0.8 |

Drug B | Control | 1.31 |

placebo | Control | 0.08 |

Drug A | Control | 0.02 |

Drug B | Control | 1.06 |

placebo | Control | 0.94 |

With this data structure, you can easily run the function by dragging each column to the appropriate placeholder. More specifically, you should place the numerical column at Values and two categorical columns at the Factor 1 and Factor 2 placeholders.

This is the most common data structure, although it is not canonical. The first column represents the first grouping factor, while the others represent the second.

Treatment | Cancer | Control |

Drug A | 0.93 | 1.46 |

Drug B | 2.15 | 1.58 |

placebo | 1.49 | 0.91 |

Drug A | 1.24 | 0.8 |

Drug B | 1.6 | 1.31 |

placebo | 1.88 | 0.08 |

Drug A | 0.95 | 0.02 |

Drug B | 1.77 | 1.06 |

placebo | 1.97 | 0.94 |

With this kind of data, the user must set the number of replications to 1. After that, you can use the transformed table similar to a Type 1 table.

This closely resembles the Type 2 table, but with replications. First, you need to tell the software the number of replications when you import your dataset. In this example, the number of replications is 3.

Treatment | Cancer | Control | ||||

Drug A | 0.93 | 1.24 | 0.95 | 1.46 | 0.8 | 0.02 |

Drug B | 2.15 | 1.6 | 1.77 | 1.58 | 1.31 | 1.06 |

placebo | 1.49 | 1.88 | 1.97 | 0.91 | 0.08 | 0.94 |

After that, BioVinci will transform your table into a Type 1 table. Now you can simply run the function in the same way as Type 1 mentioned above.

At the moment, only one option is available in the Hypothesis test drop list: Two-way ANOVA. Future versions may offer more two-way alternatives.

BioVinci visualizes the results as a grouped bar plot with error bars. Users can freely switch to a boxplot or a violin plot if appropriate.

BioVinci will display the comparison results in two tables corresponding to each factor.

1. This is the dataset

Gender | Left handed | Right handed |

Male | 44 | 8 |

Female | 47 | 3 |

2. Import it

3. Choose the Analysis tab and opt for 2x2 contingency table

4. Drag your inputs into the placeholder and choose the appropriate test

5. Get the result

All the structures below are valid for this test.

This is the most common structure of a 2x2 contingency table.

Gender | Left handed | Right handed |

Male | 44 | 8 |

Female | 47 | 3 |

This table does not include the grouping column. In this case, the software will assign the two rows to Group 1 and Group 2, respectively.

Left handed | Right handed |

44 | 8 |

47 | 3 |

This drop list allows users to select the appropriate hypothesis test to assess the association between two categorical variables.

This is the most widely used test for independence when using a contingency table to assess how much the observed distribution fits with the expected distribution if variables are independent.

It assumes that the sample data are collected by random sampling from a fixed distribution/population and that all observations are independent of each other. This test should not be used when any expected cell count is less than 5.
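The expected counts, the residuals (observed minus expected), and the chi-squared statistic for the example table can be computed as sketched below. This illustration applies no continuity correction; whether the software applies one is not stated in this section.

```python
def chi_squared(table):
    """Return (expected counts, residuals, chi-squared statistic) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    residuals = [[o - e for o, e in zip(obs_row, exp_row)]
                 for obs_row, exp_row in zip(table, expected)]
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(table, expected)
                    for o, e in zip(obs_row, exp_row))
    return expected, residuals, statistic

observed = [[44, 8], [47, 3]]  # rows: Male, Female; columns: Left handed, Right handed
expected, residuals, statistic = chi_squared(observed)
print(expected)
print(residuals)
print(statistic)
```

Checking the smallest value in `expected` against 5 is a quick way to apply the rule of thumb mentioned above.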

Fisher's exact test assesses whether there is an association between the two variables by comparing the proportions. It assumes that the marginal totals are fixed (conditioned). This test performs well for small sample sizes.
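With the marginal totals fixed, the count in the top-left cell follows a hypergeometric distribution, so the p value can be obtained by summing exact probabilities. The sketch below uses one common two-sided convention (summing all tables that are no more probable than the observed one); this convention is an assumption of the sketch, and the software may use a different one.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided p value for a 2x2 table with fixed marginal totals."""
    (a, b), (c, d) = table
    row_1, row_2, col_1 = a + b, c + d, a + c
    n = row_1 + row_2

    def probability(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row_1, x) * comb(row_2, col_1 - x) / comb(n, col_1)

    lowest, highest = max(0, col_1 - row_2), min(row_1, col_1)
    p_observed = probability(a)
    # Small relative tolerance so that equally probable tables are not lost to rounding.
    return sum(p for p in map(probability, range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

print(fisher_exact_two_sided([[44, 8], [47, 3]]))
```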

Barnard's test is also an exact test that examines the association of two categorical variables. Unlike Fisher's exact test, it fixes only one set of marginal totals and estimates a nuisance parameter. For a 2x2 contingency table, it is a more powerful alternative to Fisher's exact test.

If you have a Type 2 contingency table, you can choose to transpose it before the test.

The visualizations for different hypothesis tests are quite similar. Here are the example data (a Type 1 contingency table) for all the visualizations below.

Gender | Left handed | Right handed |

Male | 44 | 8 |

Female | 47 | 3 |

BioVinci will generate a stacked bar chart to display two proportions.

This is your input contingency table with a color scale.

This heatmap shows the differences of the observed values and expected values. It only appears in Pearson's chi squared test.

This line chart shows how the p value changes with the nuisance parameter. This visualization only appears in Barnard's test.

The summary table shows all the basic statistics, corresponding to the hypothesis test that you choose.

You can find the transformed version of your contingency table on the left-hand side. This structure is compatible with all types of basic plots, so you can use it to construct plots of your choice.

The table shows the residual values (which compose the Residuals heatmap). It only appears in Pearson's chi squared test.

This table presents the numbers used for the Nuisance parameter line chart. It only shows up in Barnard's test.

Random forest is an ensemble learning method for classification, which constructs a multitude of decision trees. Classification results will be aggregated from all these decision trees.

This section illustrates how to construct the classification model (Random Forest).

1. This is the dataset

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

2. Import it

3. Choose the Analysis tab and opt for Random forest

4. Prepare features

5. Provide the inputs to the placeholders

6. Get the results

We use the iris dataset to explain the inputs of the Random Forest functions.

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

This placeholder should contain a categorical column, which classifies each observation by labels. For the iris dataset (in the example above), you should put the Species column into this placeholder.

This placeholder should contain a matrix of numerical columns. Each variable (column) is a feature. For the iris dataset, you have to create a table that does not include the Species column (please refer to our detailed instructions on how to create variables). You can also drag multiple numerical columns to this placeholder instead, but the Transpose option will not work in this case.

The number of decision trees that Random forests generate (Default is 100).

This is the initial number for random algorithms in Random forests (Default is 1). Random forest is stochastic. It is important to set a random state so that your result is reproducible.

We use the Iris dataset as an example for all the instructions below. The dataset contains 150 records with 5 attributes: Petal Length, Petal Width, Sepal Length, Sepal Width, and Class. You can download the dataset here:

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

The figure presents the error rate of the Random Forest method for the Iris dataset. The horizontal axis represents the number of trees, the vertical axis represents the error rate (%), and the lines (in various colors) represent groups. When the number of trees changes, the error rate changes correspondingly.

The software shows the importance of the features for each class in a heatmap (as above).

This is the Confusion matrix for the Iris dataset. An (i, j) cell in the matrix contains the number of samples known to be in group i and predicted to be in group j. In the figure above, 50 samples belonging to Iris-setosa are predicted to be in the Iris-setosa group, and 5 samples belonging to Iris-versicolor are predicted to be in Iris-virginica.
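How the cells of a confusion matrix and the per-class error rates are counted can be sketched as follows. The labels and predictions below are made up for illustration and are not the Iris results; the helper names are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] = number of samples in class labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def class_error_rates(matrix):
    """Share of each row that falls off the diagonal (wrong predictions)."""
    return [1 - row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(matrix)]

labels = ["setosa", "versicolor", "virginica"]
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

matrix = confusion_matrix(actual, predicted, labels)
print(matrix)
print(class_error_rates(matrix))
```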

This heatmap shows the mean decrease in accuracy and the GINI score of each feature.

The table displays the importance of the features in each class and in general. It provides information for the heatmap of Importance matrix and Importance table.

This table shows the residuals of the predictions compared to the observed result. It has two columns: the measurement (residual) and the row index. The order of observations in this table is the same as in the input data.

The table shows the margin of the true class in each observation.
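As a rough illustration of what a margin can mean, the sketch below follows the definition used by R's randomForest package: the fraction of tree votes for the true class minus the largest fraction of votes for any other class. Whether BioVinci computes its margin table exactly this way is an assumption, and the vote fractions are made up for the example.

```python
def margin(vote_fractions, true_class):
    """Votes for the true class minus the best competing class.

    Positive values mean the observation is classified correctly by
    majority vote; negative values mean it is misclassified.
    """
    others = [v for c, v in vote_fractions.items() if c != true_class]
    return vote_fractions.get(true_class, 0.0) - max(others, default=0.0)


# Hypothetical vote fractions for one observation (not BioVinci output).
votes = {"Iris-setosa": 0.9, "Iris-versicolor": 0.1, "Iris-virginica": 0.0}
print(margin(votes, "Iris-setosa"))
```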

This table shows the classification accuracy of the prediction. The diagonal values are the numbers of correct classifications, while the off-diagonal values are the incorrect ones. The last column is the error rate of each class.
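To make the layout of such a table concrete, here is a minimal sketch that builds it from observed and predicted labels. The function name and the toy labels are hypothetical; they are not BioVinci's code or the Iris results.

```python
from collections import Counter


def confusion_table(observed, predicted):
    """Return (labels, rows). Row i holds the counts of samples observed
    in class i for each predicted class, followed by the class error rate."""
    labels = sorted(set(observed) | set(predicted))
    counts = Counter(zip(observed, predicted))
    rows = []
    for true_label in labels:
        row = [counts[(true_label, p)] for p in labels]
        total = sum(row)
        correct = counts[(true_label, true_label)]
        error_rate = (total - correct) / total if total else 0.0
        rows.append(row + [error_rate])
    return labels, rows


# Hypothetical toy labels.
observed = ["a", "a", "a", "b", "b", "b", "b", "c"]
predicted = ["a", "a", "b", "b", "b", "b", "c", "c"]
labels, rows = confusion_table(observed, predicted)
for label, row in zip(labels, rows):
    print(label, row)
```

The diagonal entries are the correct classifications, so the error rate of a class is simply one minus its diagonal count divided by its row total.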

The table shows the changes in error rates when the number of trees increases. The Feature column lists the labels for the error rates of all the classes. The General label is the overall error rate.

The table shows how many times a feature was used to create a branch in the decision tree.

k-means clustering is a popular method for cluster analysis in machine learning. k-means clustering aims to partition n samples into k clusters in which each sample belongs to the cluster with the nearest mean.

1. This is the dataset

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

2. Import it

3. Choose Analysis tab and select k-means clustering

4. Drag the input data to appropriate placeholders and set the suitable parameters

5. Get the results

In this section, we use the iris dataset to explain the inputs of the k-means clustering function.

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

Users need to drag two dimensions, which must be numeric columns, to the A column / matrix and the 2nd coordinate (optional) placeholders, respectively (as with Petal.Width and Sepal.Width in the example below).

If you want to perform k-means clustering for high dimensional data, you need a table where each column is a feature, and drag it to the first placeholder (A column/matrix).

For the iris dataset, users can create a new variable that excludes the Species column (please refer to our detailed instructions on how to create variables). In this case, you do not need to drag any data to the 2nd coordinate (optional) placeholder.

This field defines the number of clusters as well as the number of centroids to generate.

This field defines the seed for the random number generator used by k-means clustering (Default is 1). k-means clustering is stochastic, so it is important to set a random state to make your result reproducible.
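The role of the seed can be illustrated with a small, self-contained k-means sketch (Lloyd's algorithm). This is not BioVinci's implementation; the function name, the way the starting centroids are drawn, and the toy points are assumptions made for the example.

```python
import random


def kmeans(points, k, seed=1, iterations=100):
    """Minimal k-means: returns (assignment, centers)."""
    rng = random.Random(seed)            # fixed seed gives the same starting centroids
    centers = rng.sample(points, k)      # k distinct points as initial centroids
    assignment = None
    for _ in range(iterations):
        # Assign every point to the nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break                        # converged: nothing moved
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(kmeans(points, k=2, seed=1))
print(kmeans(points, k=2, seed=1))  # same seed, same result
```

Running the function twice with the same seed returns identical clusters, which is the reproducibility that the random state field is meant to give.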

This is a Voronoi diagram. The number of partitions (or colors) is equal to the number of clusters.

This line plot shows the changes of values in each feature. The number of colors indicates the number of clusters (which is 3 in this example).

This heatmap shows the information of the clusters. Sum.of.squares is the sum of squared distances from each data point to the center of each cluster. Size is the number of values in each cluster.
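The two quantities can be sketched as follows. The dictionary keys mirror the labels in the heatmap, but the function itself and the toy data are hypothetical, and we assume the center of a cluster is the mean of its members.

```python
def cluster_summary(points, assignment):
    """Size and within-cluster sum of squares for each cluster label."""
    summary = {}
    for cluster in sorted(set(assignment)):
        members = [p for p, a in zip(points, assignment) if a == cluster]
        center = [sum(col) / len(members) for col in zip(*members)]
        sum_of_squares = sum(
            (x - c) ** 2 for p in members for x, c in zip(p, center)
        )
        summary[cluster] = {"Size": len(members), "Sum.of.squares": sum_of_squares}
    return summary


# Hypothetical points and cluster labels.
print(cluster_summary([(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)], [0, 0, 1]))
```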

This table provides the information for the Cluster’s information heatmap.

This table shows which cluster each value belongs to.

The table shows the coordinates of the center of each cluster.

1. This is the dataset

Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Sample 7 | Sample 8 | Sample 9 | Sample 10 |

0.48 | 2.62 | 2.69 | 0.61 | 0.07 | 2.97 | 0.71 | 2.76 | 0.8 | 2.77 |

0.99 | 2.02 | 2.15 | 0.82 | 0.18 | 2.3 | 0.24 | 2.46 | 0 | 2.56 |

0.06 | 0.27 | 0.55 | 0.87 | 0.75 | 0.32 | 0.83 | 0.74 | 0.73 | 0.48 |

2.66 | 2.75 | 2.43 | 2.73 | 2.89 | 2.07 | 3 | 0.5 | 2.6 | 0.77 |

2.52 | 2.45 | 2.28 | 2.22 | 2.99 | 2.21 | 2.31 | 0.17 | 2.6 | 0.17 |

0.8 | 2.99 | 2.26 | 0.45 | 0.11 | 2.46 | 0.27 | 2.02 | 0.2 | 2.41 |

0.94 | 2.65 | 2.01 | 0.1 | 0.39 | 2.78 | 0.71 | 2.47 | 0.35 | 2.73 |

0.68 | 0.39 | 0.72 | 0.39 | 0.88 | 0.45 | 0.82 | 0.99 | 0.52 | 0.63 |

2.69 | 2.85 | 2.26 | 2.78 | 2.67 | 2.29 | 2.32 | 0.79 | 2.38 | 0.63 |

2.41 | 2.81 | 2.57 | 2.7 | 2.21 | 2.03 | 2.12 | 0.31 | 2.36 | 0.03 |

2.79 | 2.43 | 2.09 | 2.83 | 2.31 | 2.02 | 2.54 | 0.58 | 2.97 | 0.82 |

2.65 | 2.64 | 2.24 | 2.72 | 2.6 | 2.1 | 2.56 | 0.53 | 2.68 | 0.26 |

0.04 | 2.67 | 2.81 | 0.24 | 0.94 | 2.34 | 0.36 | 2.35 | 0.18 | 2.49 |

0.03 | 2.39 | 2.8 | 0.17 | 0.53 | 2.87 | 0.96 | 2.56 | 0.19 | 2.48 |

0.58 | 0.19 | 0.4 | 0.23 | 0.01 | 0.89 | 0.93 | 0.86 | 0.77 | 0.15 |

2.96 | 2.49 | 2.79 | 2.55 | 2.91 | 2.61 | 2.41 | 0.34 | 2.17 | 0.92 |

2.46 | 2.24 | 2.79 | 2.84 | 2.25 | 2.77 | 2.75 | 0.77 | 2.71 | 0.84 |

0.92 | 2.86 | 2.23 | 0.46 | 0.94 | 2.04 | 0.45 | 2.76 | 0.32 | 2.11 |

0.39 | 2.01 | 2.79 | 0.48 | 0.4 | 2.48 | 0.45 | 2.39 | 0.23 | 2.72 |

0.69 | 0.37 | 0.95 | 0.77 | 0.77 | 0.8 | 0.18 | 0.05 | 0.22 | 0.61 |

2. Import it

3. Choose Analysis tab and select Hierarchical clustering

4. Provide the inputs

5. Get the result

You just need to make sure that all numeric columns in the table are appropriate for the calculation of distances. This function automatically excludes all categorical columns.

For example, you’ve got a table as below.

Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Sample 7 | Sample 8 | Sample 9 | Sample 10 |

0.48 | 2.62 | 2.69 | 0.61 | 0.07 | 2.97 | 0.71 | 2.76 | 0.8 | 2.77 |

0.99 | 2.02 | 2.15 | 0.82 | 0.18 | 2.3 | 0.24 | 2.46 | 0 | 2.56 |

0.06 | 0.27 | 0.55 | 0.87 | 0.75 | 0.32 | 0.83 | 0.74 | 0.73 | 0.48 |

2.66 | 2.75 | 2.43 | 2.73 | 2.89 | 2.07 | 3 | 0.5 | 2.6 | 0.77 |

2.52 | 2.45 | 2.28 | 2.22 | 2.99 | 2.21 | 2.31 | 0.17 | 2.6 | 0.17 |

0.8 | 2.99 | 2.26 | 0.45 | 0.11 | 2.46 | 0.27 | 2.02 | 0.2 | 2.41 |

0.94 | 2.65 | 2.01 | 0.1 | 0.39 | 2.78 | 0.71 | 2.47 | 0.35 | 2.73 |

0.68 | 0.39 | 0.72 | 0.39 | 0.88 | 0.45 | 0.82 | 0.99 | 0.52 | 0.63 |

2.69 | 2.85 | 2.26 | 2.78 | 2.67 | 2.29 | 2.32 | 0.79 | 2.38 | 0.63 |

2.41 | 2.81 | 2.57 | 2.7 | 2.21 | 2.03 | 2.12 | 0.31 | 2.36 | 0.03 |

2.79 | 2.43 | 2.09 | 2.83 | 2.31 | 2.02 | 2.54 | 0.58 | 2.97 | 0.82 |

2.65 | 2.64 | 2.24 | 2.72 | 2.6 | 2.1 | 2.56 | 0.53 | 2.68 | 0.26 |

0.04 | 2.67 | 2.81 | 0.24 | 0.94 | 2.34 | 0.36 | 2.35 | 0.18 | 2.49 |

0.03 | 2.39 | 2.8 | 0.17 | 0.53 | 2.87 | 0.96 | 2.56 | 0.19 | 2.48 |

0.58 | 0.19 | 0.4 | 0.23 | 0.01 | 0.89 | 0.93 | 0.86 | 0.77 | 0.15 |

2.96 | 2.49 | 2.79 | 2.55 | 2.91 | 2.61 | 2.41 | 0.34 | 2.17 | 0.92 |

2.46 | 2.24 | 2.79 | 2.84 | 2.25 | 2.77 | 2.75 | 0.77 | 2.71 | 0.84 |

0.92 | 2.86 | 2.23 | 0.46 | 0.94 | 2.04 | 0.45 | 2.76 | 0.32 | 2.11 |

0.39 | 2.01 | 2.79 | 0.48 | 0.4 | 2.48 | 0.45 | 2.39 | 0.23 | 2.72 |

0.69 | 0.37 | 0.95 | 0.77 | 0.77 | 0.8 | 0.18 | 0.05 | 0.22 | 0.61 |

Then you can either drag the whole table or multiple columns to the Matrix placeholder. But the Transpose option is only available for the former method.

You can choose the clustering method here, among Ward’s minimum variance, Complete linkage, Single linkage, UPGMA, and WPGMA, to determine how BioVinci will calculate the distance between clusters.

In this drop-down list, you can choose the appropriate metric for calculating the distance between variables: Euclidean, Maximum, Manhattan, Canberra, or Binary distance.
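For readers who want to see how these metrics differ, the sketch below implements them for two vectors. The definitions follow those documented for R's dist function, which offers metrics with the same names; that BioVinci computes them identically is an assumption.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def maximum(x, y):
    # Largest absolute difference over all coordinates.
    return max(abs(a - b) for a, b in zip(x, y))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def canberra(x, y):
    # Terms where both values are zero (0/0) are skipped.
    return sum(
        abs(a - b) / (abs(a) + abs(b))
        for a, b in zip(x, y)
        if abs(a) + abs(b) > 0
    )


def binary(x, y):
    # Non-zero values count as "on". The distance is the share of positions
    # where exactly one vector is on, among positions where at least one is on.
    either = sum(1 for a, b in zip(x, y) if a != 0 or b != 0)
    differ = sum(1 for a, b in zip(x, y) if (a != 0) != (b != 0))
    return differ / either if either else 0.0


print(euclidean((0, 0), (3, 4)), maximum((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
```

Euclidean and Manhattan are sensitive to the magnitude of the values, while Canberra weighs each coordinate relative to its size and Binary looks only at presence or absence.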

With this option, users can choose to show a dendrogram on the plot or not.

Please note that BioVinci always performs clustering on the columns. Here you can choose whether it should apply clustering on the rows, too, either with or without the dendrogram.

You can choose to apply log transform for all values. If you select scaling or centering, the software will apply log transform before the scaling or centering function.

This option will transpose the matrix, then run the function with the newly transposed table.

This is a heatmap with dendrograms that shows the clustering results on both rows and columns.

This heatmap shows the pairwise distance between the columns.

This heatmap shows the pairwise distance between the rows. It only appears when users turn on the Cluster by row option.

This table shows the data after clustering. The Row name indicates the original index.

This table provides information for the Column distance heatmap.

The table provides information for the Row distance heatmap.

1. This is the dataset

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

2. Import it

3. Choose the Analysis tab and select Principal component analysis

4. Provide the inputs

5. Get the result

Your data look like this:

ID | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 | Class |

Sample 1 | 1.38 | 2.52 | 2.25 | 1.44 | 0.33 | A |

Sample 2 | 0.09 | 2.31 | 0.21 | 1.08 | 1.74 | B |

Sample 3 | 0.93 | 2.16 | 2.49 | 2.22 | 1.68 | B |

Sample 4 | 0.27 | 2.91 | 1.35 | 2.34 | 1.59 | C |

Sample 5 | 1.68 | 0.09 | 0.03 | 0.06 | 1.98 | A |

Sample 6 | 0.87 | 0.09 | 0 | 1.83 | 1.05 | A |

Sample 7 | 2.88 | 2.97 | 0.9 | 1.8 | 2.25 | C |

Sample 8 | 2.04 | 2.85 | 2.34 | 1.23 | 0.15 | B |

Sample 9 | 1.32 | 0.33 | 1.38 | 2.64 | 2.67 | C |

You want to run a PCA where each row is a data point on the plot. In this case, you can drag the whole table to the Features placeholder. The function will automatically remove all categorical columns from this table. If your samples are classified (as shown above in the Class column), you can drag this categorical column to the Class (optional) placeholder. If you don’t, the PCA plot will show all data points in a single color.

You can also drag multiple columns to the Features placeholder instead. But the Transpose matrix option will not work in this case.

If you are working with transcriptomic data, this kind of table is common: each row is a gene and each column is a sample. In this case, you want to run a PCA where each column is a data point in the plot.

Gene | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |

ENSG00000157782 | 0.48 | 2.62 | 2.69 | 0.61 | 0.07 |

ENSG00000157783 | 0.99 | 2.02 | 2.15 | 0.82 | 0.18 |

ENSG00000157784 | 0.06 | 0.27 | 0.55 | 0.87 | 0.75 |

ENSG00000157785 | 2.66 | 2.75 | 2.43 | 2.73 | 2.89 |

ENSG00000157786 | 2.52 | 2.45 | 2.28 | 2.22 | 2.99 |

ENSG00000157787 | 0.8 | 2.99 | 2.26 | 0.45 | 0.11 |

ENSG00000157788 | 0.94 | 2.65 | 2.01 | 0.1 | 0.39 |

ENSG00000157789 | 0.68 | 0.39 | 0.72 | 0.39 | 0.88 |

ENSG00000157790 | 2.69 | 2.85 | 2.26 | 2.78 | 2.67 |

To run the function, you can drag the whole table to the Features placeholder. Then, check the Transpose matrix box.

In many cases, you also have a metadata table that gives more details on the samples. Please note that, in this metadata table, the order of the sample should be the same as in the data table.

ID | Gender | Age group | Phenotype |

Sample 1 | Male | Elder | Benign |

Sample 2 | Male | Elder | Malignant |

Sample 3 | Female | Adult | Benign |

Sample 4 | Male | Adult | Malignant |

Sample 5 | Female | Adult | Malignant |

If you have a metadata table, import it by clicking the Add data button on the top left corner (please refer to our instructions on Adding data). Then you can drag a categorical column from that metadata table to the Class (optional) placeholder.
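Because the metadata rows must follow the same order as the samples in the data table, it can be worth checking this before importing. The helper below is a hypothetical pre-import check written outside BioVinci; the function name is our own.

```python
def first_mismatch(data_samples, metadata_ids):
    """Return the position of the first sample whose name differs between
    the data table and the metadata table, or None if the order matches."""
    if len(data_samples) != len(metadata_ids):
        raise ValueError("The two tables list a different number of samples.")
    for position, (d, m) in enumerate(zip(data_samples, metadata_ids)):
        if d != m:
            return position
    return None


data_samples = ["Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5"]
metadata_ids = ["Sample 1", "Sample 2", "Sample 3", "Sample 4", "Sample 5"]
print(first_mismatch(data_samples, metadata_ids))  # None means the order matches
```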

This option determines how BioVinci handles missing values. If the filter mode is features (default), features with missing values will be excluded. If the filter mode is samples, samples with missing values will be excluded.
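The difference between the two filter modes can be sketched as below, assuming a matrix where each row is a sample, each column is a feature, and a missing value is represented by None. The function is an illustration, not BioVinci's code.

```python
def filter_missing(rows, mode="features"):
    """Drop features (columns) or samples (rows) that contain a missing value."""
    if mode == "samples":
        return [row for row in rows if None not in row]
    if mode == "features":
        keep = [
            j for j in range(len(rows[0]))
            if all(row[j] is not None for row in rows)
        ]
        return [[row[j] for j in keep] for row in rows]
    raise ValueError("mode must be 'features' or 'samples'")


matrix = [
    [1.0, None, 3.0],
    [4.0, 5.0, 6.0],
]
print(filter_missing(matrix, "features"))  # the second feature is dropped
print(filter_missing(matrix, "samples"))   # the first sample is dropped
```

Filtering by features keeps every sample but may lose variables, while filtering by samples keeps every variable but may lose data points.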

With this option, you can choose to apply log transform to all values. If you select scaling or centering, the software will apply log transform before the scaling or centering function.
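The order of operations described here can be sketched for a single column as follows. The details are assumptions made for illustration: we use log2(x + 1) as the log transform, subtract the column mean for centering, and divide by the sample standard deviation for scaling. The text does not state which log base or offset BioVinci uses.

```python
import math
import statistics


def preprocess(column, log=True, center=True, scale=True):
    """Apply the log transform first, then centering, then scaling."""
    # Assumed transform: log2(x + 1), so that zeros stay defined.
    values = [math.log2(v + 1) for v in column] if log else list(column)
    if center:
        mean = statistics.fmean(values)
        values = [v - mean for v in values]
    if scale:
        sd = statistics.stdev(values)
        if sd > 0:
            values = [v / sd for v in values]
    return values


print(preprocess([0, 1, 3], center=False, scale=False))
print(preprocess([0, 1, 3]))
```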

This option will transpose the matrix, then run the function with the newly transposed table.
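Transposing simply swaps rows and columns, so a table with genes as rows and samples as columns becomes one with samples as rows. A minimal sketch, using the first values of the gene table above:

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(column) for column in zip(*matrix)]


# Two genes (rows) measured in three samples (columns).
genes_by_samples = [
    [0.48, 2.62, 2.69],
    [0.99, 2.02, 2.15],
]
samples_by_genes = transpose(genes_by_samples)
print(samples_by_genes)
```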

We use the Iris dataset for all the visualizations below.

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

This plot presents samples in 3D space in which the axes are the first three principal components.

1. This is the dataset

https://drive.google.com/file/d/1vknQdQOvpBKJfUYaw2NdiSycNwnP74vS/view?usp=sharing

2. Import it

3. Choose Analysis tab and select t-distributed stochastic neighbour

4. Provide the inputs

5. Get the result

You’ve got a data table as below, each column of which represents a feature, and you want to run a t-SNE where each row is a data point.

ID | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 | Class |

Sample 1 | 1.38 | 2.52 | 2.25 | 1.44 | 0.33 | A |

Sample 2 | 0.09 | 2.31 | 0.21 | 1.08 | 1.74 | B |

Sample 3 | 0.93 | 2.16 | 2.49 | 2.22 | 1.68 | B |

Sample 4 | 0.27 | 2.91 | 1.35 | 2.34 | 1.59 | C |

Sample 5 | 1.68 | 0.09 | 0.03 | 0.06 | 1.98 | A |

Sample 6 | 0.87 | 0.09 | 0 | 1.83 | 1.05 | A |

Sample 7 | 2.88 | 2.97 | 0.9 | 1.8 | 2.25 | C |

Sample 8 | 2.04 | 2.85 | 2.34 | 1.23 | 0.15 | B |

Sample 9 | 1.32 | 0.33 | 1.38 | 2.64 | 2.67 | C |

To run this function, you can drag the whole table to the Features placeholder. The function will automatically exclude all categorical columns from your inputs. If your samples are classified (as shown above in the Class column), you can drag this categorical column to the Class (optional) placeholder. If you don’t, the t-SNE plot will show all data points in a single color.

You can also drag multiple columns to the Features placeholder, instead of the whole table. But the Transpose matrix option will not work in this case.

If you are working with transcriptomic data, you will find this table familiar: each row is a gene and each column is a sample. And you want to run a t-SNE where each column is a data point in the plot.

Gene | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |

ENSG00000157782 | 0.48 | 2.62 | 2.69 | 0.61 | 0.07 |

ENSG00000157783 | 0.99 | 2.02 | 2.15 | 0.82 | 0.18 |

ENSG00000157784 | 0.06 | 0.27 | 0.55 | 0.87 | 0.75 |

ENSG00000157785 | 2.66 | 2.75 | 2.43 | 2.73 | 2.89 |

ENSG00000157786 | 2.52 | 2.45 | 2.28 | 2.22 | 2.99 |

ENSG00000157787 | 0.8 | 2.99 | 2.26 | 0.45 | 0.11 |

ENSG00000157788 | 0.94 | 2.65 | 2.01 | 0.1 | 0.39 |

ENSG00000157789 | 0.68 | 0.39 | 0.72 | 0.39 | 0.88 |

ENSG00000157790 | 2.69 | 2.85 | 2.26 | 2.78 | 2.67 |

To run the function, drag the whole table to the Features placeholder. Then, activate the Transpose matrix option.

In many cases, you may have a metadata table that gives more details about the sample as below. Please note that, in this metadata table, the order of the samples should be the same as in the data table.

ID | Gender | Age group | Phenotype |

Sample 1 | Male | Elder | Benign |

Sample 2 | Male | Elder | Malignant |

Sample 3 | Female | Adult | Benign |

Sample 4 | Male | Adult | Malignant |

Sample 5 | Female | Adult | Malignant |

If you have a metadata table, import it by clicking the Add data button on the top left corner (please refer to our instructions on Adding data). Then you can drag a categorical column from that metadata table to the Class (optional) placeholder.

This is the trade-off between speed and accuracy for Barnes-Hut t-SNE. ‘theta’ is the angular size of a distant node as measured from a point. If this size is below ‘theta’, the node is used as a summary of all points contained within it. The method is not very sensitive to changes in this parameter in the range of 0.2 - 0.8. Angles less than 0.2 quickly increase the computation time, and angles greater than 0.8 quickly increase the error.

The perplexity controls how t-SNE balances attention between local and global aspects of your data. It is related to the number of nearest neighbors used in other manifold learning algorithms. Larger datasets usually require a larger perplexity. Consider selecting a value between 5 and 50. The choice is usually not critical, since t-SNE is fairly insensitive to this parameter.

This number is the seed used to initialize the random number generator.
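The link between perplexity and the number of neighbors can be seen from its definition: the perplexity of a probability distribution is 2 raised to its Shannon entropy (in bits). The sketch below only illustrates this definition; it is not part of BioVinci.

```python
import math


def perplexity(probabilities):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


# A point that spreads its attention evenly over 4 neighbors.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
# A point that concentrates most of its attention on one neighbor.
print(perplexity([0.85, 0.05, 0.05, 0.05]))
```

A distribution spread evenly over k neighbors has a perplexity of k, which is why the parameter can be read as an effective number of neighbors for each point.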

The learning rate for t-SNE is usually from 10 to 1000. If the learning rate is too high, the data may look like a ‘ball’, with any point approximately equidistant from its nearest neighbors. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers. If the cost function gets stuck in a bad local minimum, increasing the learning rate may help.

This is the maximum number of iterations for the optimization. It should be at least 250. By default, BioVinci sets this number at 1000.

This option determines how BioVinci handles missing values. If the filter mode is features (default), features with missing values will be excluded. If the filter mode is samples, samples with missing values will be excluded.

Here you can choose to apply log transform for all values. If you select scaling or centering, the software will apply log transform before the scaling or centering function.

This option will transpose the matrix, then run the function with the newly transposed table.

1. Here are the datasets (2 CSV files)

https://drive.google.com/file/d/1121kl4PX5zjzBkkVehMulBmoZhiNXZm4/view?usp=sharing

https://drive.google.com/file/d/1ilwxRiaiF7hcvzEStTJMWg3zR_FDEvsg/view?usp=sharing

2. Import the first file

3. Import the second file (using the Add data button)

4. Choose Analysis tab and select Sparse canonical correlation analysis

5. Provide the inputs

6. Get the result

You need to provide two data tables for the function to find the correlation between matrices. You can upload an Excel file that contains two sheets, or upload each CSV file manually (import the first one, then use the Add data function to upload the other).

This determines the number of canonical vector pairs obtained from two matrices.

Number of permutations to be run to select the best parameters.

Here you can choose to apply log transform for all values. If you select scaling or centering, the software will apply log transform before the scaling or centering function.

BioVinci can only visualize the results in the Correlation plot, which is a scatter plot with a diagonal line to help you visually estimate the correlation. It also shows an annotation for the correlation and the covariance. The number of plots depends on the number of canonical vectors.
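The two annotated quantities can be reproduced for any pair of vectors with the sketch below. It assumes the sample covariance (dividing by n - 1) and the Pearson correlation; the text does not state which estimators BioVinci reports, and the input values are made up.

```python
import math


def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical canonical variates from the two matrices.
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 4.0, 6.0, 8.0]
print(covariance(u, v), correlation(u, v))
```

Points lying exactly on a straight line with positive slope give a correlation of 1, which corresponds to points falling on the diagonal line in the plot.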

BioVinci uses only the table of non-zero coefficients to list out the results. The number of tables is determined by the number of canonical vectors. Each table shows the coefficients of each row in the two matrices in the corresponding vector.

To use most of the editing functions, click on the Edit plot button right above your plot.

Once you’ve landed on the Edit plot mode, just click on any points and a setting panel will pop up to customize the appearance of the data points on your plot.

This setting adjusts the color or the pattern of your data points; it also allows changing their shape.

With this setting, you can change the line color, line width, and line style.

By default, the Group box is always checked each time you customize something, which means your changes will be applied for the whole group of data on your plot.

To customize a single data point without affecting the others, click on it and tick the Individual box before changing its color and pattern.

To customize the bars, boxes, and violins, users need to click on the object while in the Edit plot mode. The panel below will pop up.

You can adjust the color or pattern of the objects in Fill and the borderline in Stroke. To customize a single object without affecting the others in the same group, click on it and tick the Individual box before changing its color and pattern.

Users can click on the background while in the Edit plot mode to change the background color and opacity. This setting only controls the plot background, not the whole plotting area. In the picture below, just the plot background is gray, while the other parts of the picture are transparent.

Users can hide or show the gridlines by using the Grid line tick box on the top corner in Edit plot. This option affects both X and Y gridlines.

You can click on any grid lines in the plot to customize their width, style, and color. To hide just the horizontal grid lines or vertical grid lines on the plot, click on them, and uncheck the Show grid box.

To customize the zero line, click on it when you’re in the Edit plot mode. You can freely edit its width, style, color, and even choose to hide it.

You can click and drag to move the legend around the plot area. This function is available in both normal mode and edit mode.

Users can hide or show the legend by using the Legend tick box in Edit plot.

Users can edit the content of the legend by using double-clicks while in the edit mode. It only has visual effects and won’t affect the hover information.

The color scale is a special setting for heatmaps in the Edit plot mode. Click on the color scale on the heatmap, then choose among the sample scales on the left side and reverse the scale if you wish.

BioVinci also allows you to create your own color scale. Simply click on each color, or input the color code (HEX) and hit the plus button, one by one. If you’re not satisfied with your scale, click the Reset button (the rotation arrows) and start over.

The animation below shows how to customize the color scale both ways.

By using the term Text, we mean all kinds of text on the plot, including the plot title, axis labels, legends, and annotations.

You just need to turn on the Edit plot mode, then click and drag the text to reposition it.

To open this text editing panel, click Edit plot and double-click on any text you wish to change. Besides changing the content, you can also customize the font family, font style, size, and color.

To add text, you can use the Add text button in the edit mode.

In the Edit plot mode, users can click on any axis tick values to customize their appearance, the axis range and the number of ticks on the axis.

As shown above, you can adjust the angle by dragging the angle controller or entering the angle by yourself. You can also hide/show the tick values.

To adjust the number of ticks and the axis range, users can enter a suitable number to the text boxes on the right corner.

In the edit mode, users can click on the axis to customize its width, style and color.

All the size options are located on the bottom right corner. Make your plot ready for publication by using the common sizes that we offer, or set a custom size of your choice.

This setting scales your plot into a square, which has the side length of 8.8 centimeters, or 3.46 inches.

This setting scales your plot into a square, which has the side length of 18.3 centimeters, or 7.2 inches.

This setting scales your plot to fit the screen so you don’t have to scroll to view all the details.

The Custom icon represents the manual mode, where users can freely decide the width, height, and units of measurement (among centimeter, millimeter, inch, and pixel).

If you choose to size your plot in pixels, note that the pixel density (PPI, pixels per inch, or PPCM, pixels per centimeter) differs among screens. Therefore, a plot measured in pixels may appear at different sizes on different machines.
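The arithmetic behind the preset sizes and the pixel caveat can be checked with a few lines. The pixel densities used below (96 and 300 PPI) are example values chosen for illustration, not BioVinci settings.

```python
CM_PER_INCH = 2.54


def cm_to_inches(cm):
    return cm / CM_PER_INCH


def to_pixels(length_cm, ppi):
    """Number of pixels needed to cover a physical length at a given density."""
    return round(cm_to_inches(length_cm) * ppi)


print(round(cm_to_inches(8.8), 2))   # side of the smaller preset, in inches
print(round(cm_to_inches(18.3), 1))  # side of the larger preset, in inches
print(to_pixels(8.8, 96), to_pixels(8.8, 300))
```

The same 8.8 cm side needs about three times as many pixels at 300 PPI as at 96 PPI, so a plot sized in pixels changes its physical size from one screen to another.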

Users can adjust the margins by dragging the margin lines while in the Edit plot mode. This option is useful when you have long tick values that often go off the plotting area. The animation below illustrates how you can customize the margins.

To be ready for publication, a plot must meet many strict requirements: less color, bold title, sans font family, heavy line, etc. But it is time-consuming to edit every single detail.

To tackle this issue, BioVinci offers users 3 publication-ready templates that can be applied for any plot type.

This template meets the most common standards for publication. It uses patterns (stripes, dots, etc.), or shapes (round, triangle, star, plus, etc. ) instead of colors. All the fonts are sans font family. The titles and axes are bold with clear tick labels. Its configuration is also flexible for different settings, even within the same plot type.

This template has almost the same configuration as the Black and white, but it uses shades of gray instead. Each gray intensity is at least 20% different from each other so that the viewers can easily identify the groups.

This template has almost the same configuration as the Black and white but it uses a unique set of colors, which is commonly used in many online articles.

This is the default template. It may not be suitable for publication, but it works well for many other purposes. With interactive objects and zooming functions, you can share and present the results to your partners and collaborators.

Once you’re satisfied with the plot, hit the Export button. This allows you to create a portable PNG image. The Desktop Edition even offers you the vector-based graphic, including PDF, SVG, and EPS.

You can export CSV files of the data tables from analyses as well, which is called Statistic.

Some statistical and machine learning functions require you to prepare an appropriate matrix, which just includes some of the columns from your initial datasheet. In such cases, this function will help.

Here are the instructions for some common cases in data analysis. The instructions use the example data below.

Sample | Tissue | Phenotype | Conc A | Conc B | Conc C | Conc D | Conc E | Conc F |

Patient A | Liver | Benign | 2.05 | 1.59 | 1.63 | 24.73 | 3.54 | 1.66 |

Patient A | Pancreas | Benign | 2.29 | 0.92 | 1.95 | 21.21 | 8.55 | 0.91 |

Patient B | Liver | Benign | 0.84 | 1.08 | 0.13 | 20.06 | 8.68 | 1.34 |

Patient C | Kidney | Benign | 2.92 | 1.43 | 1.08 | 22.87 | 7.14 | 2.01 |

Patient C | Liver | Benign | 1.87 | 0.99 | 1.11 | 20.28 | 2.96 | 0.73 |

Patient C | Pancreas | Benign | 1.38 | 0.82 | 1.46 | 21.88 | 1.32 | 1.04 |

Patient D | Kidney | Benign | 0.43 | 1.75 | 0.06 | 21.2 | 2.93 | 2.06 |

Patient D | Pancreas | Benign | 1.06 | 0.79 | 0.23 | 24.88 | 2.66 | 0.78 |

Patient E | Kidney | Malignant | 1.06 | 1.41 | 0.38 | 20.21 | 6.52 | 0.52 |

Patient E | Liver | Malignant | 1.14 | 0.82 | 0.34 | 21.04 | 5.77 | 1.51 |

Patient F | Kidney | Malignant | 1.02 | 1.28 | 0.13 | 21.15 | 1.87 | 3.16 |

Patient G | Kidney | Malignant | 1.33 | 1.22 | 0.51 | 21.58 | 1.23 | 0.92 |

Patient G | Pancreas | Malignant | 0.58 | 0.7 | 0.41 | 24.39 | 9.37 | 1.07 |

Patient H | Kidney | Malignant | 2.56 | 1.82 | 0.89 | 20.81 | 1.91 | 0.33 |

Patient H | Liver | Malignant | 1.75 | 1.37 | 1.41 | 23.58 | 10.12 | 2.22 |

Patient H | Pancreas | Malignant | 0.3 | 0.44 | 1.23 | 21.98 | 10.83 | 0.08 |

The picture below shows how to remove the first 3 columns: Sample, Tissue, and Phenotype.

Users need to click the Create variable button on the top corner, then choose the Exclude option and list the column(s) that shouldn’t appear in the new matrix.

Listing columns one by one takes time if you want to remove many of them. Thus, BioVinci allows excluding a range of columns at once. You can define that range with the From and To fields (see how we exclude the first 3 columns in the picture below).

If you want to remove many columns that are not adjacent to each other, you can switch to the Include option and list the columns that you want to keep.

In many cases, the Exclude text columns option will be very helpful because it removes all categorical columns in your range. You can see how we keep Conc A, Conc C, Conc D, and Conc E columns in the picture below, combining Exclude and Exclude text columns options.

This field is required. It defines the name that shows up in the data tree after you create a matrix. The picture below shows where the variable named “All numbers” appears.

You can list the columns in this field. If you select Include, the new matrix will contain such columns only, while the Exclude option will remove them from your new matrix.

In case of duplicated column names, the software can still understand which columns you want to select. However, we do not recommend duplicated names because some functions may not work properly.

This option is a substitute for the Name box above; thus, you cannot use both of them at the same time. From / To defines which range of columns to keep or exclude, depending on what you select: the Include mode or the Exclude mode.
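The logic of the Include, Exclude, and Exclude text columns options can be sketched as below. This is a hypothetical re-creation for illustration: the function names are our own, the table is stored as a dictionary of columns, and a column counts as text if any of its values cannot be read as a number.

```python
def is_text(values):
    """True if at least one value cannot be converted to a number."""
    for value in values:
        try:
            float(value)
        except (TypeError, ValueError):
            return True
    return False


def create_variable(table, names, mode="exclude", exclude_text=False):
    """Keep or drop the listed columns, optionally dropping text columns too."""
    if mode not in ("include", "exclude"):
        raise ValueError("mode must be 'include' or 'exclude'")
    listed = set(names)
    kept = {}
    for column, values in table.items():
        if (column in listed) != (mode == "include"):
            continue  # not listed in Include mode, or listed in Exclude mode
        if exclude_text and is_text(values):
            continue
        kept[column] = values
    return kept


# First two rows of the example data above.
table = {
    "Sample": ["Patient A", "Patient A"],
    "Tissue": ["Liver", "Pancreas"],
    "Phenotype": ["Benign", "Benign"],
    "Conc A": [2.05, 2.29], "Conc B": [1.59, 0.92], "Conc C": [1.63, 1.95],
    "Conc D": [24.73, 21.21], "Conc E": [3.54, 8.55], "Conc F": [1.66, 0.91],
}
result = create_variable(table, ["Conc B", "Conc F"], mode="exclude", exclude_text=True)
print(list(result))
```

Combining Exclude with Exclude text columns leaves Conc A, Conc C, Conc D, and Conc E, as in the example described above.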

By default, all rows are selected. You can alter this by entering the starting and ending row on the two sides of the colon (:).

E.g., type “1:12” if you want to specify the range from Row 1 to Row 12.

If the rows you want to select are not adjacent to each other, you can enter several ranges and separate them with a comma (,).

E.g., type “1:12, 14:18, 20:20” if you want to select Rows 1 to 20, excluding Rows 13 and 19.
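The range syntax can be read as in the following sketch, which expands a specification into the list of selected row numbers. The parser is a hypothetical illustration of the syntax, not BioVinci's own code.

```python
def parse_row_ranges(spec):
    """Expand a string such as "1:12, 14:18" into a list of row numbers."""
    rows = []
    for part in spec.split(","):
        start, end = (int(x) for x in part.strip().split(":"))
        if start > end:
            raise ValueError(f"Invalid range: {part.strip()}")
        rows.extend(range(start, end + 1))  # both ends are included
    return rows


selected = parse_row_ranges("1:12, 14:18, 20:20")
print(selected)
print(13 in selected, 19 in selected)
```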

Besides these options, users can also select directly on the data preview. Each selection updates the settings on the right accordingly.

This is useful for small datasets because users can click and drag to select a range directly on the data table. This also changes the settings accordingly.

You can save a plot in a workset and load it anytime in BioVinci. The list of all your saved plots is on the right-hand side, below the home button. We call it the Gallery panel throughout this section.

To load a plot, simply click its thumbnail on the Gallery panel. If you are working on an unsaved plot, a warning message will pop up.

To save a plot, you can use the Save plot button. A thumbnail of that plot will appear in the Gallery panel.

To delete a plot, you have to load it first. Then you can use the Delete button on the top corner.

The updating function will only appear if you are editing a saved plot. In this case, its thumbnail in the Gallery panel will show a Save icon. You can click on this icon to update your plot.

When you change your data, the thumbnails of the plots affected by the changes will show a warning icon. You can choose whether to update these plots. If you don’t, please note that they will no longer match your current data (which has just been updated).

You can share a plot with your collaborators through email or social media, even if they do not have BioVinci on their machines. Reviewers can only access the interactive figure. Your raw data table and parameters are inaccessible without your authorization.

To privately share the plot with your collaborators, enter their email addresses and your message in the text box. They will receive a static image of the plot together with your message. BioVinci will also send them a URL that links to the landing page, where they can see the interactive figure and leave comments using their Facebook accounts. The picture below shows an example of a landing page.

To share on social media, click the icon(s), and you will be directed to the corresponding platform, where you can add a few words about the plot. The picture below shows an example of a plot shared via Facebook.

If you are using the Desktop Edition, all data are stored in the folder called BioTData. You can change this location in the software settings.

To create a workset, click Import your dataset on the main window/page.

You can use the sheet editor to enter or paste your data manually.

BioVinci supports CSV, TSV, XLS, and XLSX formats. If your Excel file has more than one sheet, it will create several data trees in your new workset.

This field determines how to transform your data after importing. If your data contain merged column labels that show repetition in your experiment, you have to input the number of replications in order to transform your matrix into an appropriate one for running statistical functions.

Above is an example of a dataset with 3 replications. In this case, you need to input “3” into the Number of replications field upon importing your data. BioVinci will transform your data into a matrix as shown on the left side of the figure below.

If you don’t have merged labels in your data, you don’t have to worry about this option. By default, the replication is set to 0, which means no data transformation afterwards. You can find more information about this at our blog.

You can delete your workset if you are in the main window/page. Please consider carefully because this action is irreversible.

BioVinci can suggest appropriate statistical function(s) based on the data and the plot type that you are using. These suggestions are located in the top right corner. You can hover the mouse over the thumbnail to see the suggested functions and click to quickly apply them to the current input data.

Sometimes you need to use more than one table at a time; for example, a metadata table for the PCA and t-SNE functions, or a second data table for the Sparse canonical correlation. To add another table, use the Add data button on the menu bar at the top of the software.

The window for adding data is similar to the importing window. You can input your data manually or upload the entire file. All options such as First row as headers and Number of replications work exactly the same as described in the Manage Workset section.

You can switch among all the datasets that you have added. Click the gray button in the top left corner to show a drop-down list of all your sheets.

There are 3 types of columns in BioVinci: numerical, categorical, and unique ID. You can change the type of a column by clicking on its type icon to open a drop-down list.

When you change a column type into numerical, BioVinci will try to convert the content to numbers. If the software fails to convert a value, it will put a null value in its place.
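The conversion rule described above can be sketched in a few lines of Python. This only illustrates the behaviour and is not BioVinci's implementation; the function name `to_numerical` and the use of `None` as the null value are assumptions.

```python
from typing import List, Optional


def to_numerical(values: List[str]) -> List[Optional[float]]:
    """Convert each cell to a number; use None where conversion fails."""
    converted: List[Optional[float]] = []
    for value in values:
        try:
            converted.append(float(value.strip()))
        except ValueError:
            # Non-numeric or empty content becomes a null value (assumed: None).
            converted.append(None)
    return converted


column = ["1", "0.5", "-2.1", "n/a", ""]
print(to_numerical(column))  # [1.0, 0.5, -2.1, None, None]
```

After such a conversion it is worth checking how many null values a column contains, because cells that could not be converted carry no value into the plot or the analysis.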

If you change a column type into categorical, the software will leave the content as is. Numbers, e.g. 1, 0.5, or -2.1, will become strings, e.g. “1”, “0.5”, or “-2.1”.

The categorical column is very important because it classifies the values of the numerical column into groups. Check the labels in categorical columns carefully, since the classification is case-sensitive. For instance, BioVinci identifies “Benign” and “benign” as two different labels, so the resulting plot will show them as 2 separate groups.
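The effect of case-sensitive labels can be reproduced with a short Python sketch. The sample values and the helper names `group_by_label` and `case_variants` are made up for illustration and are not part of BioVinci. The second helper shows one possible way to spot labels that differ only in letter case before importing a table.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(labels: List[str], values: List[float]) -> Dict[str, List[float]]:
    """Group numerical values by their categorical label (case-sensitive)."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)
    return dict(groups)


def case_variants(labels: List[str]) -> Dict[str, List[str]]:
    """Report labels that differ only in letter case (hypothetical pre-check)."""
    seen = defaultdict(set)
    for label in labels:
        seen[label.lower()].add(label)
    return {key: sorted(forms) for key, forms in seen.items() if len(forms) > 1}


labels = ["Benign", "benign", "Malignant", "Benign"]
values = [1.2, 0.8, 3.4, 1.0]

print(group_by_label(labels, values))
# {'Benign': [1.2, 1.0], 'benign': [0.8], 'Malignant': [3.4]}
print(case_variants(labels))
# {'benign': ['Benign', 'benign']}
```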

If a column type is unique ID, the software will leave the content as is, as it does for categorical columns. This type is automatically assigned to a column in which all values are different from each other (the comparison is also case-sensitive).
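As a rough illustration of this rule, the following Python sketch checks whether all values in a column are distinct. The function name `is_unique_id` and the treatment of an empty column are assumptions, not BioVinci behaviour confirmed by this manual.

```python
from typing import List


def is_unique_id(values: List[str]) -> bool:
    """Return True when no value repeats; the comparison is case-sensitive."""
    # Assumption: an empty column is not treated as a unique ID column.
    return len(values) > 0 and len(set(values)) == len(values)


print(is_unique_id(["S1", "S2", "s1"]))  # True: "S1" and "s1" are different values
print(is_unique_id(["S1", "S2", "S1"]))  # False: "S1" appears twice
```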

All plot types and functions ignore Unique ID columns, except for the tooltips in scatter plots. This prevents users from accidentally selecting such columns to label the observations, which would create too many groups and harm the performance of the software.

Users can double-click the column names on the data tree to edit them. However, the plot will not automatically respond to this change. To update the column name on the plot, you need to drag the renamed column to the placeholder again.

You can also use the Edit data function to edit the column name(s) (please refer to the next section on Edit data).

To edit your data, click the Edit data button in the upper left corner of the software. This function lets you rename columns, change the data, and add or remove columns and rows.

If you upload your data with the number of replications set to 0, the editing interface will let you edit your data directly. The first row shows the column names and cannot be empty. Whenever you delete a column name, the software will generate a new column name automatically.

You can right-click and choose among the options available: insert row/column or remove row/column.

If the number of replications is larger than 1, the editing interface will show two matrices: your original one and the transformed one.

Data with replications are processed differently from data without replications. The transformed matrix on the left shows how BioVinci reads your inputs. You can edit the original data on the right in the same way as data without replications and see the transformed data change accordingly. Please note that the transformed data are not editable, except for the column names.

You can quickly change the plot type by using the plot menu bar at the bottom. If there is no input available (all the placeholders are empty), clicking any icon in that menu bar only changes the placeholder panel.

If there are input(s) in the placeholders, the software will activate the plot types that are valid for the current input(s). The validity of a plot type depends on the types of the input columns (numerical or categorical). In the picture below, the icons are colored red, blue, and gray for the current, valid, and invalid plot type(s), respectively.

For example, if you have a grouped bar chart, which has two categorical columns and one numerical column, you can quickly switch the current plot type to a bar chart, violin, scatter, or line chart with this function.

This is the list of all the R packages that BioVinci uses for the calculations related to statistics and machine learning. We also include the names of the functions and the versions of the packages.

| Package | Version | Function |
|---|---|---|
| plotly | 4.7.1 | All basic plots |
| stats | 3.4.2 | All statistical functions; All machine learning functions |
| ggplot2 | 2.2.1 | Kernel density curve |
| randomForest | 4.6-12 | Random Forest |
| Rtsne | 0.13 | t-SNE |
| agricolae | 1.2-8 | Post-hoc test in multiple comparison |
| heatmaply | 0.14.2 | Hierarchical clustering |
| tseries | 0.10-42 | Jarque-Bera normality test |
| nortest | 1.0-4 | Pearson chi-squared normality test |
| deldir | 0.1-14 | Voronoi diagram |
| car | 2.1-5 | Levene test |
| Barnard | 1.8 | Barnard’s test |
| PMA | 1.0.9 | Sparse canonical correlation analysis |