class: title-slide, inverse, left, middle
background-image: url("UCB-theme/UCB-Cover.jpg")
background-size: cover

<div class="pull-left" style="width:40%; color:white;">
<h3 style="color:white;">Managing and leveraging knowledge catalogs with TKCat</h3>
<a href="https://github.com/patzaw/" target="_blank">Patrice Godard</a> | <a href="https://user2022.r-project.org/" target="_blank">useR!2022</a> | 23 Jun 2022
<span role="img" aria-label="Slide background shows tagline: UCB: Inspired by patients. Driven by science."></span>
</div>

???

My name is Patrice Godard. I am a biochemist by training but I turned to bioinformatics during my PhD thesis. I have worked in the biotech and biopharmaceutical industry for 12 years and my main area of expertise is omics data analysis.

I'm going to present the *TKCat* package, which has been developed to manage and leverage, in R, knowledge that has been extracted from external resources or generated from internal research projects. To this end, I will discuss specific examples related to the role and activities of my team at UCB.
<!-----------------------------------------------------------------------------> <style> .title-slide a { color:white; text-decoration:underline; } .title-slide a:hover { background:white; color:black; } .remark-notes-area .remark-bottom-area .remark-notes-current-area { height:100%; } .remark-notes-preview-area { display:none; } .left-column { width:25%; } .right-column { width:74%; } .comp-title > h2 { margin-bottom:2px; } .code-marg-0 pre { margin:0px; } .code-marg-3 pre { margin:3px; } .code-marg-5 pre { margin:5px; } .full-left > .pull-left { width:100% } .l64 > .pull-left { width:61%; } .l64 > .pull-right { width:37%; } .r70 > .pull-left { width:30%; } .r70 > .pull-right { width:67%; } .confReport > * { margin:0px; } .confReport > h1 { font-size:14px; } .confReport > h2 { font-size:12px; } .confReport > h3 { font-size:10px; } .confReport > p { font-size:8px; } .confReport > ul { font-size:8px; } </style> <!-----------------------------------------------------------------------------> --- layout: true background-image: url("UCB-theme/UCB-logo-foot.png"), url("UCB-theme/UCB-top-left.png"), url("UCB-theme/UCB-bottom-right-grey.png"), url("media/user2022-logo.png") background-position: 3% 97%, 3% 3%, 98% 90%, 98% 3% background-size: 15%, 6%, 6%, 6% <!-----------------------------------------------------------------------------> --- ## Translational Bioinformatics at UCB ??? **UCB** is a biopharmaceutical company focused on creating value for people living with severe diseases, mainly in immunology and neurology. 
In this context, the translational bioinformatics team participates in **3 main research missions**:

***

- The first one is about getting a better understanding of *disease mechanisms* and how they can be experimentally studied

***

- The second mission is to identify new relevant *therapeutic approaches*

***

- And the third one is to identify *therapeutic opportunities* in patient populations

The examples I will show today are related to these activities, but the *TKCat* package is also applicable to many other knowledge areas.

--

<img src="media/TBN-UCB-1.png" style="position: absolute; top:150px; left:25%; height:450px;" alt="Understanding disease at the cellular and molecular level to identify relevant therapeutic approaches for patient populations">

--

<img src="media/TBN-UCB-2.png" style="position: absolute; top:150px; left:25%; height:450px;" alt="Understanding disease at the cellular and molecular level to identify relevant therapeutic approaches for patient populations">

--

<img src="media/TBN-UCB-3.png" style="position: absolute; top:150px; left:25%; height:450px;" alt="Understanding disease at the cellular and molecular level to identify relevant therapeutic approaches for patient populations">

<!----------------------------------------------------------------------------->
---

## From data to wisdom

<img src="media/Data-to-Wisdom-part.png" style="position: absolute; top:200px; left:80px; width:823px;" alt="Relationships between data (empty dots), information (colored dots), knowledge (connected colored dots), insight (dots of interest) and wisdom (path between dots of interest)" />

<p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;">
<strong>Credit</strong>: from Twitter (unknown author)
</p>

???

Now, let's clarify what I mean by knowledge, using this image I've seen floating around on social networks.
It depicts quite well, among other things, the different kinds of assets used or produced in the context of data analysis projects.

On one hand, original **data** is often the output of devices, such as an image, a signal or a DNA sequence. **Information** is interpreted data, such as a genetic variant observed in an individual compared to a reference, or the level of expression of one gene in one sample. And **knowledge** can be seen as consistent information which has been combined with relationships identified between the different elements.

On the other hand, **insights** correspond to key elements of the knowledge which are of particular importance in a given context. For example, the understanding of a molecular mechanism involved in a condition of interest. And finally, **wisdom** can be seen as relevant paths for leveraging knowledge insights, such as the identification of a new therapeutic approach targeting a relevant molecular pathway.

***

Unfortunately, wisdom is not a given, whatever the quality of the upstream assets, and we need to be careful not to mistake our fantasies for it.

***

Anyway, in this presentation, I'm not going to address this point, and I'll focus *wisely* on the knowledge itself: how it can be structured, documented and efficiently leveraged.

--

<img src="media/Data-to-Wisdom-full.png" style="position: absolute; top:200px; left:80px; width:999px;" alt="Adding conspiracy theory to the figure (dots connected for drawing a unicorn)"/>

--

<svg style="position: absolute; top:198px; left:410px; width:170px; height:260px">
<rect width="170px" height="260px" style="fill:transparent;stroke-width:10;stroke:orange;"></rect>
Rectangle highlighting the knowledge related drawing
</svg>

<!----------------------------------------------------------------------------->
---

## Expected features of the knowledge to manage

???

More precisely, what are the expected features of the knowledge we propose to manage with *TKCat*?
*** First, we deal with many different concepts. In our case it can be: disease, phenotypes, genes, proteins, tissues, organs, cells, organelles, molecular pathways, drugs, gene expression, protein abundance and so on. And these concepts are related to each other in a way which highly depends on the context. That's why documenting the knowledge with a data model is very valuable and the *ReDaMoR* R package is dedicated to this task. *** We also make the hypothesis that most of the data underlying the knowledge can be organized in tables or in matrices. We want to use them in R, of course. But, we also want them to be accessible and reusable in other environments as much as possible. That's why structured folders and text files have been chosen to archive the knowledge and also as one solution to exchange it, especially with external collaborators. *** Developing a system integrating all pieces of knowledge is a very difficult task and highly risky. That's why we prefer to keep the different pieces of knowledge independent but still ready for integration. Indeed, the different pieces of knowledge often refer to similar concepts, like genes or diseases in our case. And those concepts can be used to build bridges across information areas. But these concepts are also often implemented using different scopes and references. For example, some of the resources we manipulate refer to proteins whereas others refer to their coding genes. That's why we have developed dictionaries of concepts which are used to combine and integrate information when needed: *BED* focuses on biological entities, such as genes and proteins, whereas *DODO* focuses on conditions like diseases and phenotypes. *** Data supporting the knowledge can be quite large depending on the scope. And we want, depending on the context, either to use the corresponding tables as a whole or by subsets. Also, the knowledge information is not supposed to be updated very frequently. 
But it can still be valuable to keep track of the different versions of this knowledge which, by nature, tends to evolve. Finally, some data are more sensitive than others, or come with license restrictions that limit their use to a subset of users. All these different features led to the choice of the *ClickHouse* Database Management System for sharing the knowledge internally.

--

- Diverse concepts
- Connected concepts
<li style="visibility:hidden;"></li>

<img src="media/ReDaMoR.png" style="position: absolute; top:130px; left:900px; height:90px;" alt="Logo of the ReDaMoR R package">

--

- Tabular data
- To use in R and beyond
<li style="visibility:hidden;"></li>

<img src="media/folder.png" style="position: absolute; top:230px; left:800px; height:80px;" alt="Logo of a folder">
<img src="media/csv.png" style="position: absolute; top:235px; left:900px; height:70px;" alt="Logo of a csv file">
<img src="media/json.jpg" style="position: absolute; top:235px; left:1000px; height:70px;" alt="Logo of json file">

--

- Non-monolithic knowledge: management of independent pieces
- Ready for integration on the basis of shared concepts
- Diverse implementation of shared concepts
<li style="visibility:hidden;"></li>

<img src="media/BED.png" style="position: absolute; top:345px; left:850px; height:90px;" alt="BED dictionary icon">
<img src="media/DODO.png" style="position: absolute; top:345px; left:950px; height:90px;" alt="DODO dictionary icon">

<p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;">
<a href="https://f1000research.com/articles/7-195" target="_blank">
BED: a Biological Entity Dictionary
</a>
|
<a href="https://f1000research.com/articles/9-942" target="_blank">
DODO: Dictionary Of Disease Ontologies
</a>
<span style="visibility:hidden;"> | </span>
<a href="https://clickhouse.com/" target="_blank" style="visibility:hidden;">
https://clickhouse.com/
</a>
</p>

--

- Potentially billions of records
- Tables to be used
entirely or by subset - No frequent updates but to be versioned - Potential restriction of use <img src="media/ClickHouse.png" style="position: absolute; top:485px; left:865px; height:90px;" alt="Logo of the ClickHouse DBMS"> <p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;"> <a href="https://f1000research.com/articles/7-195" target="_blank"> BED: a Biological Entity Dictionary </a> | <a href="https://f1000research.com/articles/9-942" target="_blank"> DODO: Dictionary Of Disease Ontologies </a> <span> | </span> <a href="https://clickhouse.com/" target="_blank"> https://clickhouse.com/ </a> </p> <!-----------------------------------------------------------------------------> --- ## MDB: a Modeled Database for each knowledge resource <p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;"> <a href="https://github.com/patzaw/ReDaMoR"> https://github.com/patzaw/ReDaMoR </a> | <a href="https://github.com/patzaw/TKCat"> https://github.com/patzaw/TKCat </a> </p> <img src="media/MDB.png" style="position: absolute; top:200px; left:850px; width:250px;" alt="MDB icon"> #### Features ??? Now, I would like to spend time on a key and central type of object in *TKCat* called **MDB**. *MDB* stands for Modeled Database and it is used to organize a specific knowledge resource relying on the following organization. *** First, an *MDB* gathers the **data** themselves, which can be tables or matrices. *** Those data are documented by a **data model** produced using *ReDaMoR*. *** **General information** or metadata are also added to the object: they provide a title, a short description and some references. *** Finally, some tables are annotated as **collection members**. It means that those tables refer to key concepts that can be used to build bridges across different knowledge resources. 
*** *TKCat* supports different **implementations** of *MDBs* which differ in the way the data are made accessible: either in memory, in files or in the ClickHouse DBMS. *** Finally, *TKCat* stands for **Tailored Knowledge Catalog** And, indeed, it is an R package providing a set of tools for managing and using knowledge resources made available in *MDBs*. Now, I'm going to exemplify how to build and manipulate an *MDB*. -- - **Data**: tables and matrices <img src="media/MDB-t.png" style="position: absolute; top:200px; left:850px; width:250px;" alt="MDB icon"> -- - **Data model**: Formal description of the data <img src="media/ReDaMoR.png" style="position: absolute; top:205px; left:580px; width:90px;" alt="Logo of the ReDaMoR R package"> <img src="media/MDB-m.png" style="position: absolute; top:200px; left:850px; width:250px;" alt="MDB icon"> -- - **Description**: general information about the data <img src="media/MDB-l.png" style="position: absolute; top:200px; left:850px; width:250px;" alt="MDB icon"> -- - **Collections**: tables referring to key concepts <img src="media/MDB-c.png" style="position: absolute; top:200px; left:850px; width:250px;" alt="MDB icon"> <img src="media/MDBs-col.png" style="position: absolute; top:380px; left:750px; width:250px;" alt="Two MDBs with collections"> -- - **Implementation**: in memory, in files, in ClickHouse <img src="media/MDB.png" style="position: absolute; top:200px; left:850px; width:250px;" alt="MDB icon"> <img src="media/hide-mdbs-col.png" style="position: absolute; top:380px; left:749px; width:255px;" alt="White rectangle"> -- #### TKCat - **Tailored Knowledge Catalog**: a package for managing and using MDBs <img src="media/TKCat.png" style="position: absolute; top:450px; left:890px; width:90px;" alt="Logo of the TKCat R package"> <!-----------------------------------------------------------------------------> --- ## Drafting a data model in R with ReDaMoR <p style="position: absolute; bottom:3%; right:10%; margin:0; 
font-size:small; font-style:italic;">
<a href="https://doi.org/10.1093/nar/gky1105" target="_blank">
HPO: <strong>Köhler et al. (2019)</strong>
</a>
|
<a href="https://patzaw.github.io/ReDaMoR/ReDaMoR.html#41_Drafting_a_data_model_from_data_frames" target="_blank">
<strong>ReDaMoR user guide</strong> > Drafting a data model
</a>
</p>

```r
library(readr)
hpo_data_dir <- system.file("examples/HPO-subset", package="ReDaMoR")
*HPO_hp <- read_tsv(file.path(hpo_data_dir, "HPO_hp.txt"))
*HPO_diseases <- read_tsv(file.path(hpo_data_dir, "HPO_diseases.txt"))
*HPO_diseaseHP <- read_tsv(file.path(hpo_data_dir, "HPO_diseaseHP.txt"))
```

???

For that purpose, I'm going to use some data made available within the *ReDaMoR* package. For this simple example, I'm going to use 3 tables describing the phenotypes associated with different human diseases. This information comes from the **Human Phenotype Ontology project** or *HPO*.

As a reminder, a phenotype is an observable trait of an individual, such as height, eye color or blood type. For example, the occurrence of seizures is a strong phenotype of patients suffering from epilepsy.

***

Using the *ReDaMoR* package, we can draft a very simple data model of these three tables. Each rectangle represents a table and the bullet points represent the fields in each of them. So far, except for the data type, there is no constraint associated with the fields and no relationship identified between the tables.
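To give a rough idea of what such a draft records, here is a minimal base-R sketch. The toy table and the `draft_fields()` helper are hypothetical and are not part of *ReDaMoR*; the sketch only lists field names and their R types, which is the information a draft without constraints or relationships carries.

```r
# Hypothetical toy table standing in for one of the HPO extracts
HPO_hp_toy <- data.frame(
   id = c("hp1", "hp2"),
   name = c("phenotype 1", "phenotype 2"),
   level = c(3, 5),
   stringsAsFactors = FALSE
)

# draft_fields() is a hypothetical helper (not a ReDaMoR function):
# it returns one row per column with the column name and its R class
draft_fields <- function(df) {
   data.frame(
      field = names(df),
      type = unname(vapply(df, function(x) class(x)[1], character(1))),
      stringsAsFactors = FALSE
   )
}

draft_fields(HPO_hp_toy)
```

Note that a value such as `level` is typed as *numeric* here even though it only holds whole numbers: this is the kind of detail that is refined once the model is edited.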
--

```r
library(ReDaMoR)
*hpo_model <- df_to_model(HPO_hp, HPO_diseases, HPO_diseaseHP)
plot(hpo_model)
```

<iframe src="TKCat-useR2022-Patrice-Godard_files/htmlwidgets_plots/unnamed-chunk-9.html" style="height:200px; width:100%; border-style:none; background-color:transparent;"></iframe>

<!----------------------------------------------------------------------------->
---

## Creating a data model in R with ReDaMoR

<p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;">
<a href="https://pgodard.shinyapps.io/ReDaMoR/" target="_blank">
Try the GUI in shinyapps.io
</a>
|
<a href="https://patzaw.github.io/ReDaMoR/ReDaMoR.html#3_Creating_and_modifying_relational_data_using_the_graphical_user_interface" target="_blank">
<strong>ReDaMoR user guide</strong> > Creating a data model
</a>
</p>

```r
hpo_model <- model_relational_data(hpo_model)
```

<img src="media/model_relational_data.png" style="position:absolute; width:1000px; top:200px; left:64px; z-index:1; border:solid 3px black" alt="The model_relational_data() graphical user interface">
<img src="media/model_relational_data-top.png" style="width:1000px; position:absolute; top:203px; left:67px; z-index:3;" alt="The model_relational_data() graphical user interface">
<iframe src="TKCat-useR2022-Patrice-Godard_files/htmlwidgets_plots/unnamed-chunk-11.html" style="position:absolute; top:250px; left:80px; height:355px; width:550px; border-style:none; background-color:transparent; z-index:2;"></iframe>

???

To add such information, we can use the `model_relational_data()` function, which launches a graphical user interface developed for creating and manipulating relational data models.

In this case, for example, we have made all the fields of all tables non-nullable, except the *description* field of the *hp* table, which remains in brackets in this graphical representation. The fields in bold correspond to the primary keys of the tables.
We have also changed the type of a few fields and added relevant relationships between the tables. To summarize, the phenotypes are described in the *hp* table whereas the diseases are described in the *disease* table. The *diseaseHP* table is an association table between diseases and phenotypes, each disease potentially presenting several phenotypes, and each phenotype being potentially presented by several diseases, as indicated by the cardinalities associated to the relationships. <!-----------------------------------------------------------------------------> --- ## Confronting data to the model <p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;"> <a href="https://patzaw.github.io/ReDaMoR/ReDaMoR.html#4_Confronting_data" target="_blank"> <strong>ReDaMoR user guide</strong> > Confronting data </a> </p> <div class="l64 code-marg-0"> .pull-left[ ```r confront_data(hpo_model, data=list( "HPO_hp"=HPO_hp, "HPO_diseaseHP"=HPO_diseaseHP, "HPO_diseases"=HPO_diseases )) ``` <p style="text-align:center;"> <img src="media/hpo_model-3tables.png" style="width:500px;" alt="HPO model limited to 3 tables"> </p> ] .pull-right[ <div class="confReport" style="height:240px; overflow: auto; background-color:#FFFACD; border-top:5px solid black; border-bottom:5px solid black;"> <h1>Confrontation with original data</h1> <p><span style="background-color:red; color:white; padding:2px;">FAILURE</span></p> <h2>Check configuration</h2> <ul> <li><strong>Optional checks</strong>: unique, not nullable, foreign keys</li> <li><strong>Maximum number of records</strong>: Inf</li> </ul> <h2>HPO_hp</h2> <p><span style="background-color:red; color:white; padding:2px;">FAILURE</span></p> <h3>Field issues or warnings</h3> <ul> <li>description: <span style="background-color:green; color:black; padding:2px;">SUCCESS</span> <span style="background-color:#FFBB33; color:white; padding:2px;">Missing values 117/500 = 23%</span></li> <li>level: <span 
style="background-color:red; color:white; padding:2px;">FAILURE</span> <span style="background-color:#FFBB33; color:white; padding:2px;">Unexpected “numeric”</span></li> </ul> <h2>HPO_diseases</h2> <p><span style="background-color:red; color:white; padding:2px;">FAILURE</span></p> <h3>Field issues or warnings</h3> <ul> <li>id: <span style="background-color:red; color:white; padding:2px;">FAILURE</span> <span style="background-color:#FFBB33; color:white; padding:2px;">Unexpected “numeric”</span></li> </ul> <h2>HPO_diseaseHP</h2> <p><span style="background-color:red; color:white; padding:2px;">FAILURE</span></p> <h3>Field issues or warnings</h3> <ul> <li>id: <span style="background-color:red; color:white; padding:2px;">FAILURE</span> <span style="background-color:#FFBB33; color:white; padding:2px;">Unexpected “numeric”</span></li> </ul> </div> ] .pull-left[ ```r HPO_hp <- mutate(HPO_hp, level=`as.integer(level)`) HPO_diseases <- mutate(HPO_diseases, id=`as.character(id)`) HPO_diseaseHP <- mutate(HPO_diseaseHP, id=`as.character(id)`) confront_data(hpo_model, data=list( "HPO_hp"=HPO_hp, "HPO_diseaseHP"=HPO_diseaseHP, "HPO_diseases"=HPO_diseases )) ``` ] .pull-right[ <div class="confReport" style="height:150px; overflow: auto; background-color:#FFFACD; border-top:5px solid black; border-bottom:5px solid black; margin-top:20px;"> <h1>Confrontation with corrected data</h1> <p><span style="background-color:green; color:black; padding:2px;">SUCCESS</span></p> <h2>Check configuration</h2> <ul> <li><strong>Optional checks</strong>: unique, not nullable, foreign keys</li> <li><strong>Maximum number of records</strong>: Inf</li> </ul> <h2>HPO_hp</h2> <p><span style="background-color:green; color:black; padding:2px;">SUCCESS</span></p> <h3>Field issues or warnings</h3> <ul> <li>description: <span style="background-color:green; color:black; padding:2px;">SUCCESS</span> <span style="background-color:#FFBB33; color:white; padding:2px;">Missing values 117/500 = 23%</span></li> </ul> 
</div>
]
</div>

???

Once the data model is created, it can be confronted with the data using the `confront_data()` function. This function returns a report that helps to efficiently correct the data or the model when needed.

<!----------------------------------------------------------------------------->
---
class: comp-title

## Creating and using an MDB with TKCat

<p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;">
<a href="https://patzaw.github.io/TKCat/TKCat-User-guide.html#2_Create_an_MDB:_a_minimal_example" target="_blank">
<strong>TKCat user guide</strong> > Create an MDB
</a>
|
<a href="https://patzaw.github.io/TKCat/TKCat-User-guide.html#3_Leveraging_MDB" target="_blank">
<strong>TKCat user guide</strong> > Leveraging MDB
</a>
</p>

.pull-left[

#### MDB creation

```r
library(TKCat)
hpo <- `memoMDB`(
   `dataTables`=list(
      "HPO_hp"=HPO_hp,
      "HPO_diseases"=HPO_diseases,
      "HPO_diseaseHP"=HPO_diseaseHP
   ),
   `dataModel`=hpo_model,
   `dbInfo`=list(
      name="miniHPO",
      title="Very small extract of the human phenotype ontology",
      description="For demonstrating ReDaMoR and TKCat capabilities...",
      url="https://hpo.jax.org/app/",
      version="0.1",
      maintainer="Patrice Godard <patrice.godard@gmail.com>"
   )
)
```

]

???

The confrontation of the data model with the data also occurs when creating an *MDB*, which is achieved as shown on the left. The creation of an *MDB* follows the main features of this type of object that I've described before. It takes the data tables, the data model, and some general information.

***

The content of an *MDB* can then be easily explored and retrieved, as exemplified on the right. Intuitively, the `db_info()` function returns general information, and the `data_model()` function returns the data model. The `select()` function is used to focus the *MDB* on a few tables and the `pull()` function to extract a specific table.
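The confrontation that takes place when an *MDB* is created can be pictured as a handful of checks on each field: expected type, missing values, uniqueness and foreign keys. The following base-R sketch illustrates them on hypothetical toy tables; the `check_*` helpers are made up for this illustration and are not functions of *ReDaMoR* or *TKCat*, whose reports are richer.

```r
# Hypothetical toy tables; names and values are illustrative only
hp <- data.frame(
   id = c("hp1", "hp2", "hp3"),
   description = c("eye anomaly", NA, "short stature"),
   stringsAsFactors = FALSE
)
diseaseHP <- data.frame(
   disease = c("d1", "d1", "d2"),
   hp = c("hp1", "hp3", "hp4"),   # "hp4" is not described in hp
   stringsAsFactors = FALSE
)

# Hypothetical helpers, one per kind of check
check_type <- function(x, type) inherits(x, type)   # expected R class
check_not_null <- function(x) !anyNA(x)             # no missing value
check_unique <- function(x) !anyDuplicated(x)       # usable as primary key
check_fk <- function(from, to) all(from %in% to)    # foreign key resolves

checks <- c(
   id_is_character = check_type(hp$id, "character"),
   id_not_null = check_not_null(hp$id),
   id_unique = check_unique(hp$id),
   description_not_null = check_not_null(hp$description),
   hp_foreign_key = check_fk(diseaseHP$hp, hp$id)
)
checks
```

In this toy case the last two checks fail: one description is missing and one association points to a phenotype that is not described. Either the data or the model (for instance, making *description* nullable) would then have to be corrected.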
--

.pull-right[

#### Explore and retrieve information

```r
db_info(hpo)
data_model(hpo)
```

```r
hpo %>% select(HPO_diseases, HPO_diseaseHP)
hpo %>% pull(HPO_diseases) %>% head(3)
```

```
## # A tibble: 3 × 3
##   db       id    label
##   <chr>    <chr> <chr>
## 1 DECIPHER 15    NF1-microdeletion syndrome
## 2 DECIPHER 45    Xq28 (MECP2) duplication
## 3 DECIPHER 65    ATR-16 syndrome
```

]

<!----------------------------------------------------------------------------->
---
class: code-marg-5

## Leverage the MDB data model: `filter`

<p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;">
<a href="https://patzaw.github.io/TKCat/TKCat-User-guide.html#35_Filtering_and_joining" target="_blank">
<strong>TKCat user guide</strong> > Filtering and joining
</a>
</p>

.pull-left[

```r
dims(hpo) %>% select(name, nrow)
```

```
## # A tibble: 3 × 2
##   name           nrow
##   <chr>         <int>
## 1 HPO_hp          500
## 2 HPO_diseases   1903
## 3 HPO_diseaseHP  2594
```

]

.pull-right[

```r
data_model(hpo) %>% plot()
```

<iframe src="TKCat-useR2022-Patrice-Godard_files/htmlwidgets_plots/unnamed-chunk-26.html" style="height:120px; width:100%; border-style:none; background-color:transparent;"></iframe>

]

<p style="margin:5px; visibility:hidden;">_</p>

```r
fhpo <- hpo %>% `filter(HPO_hp=stringr::str_detect(description, "eye"))`
```

???

More interestingly, the data model can be used to filter the data **transitively** through all tables. On the right you have the original data model of the *MDB* we have just constructed. And on the left you have the number of rows in each table. The aim of the highlighted command is to filter the *hp* table to keep rows with the word "eye" found in the description field. The idea is to get all the phenotypes regarding the *eyes*.

***

After applying this filter, the data model did not change.
However, on the bottom left you can see that the number of rows of the *hp* table decreased, and so did the number of rows of the two other tables, which were filtered accordingly.

--

.pull-left[

```r
fhpo %>% dims() %>% select(name, nrow)
```

```
## # A tibble: 3 × 2
##   name           nrow
##   <chr>         <int>
## 1 HPO_hp           21
## 2 HPO_diseaseHP   109
## 3 HPO_diseases     99
```

]

.pull-right[

```r
data_model(fhpo) %>% plot()
```

<iframe src="TKCat-useR2022-Patrice-Godard_files/htmlwidgets_plots/unnamed-chunk-32.html" style="height:120px; width:100%; border-style:none; background-color:transparent;"></iframe>

]

<!----------------------------------------------------------------------------->
---
class: code-marg-5

## Leverage the MDB data model: `join`

<p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;">
<a href="https://patzaw.github.io/TKCat/TKCat-User-guide.html#35_Filtering_and_joining" target="_blank">
<strong>TKCat user guide</strong> > Filtering and joining
</a>
</p>

.pull-left[

```r
dims(fhpo) %>% select(name, nrow)
```

```
## # A tibble: 3 × 2
##   name           nrow
##   <chr>         <int>
## 1 HPO_hp           21
## 2 HPO_diseaseHP   109
## 3 HPO_diseases     99
```

]

.pull-right[

```r
data_model(fhpo) %>% plot()
```

<iframe src="TKCat-useR2022-Patrice-Godard_files/htmlwidgets_plots/unnamed-chunk-35.html" style="height:120px; width:100%; border-style:none; background-color:transparent;"></iframe>

]

<p style="margin:5px; visibility:hidden;">_</p>

```r
jhpo <- fhpo %>% `join_mdb_tables(c("HPO_hp", "HPO_diseaseHP", "HPO_diseases"))`
```

???

The data model can also be used to automatically join tables of interest. For example, after having focused the HPO *MDB* on eye-related phenotypes, it can be useful to put diseases directly next to their corresponding phenotypes. This is achieved by the highlighted command.

***

This command alters the data model, as shown at the bottom right, while keeping all the records, as you can see at the bottom left.
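The idea behind the transitive filter and the join can be sketched in base R on hypothetical toy tables. This is only an illustration of the principle (following foreign keys from table to table); it is not how *TKCat* implements `filter()` or `join_mdb_tables()`, and all table contents below are made up.

```r
# Hypothetical toy tables mimicking the structure of the HPO example
hp <- data.frame(
   id = c("hp1", "hp2", "hp3"),
   description = c("abnormality of the eye", "short stature",
                   "eye movement anomaly"),
   stringsAsFactors = FALSE
)
diseases <- data.frame(
   id = c("d1", "d2", "d3"),
   label = c("disease 1", "disease 2", "disease 3"),
   stringsAsFactors = FALSE
)
diseaseHP <- data.frame(
   disease = c("d1", "d1", "d2", "d3"),
   hp = c("hp1", "hp2", "hp3", "hp2"),
   stringsAsFactors = FALSE
)

# 1. Filter the phenotype table on its description
f_hp <- hp[grepl("eye", hp$description), ]

# 2. Propagate the filter along the foreign keys:
#    keep associations to remaining phenotypes, then the related diseases
f_diseaseHP <- diseaseHP[diseaseHP$hp %in% f_hp$id, ]
f_diseases <- diseases[diseases$id %in% f_diseaseHP$disease, ]

# 3. Join the three filtered tables into a single one
joined <- merge(
   merge(f_hp, f_diseaseHP, by.x = "id", by.y = "hp"),
   f_diseases,
   by.x = "disease", by.y = "id"
)
joined
```

As on the slide, the joined table has as many rows as the filtered association table: one row per remaining disease-phenotype pair.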
--

.pull-left[

```r
jhpo %>% dims() %>% select(name, nrow)
```

```
## # A tibble: 1 × 2
##   name    nrow
##   <chr>  <int>
## 1 HPO_hp   109
```

]

.pull-right[

```r
data_model(jhpo) %>% plot()
```

<iframe src="TKCat-useR2022-Patrice-Godard_files/htmlwidgets_plots/unnamed-chunk-41.html" style="height:120px; width:100%; border-style:none; background-color:transparent;"></iframe>

]

<!----------------------------------------------------------------------------->
---

## MDB implementations

<p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;">
<a href="https://patzaw.github.io/TKCat/TKCat-User-guide.html#26_Writing_an_MDB_in_files" target="_blank">
<strong>TKCat user guide</strong> > MDB in files
</a>
|
<a href="https://patzaw.github.io/TKCat/TKCat-User-guide.html#32_MDB_implementations" target="_blank">
<strong>TKCat user guide</strong> > MDB implementations
</a>
</p>

<div style="height:25px; visibility:hidden;">a</div>

<img src="media/double-bent-arrow-l.svg" style="position: absolute; top:150px; left:80px; width:1000px; height:75px" alt="Double arrow">
<img src="media/double-bent-arrow-s.svg" style="position: absolute; top:165px; left:300px; width:175px; height:55px" alt="Double arrow">
<img src="media/double-bent-arrow-s.svg" style="position: absolute; top:165px; right:400px; width:175px; height:55px" alt="Double arrow">

<div class="full-left" style="float:left; width:350px">
.pull-left[
#### In memory (`as_memoMDB()`)

- All the data loaded in R memory
<br>
- Fast but greedy
<br>
- Convenient for using whole tables
]
</div>
<div class="full-left" style="float:left; width:350px; margin-left:16px;">
.pull-left[
#### In files (`as_fileMDB()`)

- Data in files until requested (`pull()`, `filter()`, ...)
- Not convenient for subsetting (slow)
- Convenient for archiving and sharing
]
</div>
<div class="full-left" style="float:right; width:350px">
.pull-left[
#### In ClickHouse DBMS (`as_chMDB()`)

- Data in DBMS until requested (`pull()`, `filter()`, ...)
- Efficient to get subsets <br> (`get_query()`)
- Convenient for sharing and managing access
- Versioning
]
</div>

???

As mentioned before, the *TKCat* package supports three main implementations of *MDBs* that can be easily interconverted. Each implementation presents features that make it more or less relevant for different usages.

In *memoMDB*, all the data are loaded in memory. It makes data manipulation fast but memory-hungry. It is convenient when the data are small or when we need to work with the whole tables.

///

In *fileMDB*, the data remain in text files until requested. It saves memory and allows loading only a few tables of interest when needed. However, it makes data filtering slow. The main purpose of *fileMDB* is to archive the MDB and to share it with external collaborators.

///

Finally, in *chMDB*, the data are stored in a *ClickHouse* database. It's very efficient to load the whole tables when needed, but also to get only records of interest by sending *SQL* queries. Moreover, mechanisms have been implemented for managing access rights and for supporting the versioning of *chMDB*. The main purpose of *chMDB* is to provide easy and flexible access to the data, which is quite convenient for sharing them internally.
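The difference between the in-memory and the file-backed approaches can be sketched in base R. This is only an illustration of the trade-off on hypothetical toy tables: the `pull_table()` helper and the file layout below are made up and do not reflect how *TKCat* actually stores a *fileMDB*.

```r
# Hypothetical toy tables
tables <- list(
   hp = data.frame(
      id = c("hp1", "hp2"),
      label = c("phenotype 1", "phenotype 2"),
      stringsAsFactors = FALSE
   ),
   diseases = data.frame(
      id = c("d1", "d2", "d3"),
      label = c("disease 1", "disease 2", "disease 3"),
      stringsAsFactors = FALSE
   )
)

# In memory: all tables are loaded and directly available
in_memory <- tables

# File-backed: write each table once to a text file and keep only the paths
store_dir <- file.path(tempdir(), "toy_mdb")
dir.create(store_dir, showWarnings = FALSE)
paths <- vapply(names(tables), function(n) {
   p <- file.path(store_dir, paste0(n, ".txt"))
   write.table(tables[[n]], p, sep = "\t", quote = FALSE, row.names = FALSE)
   p
}, character(1))

# pull_table() is a hypothetical helper: a table is read only when requested
pull_table <- function(paths, name) {
   read.delim(paths[[name]], stringsAsFactors = FALSE)
}

pull_table(paths, "diseases")
```

With the file-backed variant only the paths stay in memory, but getting a subset still requires reading a whole file, which is why a database back end is more suitable for subsetting.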
<!-----------------------------------------------------------------------------> --- ## TKCat: a data warehouse management system <p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;"> <a href="https://patzaw.github.io/TKCat/TKCat-User-guide.html#42_chTKCat" target="_blank"> <strong>TKCat user guide</strong> > chTKCat </a> | <a href="https://patzaw.github.io/TKCat/TKCat-User-guide.html#51_chTKCat_operations" target="_blank"> <strong>TKCat user guide</strong> > chTKCat operations </a> </p> .left-column[ ```r k <- chTKCat( host="localhost", user="default", password="" ) explore_MDBs(k) ``` ] .right-column[ <img src="media/chTKCat-resources.png" style="width:100%;" alt="explore_MDBs(k) graphical user interface: available knowledge resources"> ] ??? Multiple *chMDB* can be stored in the same *ClickHouse* instance, providing therefore a standard access to many knowledge resources. And they can be accessed via a *chTKCat* object which is a special connector to *ClickHouse*. The `explore_MDBs()` function opens a shiny user interface which can be used to browse available knowledge resources. *** When a resource is selected, its data model can be explored and tables can be previewed and potentially downloaded. This application can also be deployed to improve the awareness of available resources within a community. -- <div style="position: absolute; top:125px; left:200px; width:810px; padding:5px; border:solid black 3px; background:white;"> <img src="media/chTKCat-model.png" style="width:100%" alt="explore_MDBs(k) graphical user interface: explore the model of a specific resource" /> </div> <!-----------------------------------------------------------------------------> --- class: code-marg-5 ## Merging MDBs with collections <p style="position: absolute; bottom:3%; right:10%; margin:0; font-size:small; font-style:italic;"> <a href="https://doi.org/10.1093/nar/gkx1153" target="_blank"> ClinVar: <strong>Landrum et al. 
(2018)</strong> </a> | <a href="https://patzaw.github.io/TKCat/TKCat-User-guide.html#36_Merging_MDBs_with_collections" target="_blank"> <strong>TKCat user guide</strong> > Merging with collections </a> </p> <iframe src="TKCat-useR2022-Patrice-Godard_files/htmlwidgets_plots/unnamed-chunk-45.html" style="height:470px; width:95%; border-style:none; background-color:transparent;"></iframe> ??? *Collections* are a powerful mechanism to merge different knowledge resources based on shared concepts. For the sake of time, I'm not going to describe this mechanism today. You can find more information about it in the documentation. Briefly, here you can see the result of merging two MDBs that share the concept of diseases. Thanks to the DODO dictionary, an association table, in yellow, has been automatically created between the tables providing disease references. <!-----------------------------------------------------------------------------> --- ## Supported data types - character, numeric, integer, logical, Date, POSIXct (time) ??? I would like to finish with a comment regarding supported data types in *ReDaMoR* and *TKCat*. Currently, the following **canonical R types** are supported. *** In addition, *TKCat* allows storing **files** in *MDBs* using *base64* encoding. *** Finally, **matrices and sparse matrices** are also supported by *TKCat*. 
In the data model they are represented as tables with 3 fields: - one for row names - one for column names - and one for the values -- <div class="r70"> .pull-left[ - base64 (file) ] .pull-right[ <img src="media/MetaBase_Imagemaps.png" style="width:100%; border:solid 3px black;" alt="Example of files stored in base64 character"> ] -- .pull-left[ - matrix and sparse matrix ] .pull-right[ <img src="media/snRNASeq-matrix.png" style="width:100%; border:solid 3px black;" alt="Example of matrix"> ] </div> <!-----------------------------------------------------------------------------> --- class: comp-title ## Acknowledgements <div style="width:85%; margin-left:auto; margin-right:0%;"> .pull-left[ ### Supporting tools - [tidyverse](https://www.tidyverse.org/) and related packages - [visNetwork](https://datastorm-open.github.io/visNetwork/) - [shiny](https://shiny.rstudio.com/) and related packages - [ClickHouse](https://clickhouse.com/) and [RClickhouse](https://github.com/IMSMWU/RClickhouse) - [Matrix](https://cran.r-project.org/package=Matrix) - **Many others**: - [ReDaMoR dependencies](https://github.com/patzaw/ReDaMoR#dependencies) - [TKCat dependencies](https://github.com/patzaw/TKCat#dependencies) - [CRAN](https://cran.r-project.org/) ] ??? Obviously, the *ReDaMoR* and *TKCat* packages were developed relying on existing tools. I wanted to highlight here those of particular importance for the success of this project. However, this list is not exhaustive: the complete list can be found on CRAN or in the GitHub repositories. I would also like to take this opportunity to thank the CRAN team for their efficiency in the publication of these R packages. *** To conclude, I would like to thank the people in my team at UCB who have supported this project at different levels. Thanks to the organizers of the *useR!* conference who gave me the opportunity to present this work, and thank you all for the attention you gave to this presentation... 
-- .pull-right[ ### UCB team #### Managers and Developers - Jonathan van Eyll - Liesbeth François - Yuliya Nigmatullina #### Users and testers - Aurélie Bousard - Olga Giannakopoulou - Ioana Cutcutache - Bram Van de Sande - Waqar Ali - John Santa Maria ] </div> <!-----------------------------------------------------------------------------> --- layout: false count: false background-image: url("UCB-theme/UCB-logo-foot.png"), url("media/user2022-logo.png") background-position: 3% 97%, 98% 3% background-size: 15%, 6% <img src="UCB-theme/UCB-bottom-right-big.png" style="position:absolute; right:0px; bottom:0px; height:100%;"> <a href="https://patzaw.github.io/TKCat/useR2022/TKCat-useR2022-Patrice-Godard.html" target="_blank"> <img src="media/slides-link.svg" style="position:absolute; right:5px; bottom:5px; height:100%;"> </a> <div style="position:absolute; width:60%; left:20%; top:120px; text-align:center;"> <a href="https://patzaw.github.io/TKCat/useR2022/TKCat-useR2022-Patrice-Godard.html"> <img src="media/slides-qr-code.png" style="width:400px"> </a> </div> <div style="position:absolute; left:2%; top:13%;writing-mode:vertical-rl; transform:rotate(-180deg);"> Slides created with <a href="https://github.com/yihui/xaringan" target="_blank">xaringan</a> and <a href="https://pkg.garrickadenbuie.com/xaringanthemer/" target="_blank">xaringanthemer</a> </div> ??? Don't hesitate to contact me if you have any questions. <!----------------------------------------------------------------------------->