This document describes the general approach and design of redoc for developers interested in contributing.
Two-way R Markdown workflows are challenging because R Markdown and
knitr workflows are lossy: the compiled document does
not contain all of the information in the source. We are also limited
by what information pandoc can pass from markdown
to final formats and back.
To produce a Reversible Reproducible Document in Word (a “redoc”),
the redoc() format first pre-parses the source
.Rmd file. knitr doesn’t expose its parser
to developers, so I’ve lifted most of the code for this parser from
knitr and rmarkdown. The parser
captures YAML headers, code chunks, and inline code, giving names to
unnamed chunks and inline code sections and wrapping them in named
<div> and <span> tags with unique
id values and the class "redoc". The contents
of those sections are stored in a file called
filename.codelist.yml.
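As a rough illustration of the wrapping step, the sketch below uses base R only. It is not redoc's parser: the helper name wrap_inline_code(), the regular expression, and the redoc-inlinecode-N id scheme are assumptions made for this example, and a named list stands in for the codelist file.

```r
# Hypothetical helper (not part of redoc): wrap inline R code in
# <span class="redoc"> tags and collect the original code in a named list.
wrap_inline_code <- function(text) {
  pattern <- "`r [^`]+`"                      # assumed pattern for inline R code
  m <- gregexpr(pattern, text)
  matches <- regmatches(text, m)[[1]]
  # Assumed id scheme; redoc's real ids may differ.
  ids <- sprintf("redoc-inlinecode-%d", seq_along(matches))
  wrapped <- sprintf('<span class="redoc" id="%s">%s</span>', ids, matches)
  regmatches(text, m) <- list(wrapped)        # substitute each match in place
  list(text = text, codelist = setNames(as.list(matches), ids))
}

res <- wrap_inline_code("The mean is `r mean(x)` and the sd is `r sd(x)`.")
cat(res$text, "\n")
str(res$codelist)
```

Keeping the original code keyed by id is what makes the later round trip possible: the Word document only needs to carry the ids, not the code itself.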
redoc() then knits the .Rmd file. Code
output is wrapped within the same <span> and
<div> tags as the original chunks.
When the knitted document is converted to a .docx by
pandoc, redoc passes it through a series of pandoc lua filters (found
in inst/lua-filters). These do three things:
Convert <span> and
<div> tags of class redoc to hidden custom styles
with names corresponding to their unique IDs so that they are retained
in the Word document.
Then, using rmarkdown::output_format()’s
post_processor argument and functions from officer, the
original .Rmd and the codelist.yml file are
stored in the Word document. As .docx files are just ZIP
archives, this is straightforward, except that some metadata must be
added to ensure Word preserves these files when editing.
If the option diagnostics=TRUE is set, information about
the R session and current software versions is also stored in the Word
document for later debugging.
If highlight_output=TRUE is set, the post-processor also
modifies all Word document styles to color the redoc-class
sections.
When the dedoc() function is run, it extracts the
*.codelist.yml file from the .docx file.
Then pandoc is used to convert the docx back to
markdown. A custom lua
filter converts any track-changes text to Critic Markup, and another
lua filter replaces any elements with the custom redoc
styles with placeholders of the form [[chunk-id]].
dedoc() then uses the data in the *.codelist.yml
file to replace these placeholders with the original chunk (or inline code).
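The placeholder substitution can be pictured with a few lines of base R. This is a simplified sketch rather than dedoc()'s actual code: restore_placeholders(), the chunk ids, and the in-memory codelist are hypothetical, and it ignores the case where a placeholder has gone missing.

```r
# Hypothetical helper (not part of redoc): swap [[chunk-id]] placeholders
# for the code stored under the same id.
restore_placeholders <- function(md, codelist) {
  for (id in names(codelist)) {
    placeholder <- paste0("[[", id, "]]")
    # fixed = TRUE treats the brackets as literal text, not a regex.
    md <- gsub(placeholder, codelist[[id]], md, fixed = TRUE)
  }
  md
}

fence <- strrep("`", 3)  # build the chunk fence without typing it literally
codelist <- list(
  "redoc-codechunk-1"  = paste0(fence, "{r}\nsummary(cars)\n", fence),
  "redoc-inlinecode-1" = "`r nrow(cars)`"
)
md <- "Intro text.\n\n[[redoc-codechunk-1]]\n\nThere are [[redoc-inlinecode-1]] rows."

restored <- restore_placeholders(md, codelist)
cat(restored, "\n")
```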
In the event that chunk output has been deleted or modified beyond
recognition, redoc tries to restore the corresponding code
near its original location. Depending on the
policies selected via dedoc()’s block_missing
or inline_missing arguments, the restored code may be
wrapped in an HTML comment or not restored at all.
redoc() is based on
rmarkdown::word_document(), and can similarly be
extended.
The simplest form of extension is defining additional parts of the
document to be wrapped and stored in the *.codelist.yml
file. These are defined as a list of functions in the
wrappers argument of redoc(). Each function
captures a type of code, and by default these are R chunks and inline
code, HTML comments, YAML blocks, some LaTeX, pandoc-style
citations, and pandoc raw
spans and blocks.
You can capture other types of code by adding additional functions,
which are detailed in the ?wrappers documentation. If the
code is simple enough to be captured with a regular expression, these
functions can be generated with make_wrapper().
When building additional formats based on redoc(), it is
important to use the base_format option of
rmarkdown::output_format(). rmarkdown will
then merge the post_processor functions of
redoc() and your format so that redoc()'s post-processor runs
after your custom one.
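The ordering described here can be pictured as function composition. The sketch below is plain R and does not call rmarkdown: custom_post(), redoc_post(), and compose_post() are hypothetical stand-ins, and real post-processors take more arguments than the single file path used here.

```r
# Hypothetical stand-ins for two post-processors. Each takes the path of the
# output file and returns it, recording its name so the order is visible.
calls <- character(0)
custom_post <- function(output_file) {
  calls <<- c(calls, "custom")
  output_file
}
redoc_post <- function(output_file) {
  calls <<- c(calls, "redoc")
  output_file
}

# Compose so that the overlay (your format) runs first and the base
# (redoc) runs last, matching the order described in the text.
compose_post <- function(base, overlay) {
  function(output_file) base(overlay(output_file))
}

merged <- compose_post(base = redoc_post, overlay = custom_post)
out <- merged("report.docx")
print(calls)
```

Running redoc's step last matters because it embeds files in the finished .docx; a custom step that rewrote the document afterwards could drop them.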
Future versions of redoc() will include a reversible
version of officedown::rdocx_document().