An Experiment in Document Conversion and Generation

This is the README file for the GitHub repository that holds the files used and created in this experiment. I’m including the README in its entirety since it kills two birds with one stone.


1. Introduction

This repo holds a set of files that I created as an experiment in getting old work out of proprietary formats. The idea is to take a Microsoft Word file and convert it into something that is human readable, openly formatted, and convertible.

To do this I settled on AsciiDoc to mark up the text of the paper. I chose AsciiDoc over Markdown because of its depth of features and the availability of conversion tools.


2. The Process

I decided to use a local install of Etherpad Lite (EL) as my primary text editor for this project. I did this because of a few features including autosave, versioning, and the potential for real time collaboration. I hoped that these features would provide me with a useful editing tool.

Once EL was set up and configured, I was faced with the problem of how to get the text of the paper into the editor in the first place. My initial inclination was to retype the document, formatting and editing as I went along. Faced with a 10,000 word doc and no appreciable typing skills, I was not happy with this option. After a bit of poking around in EL I found its import features. Getting Microsoft Word files imported required a bit more configuring, but it worked. I then imported the Word file into EL.

The import process added the text of the document to the editor. It stripped all of the formatting from the text and inserted the 112 footnotes in-line into the text. All of this was actually a good thing, making the process of marking up the doc with AsciiDoc easier. Using the original word processing file as a guide, I worked through the document adding the necessary AsciiDoc markup to format the paper. The most tedious part was the 112 footnotes, but since AsciiDoc handles footnotes with in-line markup it moved along as fast as could be expected.
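For readers who have not seen it, AsciiDoc's in-line footnote macro has the form footnote:[text] and sits directly after the text it annotates. The tagging described above was done by hand in Etherpad Lite; the following is only a minimal Python sketch of what the markup looks like, and the function name and sample sentence are hypothetical.

```python
def as_footnote(note_text):
    """Wrap note text in AsciiDoc's in-line footnote macro."""
    # Collapse runs of whitespace so the note stays on a single line.
    cleaned = " ".join(note_text.split())
    return "footnote:[" + cleaned + "]"


# Hypothetical sentence and note, used only to show the resulting markup.
sentence = "The plan was adopted in 1956."
note = "  See the bibliography\n for the full citation. "

marked_up = sentence + as_footnote(note)
print(marked_up)
```

The macro is appended with no space before it so that the footnote marker attaches to the preceding word in the generated output.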

In total I spent about 6 hours working on the AsciiDoc version of the document. The most time was spent tagging footnotes and figuring out the format for the bibliography.
[I am still not really pleased with the way the biblio looks. I think I can fix it in a later iteration, though.]
The rest of the formatting, such as section titles, quotes, emphasis, and lists, was straightforward, though I did keep a copy of the AsciiDoc User Guide open in another tab to help out.

I found the Etherpad Lite interface easy to work with and really appreciated the autosave and versioning features. EL doesn’t know about AsciiDoc markup, though, so that presented a challenge. In order to preview the work I had to export the file as text and then run the basic AsciiDoc to HTML conversion, opening the resulting file in another browser tab to see what was going on. As I became more confident of my work, I checked less often, so this was not much of an issue. I marked major revisions as saved revisions at the end of each section of the document to give me a nice clean revision history.

Once I had a nice clean version that produced good HTML, I exported a final copy to my local computer and set about using the AsciiDoc utility a2x to generate the document in various formats. For this particular experiment I went with XHTML, PDF, and EPUB. The generation/conversion process was marred only by my problems with understanding the format for the bibliography at the end of the document. Once I figured out just how to mark up the bibliography, the process was flawless. a2x first converts the AsciiDoc marked document into a DocBook XML file and then converts the DocBook file into other formats. The process uses the standard set of XML processing tools as well as CSS to generate the files. By using custom CSS files, the layout and formatting of the various output files can be changed as needed.


3. The Files

The files included in this repo are the ones used and generated as part of the process described above.

KELSOFIN20130111.docx – The Microsoft Word file that was used for the starting point. This document began as a WordPerfect file in 1992 and was moved to Word in the mid-90s.
KelsoPaper.txt – The AsciiDoc version of the file as created and edited in Etherpad Lite. This is the file used to generate the other formats.
KelsoPaper.pdf – PDF file generated from KelsoPaper.txt using the command a2x -v -f pdf KelsoPaper.txt
KelsoPaper.html – XHTML file generated from KelsoPaper.txt using the command a2x -v -f xhtml KelsoPaper.txt
docbook-xsl.css – CSS file used to style KelsoPaper.html
KelsoPaper.epub – EPUB file generated from KelsoPaper.txt using the command a2x -v -f epub KelsoPaper.txt
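
The three a2x commands listed above differ only in the format argument, so they are easy to script. What follows is a small Python sketch of one way to do that, not something that ships with the repo; the helper name a2x_command is hypothetical, and the conversion only runs if a2x is installed and KelsoPaper.txt is in the current directory.

```python
import os
import shutil
import subprocess

SOURCE = "KelsoPaper.txt"            # AsciiDoc source named in the file list
FORMATS = ["pdf", "xhtml", "epub"]   # the three formats generated above


def a2x_command(fmt, source=SOURCE):
    """Build the a2x argument list for one output format."""
    return ["a2x", "-v", "-f", fmt, source]


commands = [a2x_command(fmt) for fmt in FORMATS]

for cmd in commands:
    print(" ".join(cmd))
    # Skip the actual conversion when a2x or the source file is missing.
    if shutil.which("a2x") and os.path.exists(SOURCE):
        subprocess.run(cmd, check=True)
```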

4. Conclusion

I am happy with the results of this experiment and hope to be able to further explore the use of Etherpad Lite and AsciiDoc as a tool set for creating free and open documents.

A Book Is A Book And Other Thoughts On Our Webby Future

In February I wrote that every book is a website and we need to embrace the webiness of books. This led to some good discussion about the nature of books generally and casebooks in particular and about the nature of websites. The discussion helped clarify a couple of things in my mind.

First, though every book is a website, not every website is a book. As I mentioned in the previous article, once a book is in an electronic form such as EPUB, the process to make the book into a website is straightforward. That is not to say that it is easy, but that the path from EPUB to website is clearly marked. The reverse is not true. Moving a website to a book format such as EPUB is not straightforward and may even be impossible.

A website is often a complex and carefully organized store of information. It may be fairly static, with a single information store arranged and hyperlinked for readers to discover. It may be interactive, drawing the reader/visitor deeper into the site primarily through the use of hyperlinks to reveal or explain things. It may not even contain any text at all. The design of a website holds clues as to whether or not it can survive the transformation into a book.

Simple static websites are the best candidates for books. Information, often mostly text, is arranged in some sort of linear fashion. Links to outside sites are minimal. A single author or a small group of collaborators gives the site a particular voice. A blog is a good example of the sort of site that lends itself to being bookified.

Contrast this to a more complex and interactive site where the community contributes to the site or games are played or movies are watched. Information is arranged in a non-linear fashion. Links, both internal and external, abound. A multitude of authors, editors, and contributors all bring their voices to the site. Wrangling this into a book could not be done without destroying the value of the site.

This brings me to my second point: a book is a book. It does not matter if the medium is paper or bits; the form and structure of a book are still the same. Books have covers, title pages, tables of contents, chapters, notes (foot or end). The structure of a book is a known thing and the structure carries through all media. This is something that makes books unique. A book is a book in hard cover, paperback, on the Kindle, Nook and iPad, in the PDF file on your PC, and ultimately on the web.

Moving a book from print to electronic is not magic and it does not make the book better. The change in format just changes how readers access the book. If you want to make a “better” book, then build a website. Adding interaction and multimedia to a book is often a valuable way to enhance the information that is provided, but these enhancements are better done as a website than as a book.

Transforming a book into a website is the way to make a better book and destroy the book at the same time. Rather than spending time trying to shoehorn elements of a website into a book, we should let go of the book and embrace the web as the book of the future.

New Version of Sigil EPUB Editor To Have WYSIWYG Editor

The forthcoming 0.6.0 version of Sigil, my favorite desktop EPUB editor, is going to have a WYSIWYG HTML editor in the BookView. This is a much needed addition to a great tool that will allow for greater control over the editing and creation of EPUBs.

From Making epub happen:

The next release of Sigil is shaping up nicely. There is so much going into it that the next release will be 0.6.0. Unfortunately, EPUB 3 will not be one of the features making it into 0.6.0. One major change coming will be a new BookView (BV) editor. Here is an unfinished preview of what it might look like.

This is only a concept preview of the new editor. One issue that needs to be resolved is the double tool bar. I haven’t decided yet if I’m going to use the one in the BV pane or the global one in the window itself.

iBooks Author Gives You the Power to Design Your Own Book, Here

This article provides a balanced review of iBooks Author.

Feds Launch Learning Registry To Improve Discoverability of OER

The Learning Registry addresses the problem of discoverability of education resources. There are countless repositories of fantastic educational content, from user-generated and curated sites to Open Education Resources to private sector publisher sites. Yet, with all this high-quality content available to teachers, it is still nearly impossible to find content to use with a particular lesson plan for a particular grade aligned to particular standards. Regrettably, it is often easier for a teacher to develop his own content than to find just the right thing on the Internet.

The Learning Registry is a joint Department of Education and Department of Defense project to provide a common infrastructure for discoverable metadata for OER. The goal is to help the teacher locate the “just right” education content that is freely available on the web. Rather than just being yet another portal, the Learning Registry is designed as infrastructure, with community members running registry nodes that feed metadata and paradata back to other nodes, all via a set of open APIs.

This seems like an excellent step toward solving the discovery problem that seems to plague OER. It also presents an opportunity for folks creating OER in the law school community to create a Learning Registry node for law school OER.

Go To Hellman: Creative Commons Media Neutrality and eBook Rights after Rosetta v. Random

Free Law Reporter – My Roadmap

Now that the Free Law Reporter (FLR) has had a few weeks to settle into processing Public.Resource.Org’s Report of Current Opinions (RECOP) XML feeds into valid HTML and generating working .epub ebooks (I even released the code on GitHub), I thought it might be time to lay out where I see FLR heading in the coming months. Right now (5/17/11) you can visit FLR, search through slip opinions from over 60 state and federal jurisdictions, view the documents in HTML, download complete FLR volumes by jurisdiction as ebooks, and download all documents in a search result as an ebook.

Using this as a foundation, I will be adding several additional features to FLR over the coming months:

  • Advanced search and analysis features to tap the power of Solr
  • Create a library to provide a single point of access to the FLR volumes
  • Select specific documents from search results or browsing to add to custom ebooks
  • Increase the size of the corpus that makes up the Free Law Reporter
  • Edit selected documents from search results or browsing to create truly custom ebooks
  • Provide tools for the community to add value to the Free Law Reporter
  • Citations

The milestones that follow will occur in roughly the order they are listed, but there is no set timetable for implementing these features. This is due primarily to the other development projects I have on my plate, including the main CALI website, Classcaster, eLangdell, Legal Education Commons, and the CALIcon website.

As you will note, a number of the features I have in mind for the FLR will require significant community involvement to really materialize. In this context, I see the community as law librarians, law faculty, and law school technologists with an interest in seeing open and unencumbered access to legal resources for everyone.

Advanced search and analysis features to tap the power of Solr – The index, analysis, and search features of the FLR are powered by Apache Solr. Right now I am exposing only a minimum of its potential to provide very basic searching of FLR documents. An advanced search option will provide Boolean operators, phrase and term proximity queries, sub queries, and date range queries. The facet search and “More like this” features of Solr will be exposed to provide drill-down capabilities and access to related documents. All of these will provide a much richer and more robust search environment for locating documents.

All of this development will be done in the open with the hope that the community will get involved in shaping how documents are indexed and analyzed, how search is done, and how results are displayed. Because Solr is an open source project, we have access to the complete inner workings of the engine. Imagine being able to tune Google specifically for legal resources or adjust WestlawNext to work better for law students and faculty. Those are the sorts of things that can be done with FLR because we have control over the system.
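
To make the advanced search features a little more concrete, here is a rough Python sketch of how a Solr request combining a Boolean operator, a proximity query, a date range filter, faceting, and “More like this” might be assembled. It is not the code behind FLR. The endpoint URL and the field names (decision_date, jurisdiction, text) are assumptions made up for the example; the parameter names (q, fq, facet, facet.field, mlt, mlt.fl, wt) are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real deployment would use its own host and core.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_params(query, start_date, end_date, facet_field):
    """Return Solr request parameters as a list of (name, value) pairs."""
    return [
        ("q", query),
        # Filter query restricting results to a date range
        # (decision_date is an assumed field name).
        ("fq", "decision_date:[%s TO %s]" % (start_date, end_date)),
        # Faceting provides the drill-down counts.
        ("facet", "true"),
        ("facet.field", facet_field),
        # MoreLikeThis returns related documents based on an assumed text field.
        ("mlt", "true"),
        ("mlt.fl", "text"),
        ("wt", "json"),
    ]


def build_url(params):
    """Encode the parameters onto the select endpoint."""
    return SOLR_SELECT + "?" + urlencode(params)


# "breach" within 5 words of "contract", AND the term "damages".
params = build_params(
    '"breach contract"~5 AND damages',
    "2011-01-01T00:00:00Z",
    "2011-05-17T00:00:00Z",
    "jurisdiction",
)
url = build_url(params)
print(url)
```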

Create a library to provide a single point of access to the FLR volumes – With dozens of volumes being added to the FLR every week, finding things becomes an issue. Right now a search of the corpus returns links to volumes of the FLR that contain specific opinions, allowing for the download of ebooks. That isn’t really helpful if all you want is to download the latest Iowa volume to your ereader. I will add a central library mechanism to track all of those volumes as they are created weekly. This work will be done using the Open Publication Distribution System (OPDS) Catalog specification, which will generate feeds that can be consumed by various ereaders and will help locate and track FLR volumes. This OPDS feed will act as an interface that will allow community access to the FLR library. Using the OPDS feed, law libraries could add the Free Law Reporter to their local collections.
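
An OPDS catalog is an Atom feed whose entries carry acquisition links pointing at the downloadable files. As an illustration only, the Python sketch below builds a bare-bones feed of that shape with the standard library. The volume title, identifier, and URL are invented for the example, and a complete OPDS catalog needs more than is shown here (feed-level id, updated, and navigation links, for instance).

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Link relation OPDS uses to mark a downloadable publication.
ACQUISITION_REL = "http://opds-spec.org/acquisition"


def atom(tag):
    """Qualify a tag name with the Atom namespace."""
    return "{%s}%s" % (ATOM_NS, tag)


def build_catalog(feed_title, volumes):
    """Return a minimal Atom/OPDS-style feed as an XML string."""
    ET.register_namespace("", ATOM_NS)
    feed = ET.Element(atom("feed"))
    ET.SubElement(feed, atom("title")).text = feed_title
    for vol in volumes:
        entry = ET.SubElement(feed, atom("entry"))
        ET.SubElement(entry, atom("title")).text = vol["title"]
        ET.SubElement(entry, atom("id")).text = vol["id"]
        ET.SubElement(entry, atom("updated")).text = vol["updated"]
        ET.SubElement(entry, atom("link"), {
            "rel": ACQUISITION_REL,
            "type": "application/epub+zip",
            "href": vol["href"],
        })
    return ET.tostring(feed, encoding="unicode")


# Hypothetical volume record, for illustration only.
sample_volumes = [{
    "title": "Iowa, week of 2011-05-09",
    "id": "urn:example:flr:iowa:2011-05-09",
    "updated": "2011-05-17T00:00:00Z",
    "href": "http://example.org/flr/iowa-2011-05-09.epub",
}]

catalog_xml = build_catalog("Free Law Reporter volumes", sample_volumes)
print(catalog_xml)
```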

Select specific documents from search results or browsing to add to custom ebooks – Right now you can save the complete results from an FLR search as an ebook. While useful, this approach has drawbacks, including the fact that not all of the documents returned by your search may ultimately be relevant to your needs. I will add the ability to review documents returned in a search and select which documents get included in the ebook. That means that the custom FLR volumes you create will be more relevant to your needs. These custom volumes will be assigned a URL and saved in the FLR library so that they can be shared and downloaded again in the future.

This custom ebook feature will provide a way for faculty and law librarians to assemble custom volumes of FLR documents that can be shared with students or added to a law library’s local collection. With some work the community can create custom law reporters that are focused on a single topic.

Increase the size of the corpus that makes up the Free Law Reporter – Right now the FLR contains just the slip opinions from Carl Malamud’s RECOP feeds. That means it covers documents issued by over 60 state and federal jurisdictions since about January 1, 2011. This is a very limited scope for a project with this much potential. To expand the scope of the corpus, I plan on adding the approximately 1,000,000 other federal court opinions available on the Public.Resource.Org website. This will push the depth of the FLR collection to include many of the opinions in the Federal Reporter series. I will also add various other sets of documents that are available as HTML (or in XML that can be transformed) such as the U.S. Code to the collection. The addition of these documents will provide greater context for results found through the FLR search interface and more material that can be used to create custom ebooks.

While U.S. federal material is relatively easy to obtain and incorporate into the FLR, state level material is more difficult to locate and add. Certainly the RECOP feeds provide good access to state appellate court material from January 1, 2011 forward, but the backfile of state court opinions is harder to come by. Likewise, state codes and statutes are often difficult to locate and are usually not available in a downloadable format. Community involvement will be the key to building out the state collections in the FLR. Law librarians are an excellent resource for locating state legal materials and I would encourage them to work with state courts and governments to obtain access to downloadable opinions and codes that can be incorporated into the FLR.

Edit selected documents from search results or browsing to create truly custom ebooks – It follows that once you can select specific documents for inclusion in custom FLR volumes, you will want to be able to edit those documents to highlight specific points and/or add commentary. Because the source documents for FLR volumes are HTML, I will be able to provide this feature as part of a process that will allow you to search or browse for documents, select those documents for inclusion in a custom ebook, edit those documents as you see fit, and add your own chapters to the ebook. Once the selection and editing is complete you will be able to save the volume and you will be provided with a URL for the volume that you can share or use to download the ebook.

As with the simple selection and publishing features, this feature will provide a way for faculty and law librarians to assemble custom volumes of FLR documents that can be shared with students or added to a law library’s local collection. With some work, the community can create things like annotated law reporters and statute books. Law faculty can create customized course materials for their students.

Provide tools for the community to add value to the Free Law Reporter – One of the major feature sets I plan to add to the FLR is a set of tools that will allow the community to add value to the collections. Examples include tools for adding headnotes to a document, tagging a document, and adding commentary to a document. These will provide the community with the capability to enhance and extend the value of the FLR. We all need to get involved in making the Free Law Reporter into a resource that is of great value to students, researchers, and the public, a resource that provides free and unencumbered access to legal materials to those who need to learn about the law.

Citations – I have already been asked several times about how one would cite to the Free Law Reporter. My answer has been that right now I would not cite to the Free Law Reporter. The FLR currently only contains slip opinions that are available more easily elsewhere and any citation should be to the more easily available and recognizable source. I do realize that this is not a satisfactory answer. As the FLR grows it will need to be citable and that is a very complicated problem. I have included unique identifiers and lots of metadata in the documents added to the FLR so far. What I would like to see happen is that we talk about this and take the opportunity presented by a new law reporter published in a new medium to figure out the best way to create citations for the FLR. I would suggest using the FLR discussion forum for this.

This is where I see the Free Law Reporter headed over the coming months. The FLR project is important because it is intended to create a resource that provides free and unencumbered access to legal materials to those who need to learn about the law. It is important because it will provide a way for a community of law librarians and faculty to come together to create this valuable resource.

Disclaimer – The Free Law Reporter is a CALI project. This roadmap is where I would like to see the FLR go and it is not intended to commit CALI to any particular direction on the project.

The Report of Current Opinions: Santa Comes Early to the Open Law Movement

Public.Resource.Org will begin providing in 2011 a weekly release of the Report of Current Opinions (RECOP). The Report will initially consist of HTML of all slip and final opinions of the appellate and supreme courts of the 50 states and the federal government. The feed will be available for reuse without restriction under the Creative Commons CC-Zero License and will include full star pagination. This data is being obtained through an agreement with Fastcase, one of the leading legal information publishers. Fastcase will be providing us all opinions in a given week by the end of the following week. We will work with our partners in Law.Gov to perform initial post-processing of the raw HTML data, including such tasks as privacy audits, conversion to XHTML, and tagging for style, content, and metadata.

via The Report of Current Opinions – O’Reilly Radar.

On Sunday Dec. 19 Carl Malamud made the startling announcement quoted above. And you did read it correctly: “The Report will initially consist of HTML of all slip and final opinions of the appellate and supreme courts of the 50 states and the federal government.” To say that this is huge would be the understatement of the year.

From personal experience I can tell you that the “slip and final opinions of the appellate and supreme courts of the 50 states and the federal government” have never all been freely available in HTML before. Not even close. At best you could probably wrangle 75% of these opinions in PDF using a mountain of code to scrape sites and parse feeds. To have all this available as a single feed is a game changer.

As a researcher and builder of tools for legal research and education, having access to a single feed that contains all of this data is just the thing I’ve been looking for (and occasionally trying to build) for the past 15 or so years. I have no doubt that the availability of this feed will spark a flurry of development to use the data in new and interesting ways. I will certainly be incorporating it in the CALI tools I’m currently working on.

Of course there are a couple of caveats here. First, we haven’t seen the feed yet. It won’t be available for a few weeks, so right now I’m still just waiting to see what it will look like. Second, there are 2 “timeouts” built into this service: direct government involvement by July 1, 2011 and a general sunset of private sector activity in creating the feed at the end of 2012. The timeouts underscore the belief that providing free and open access to primary legal materials is a duty of the government, plain and simple. As citizens we are bound to follow the law and our government should be obligated to provide us with free and open access to that law.

I know I’m certainly looking forward to a new year that brings greater free and open access to the law. Thanks, Carl.

Princeton Kindle Trial Gets Mixed Reviews

“Much of my learning comes from a physical interaction with the text: bookmarks, highlights, page-tearing, sticky notes and other marks representing the importance of certain passages — not to mention margin notes, where most of my paper ideas come from and interaction with the material occurs,” he explained. “All these things have been lost, and if not lost they’re too slow to keep up with my thinking, and the ‘features’ have been rendered useless.” …

Katz also added that the absence of page numbers in the Kindle makes it more difficult for students to cite sources consistently.

“The Kindle doesn’t give you page numbers; it gives you location numbers. They have to do that because the material is reformatted,” Katz said. He noted that while the location numbers are “convenient for reading,” they are “meaningless for anyone working from analog books.”

via Kindles yet to woo University users – The Daily Princetonian.

Not much of a surprise here. I find the Kindle great for leisure reading, but not so useful for work-related stuff. I’ve read a number of novels and even law review articles and court opinions on it and it is fine for straightforward reading (probably why they call them e-READERS). It is not very useful for my tech-oriented work material, though, because I either need to flip around a lot (easy in the physical artifact) or copy and paste (straightforward in anything on my desktop).

As mentioned in the quotes above, the lack of annotation and highlighting features (yes, it does have basic annotation features, but they are not all that great) coupled with the inability to make meaningful citations are real show stoppers in legal academia. And, FWIW, it isn’t just the Kindle that is a problem here; it is the entire class of dedicated e-readers. As a class they are designed for reading, books mostly, but increasingly news sources as well.

Learning is a different activity than reading, though reading can be a component of learning. In a learning environment, reading becomes more than the passive recognition of words. It should be an active experience that makes use of the reading material as a raw material that is transformed into knowledge. What we need is a class of device that doesn’t really exist yet, the e-learner.

The e-learner device would be an excellent e-reader with full, unfettered net access that provides the student with the ability to use the reading material to build knowledge. It would have annotation, highlighting, and note taking capabilities, and provide some sort of universal citation format for citing the materials. A touch interface would allow for flipping through materials and real, deep hyperlinking would allow for meaningful search and discovery. Net access would provide a gateway to wider bodies of information and knowledge.

Of course no such device really exists today. You could sort of fake it with a laptop or netbook, but the underlying tools and resources are not really there. Merely converting a casebook or course materials into PDF or HTML only starts the process, but it is a start. Once materials are in an electronic format then the other tools will come along. And perhaps one day not long from now we will see the e-learner.

FastPencil Brings Self-publishing To The Web

FastPencil is self-publishing with a twist. The traditional publishing process is a daunting one that can take many months of effort and more money than most writers anticipate. It’s no wonder authors get discouraged.

You shouldn’t have to ask anyone’s permission to write and publish your own book! We have removed the hurdles inherent in traditional publishing by combining amazing advances in print on demand technology with a sophisticated online workflow system.

FastPencil

This site provides a one-stop resource for writing and publishing a ‘book’. On the authoring side it provides a lightweight web interface for creating an outline and entering text so you can write in the browser. It provides tools for importing blog posts, adding collaborators, assigning editors and more.

Once you’ve committed your work to ‘paper’ the publishing features allow you to categorize the work, set copyright, including Creative Commons licensing, add cover art, and select formats. You may publish your work as an EPUB e-book or as a printed book. If you opt for a physical artifact, you can select typeface, paper size, set a price, and more.

FastPencil is free to use up to the point where you want to order physical books. Packages are available that provide services such as professional review and editing, cover art design, retail distribution and more.

So, it looks like FastPencil may be a good choice for authors interested in retaining complete control over their work or those interested in publishing their work but not inclined to clear all of the hurdles imposed by traditional publishers.