Why is a post on incrementally reading PDFs taking so long?

March 22, 2010

At first it sounded like an easy idea: you just sketch out some strategy, write down the steps involved and try it out. I’ve done that, but the problem of dealing with PDFs was, as I was about to find out, of a much higher order of magnitude.

Randy Pausch clearly stated that brick walls are there not to stop you, but to stop the others. If you haven’t had the opportunity to see what over 11 million visitors already have, I recommend you take some time and watch his last lecture.

Incrementally reading PDFs is clearly not a childhood dream, just a small project, but unless someone comes up with some brilliant idea I feel like I’ll keep banging myself against this wall until someone helps me get over it instead of being stopped by it.

I’d like to hear how you’re dealing with the same issue of incrementally reading PDFs. Email me or just post a comment.

What I have already tried

Read and Copy.

Perhaps the easiest approach is to just read the text and, as soon as you find something important, select it, copy it to the clipboard and paste it into SuperMemo. This leads to a lot of badly formatted spacing, as PDFs don’t have a single clue about text structure. You might of course paste the extract into some text editor, delete (trim) all the extra white space, and then import it into SuperMemo.

But by the time you’re making the first extract you’ve already used a lot more cognitive resources than you’re supposed to, resources you needed to understand what you were reading in the first place, and you don’t even have the source on your extract, or any highlight in the PDF to mark what you’ve extracted.
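
The trimming step described above can also be scripted instead of done by hand in a text editor. What follows is only a rough sketch in Python, not a tested part of my workflow; the function name clean_pdf_text is made up for illustration, and it assumes that blank lines in the pasted text mark paragraph breaks.

```python
import re


def clean_pdf_text(raw: str) -> str:
    """Tidy text copied from a PDF viewer before pasting it elsewhere."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Re-join words split by a hyphen at the end of a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse every run of white space
    # (including single line breaks) into one space.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop empty paragraphs and separate the rest with one blank line.
    return "\n\n".join(p for p in cleaned if p)
```

One known weakness: a genuinely hyphenated compound that happens to break at the end of a line loses its hyphen, so the result still needs a quick look before it goes into SuperMemo.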


I’ve downloaded every single trial OCR engine out there. I personally own ABBYY FineReader but was looking for something to solve the SuperMemo issue. After loading these programs with PDF articles, running OCR on them, and saving them as TXT, RTF, DOC, DOCX, HTML, XHTML, etc. (as exact copy), my conclusion is that it simply takes too much time to select those pictures and text blocks that, although so obvious to the eye, no software has until now been able to recognize with an accuracy over 70%. I’m not talking about OCR per se; text recognition accuracy is over 95% in almost any good OCR software. But the text structure itself is really badly recognized. To make a long story short, OCR doesn’t produce a SuperMemo-importable file.

The best you can hope for is that the PDF article has a simple structure and is short; then you can load it into SuperMemo, but this is definitely not a complete solution.

Read and Copy via Autohotkey.

This is the closest I have come to processing text from PDF articles, but it is not a simple AHK scripting solution. It will not work everywhere, on any PDF reader or any platform, and with a script alone you won’t have a complete, orderly processing of an article: knowing what you’ve extracted and from what source, and having a unique PDF-ID so you can find it fast enough in case you need it a couple of months (or years) after making the extract; even less knowing whether some particular piece of text has already been introduced into SuperMemo. You need all of these to be able to read PDFs incrementally without importing the complete article or book into SuperMemo. At the same time you want it to be fast and almost automatic, and of course you’ll want a method that is not so unorthodox that whenever a new version of SuperMemo or your PDF reader comes out you have to change your method again and again.
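
To make the bookkeeping requirements above more concrete (a unique PDF-ID, the source of each extract, and a check for text that has already been imported), here is a minimal sketch in Python. It is not my AHK script, and every name in it (pdf_id, extract_key, ExtractLog) is hypothetical. It assumes the PDF-ID can be derived from a hash of the file contents, and that two extracts count as the same when they differ only in white space or letter case.

```python
import hashlib
import re


def pdf_id(pdf_bytes: bytes) -> str:
    """Short, stable ID derived from the file contents (an assumption)."""
    return hashlib.sha1(pdf_bytes).hexdigest()[:12]


def extract_key(text: str) -> str:
    """Fingerprint of an extract, ignoring white space and letter case."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class ExtractLog:
    """In-memory record of which extracts came from which PDF and page."""

    def __init__(self):
        self._seen = {}  # extract key -> (pdf id, page number)

    def add(self, pdf_bytes: bytes, page: int, text: str) -> bool:
        """Record an extract; return False if the same text was logged before."""
        key = extract_key(text)
        if key in self._seen:
            return False
        self._seen[key] = (pdf_id(pdf_bytes), page)
        return True

    def source_of(self, text: str):
        """Return (pdf id, page) for a known extract, or None."""
        return self._seen.get(extract_key(text))
```

In practice the log would have to be saved somewhere between sessions, for instance as a plain text file next to the PDFs; the sketch leaves that part out.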

Another situation that makes this process difficult to sort out is that, more and more, I’m relying on a simple file-system-based research database to liberate myself from proprietary software formats; TXT and JPG are now my preferred way to save files. I would probably go as far as saving everything in XHTML, but I’d like to have access to my database at any time without having to worry that some particular random-ware application has become forgotten-ware, or of course that XHTML has turned into XZalphaHTML or something like that, making my old files unreadable without hours of converting them.

Lifelong learning demands standardized tools, and that is something I was not aware of until my collection went over 25,000 Q&A items and more than 6,000 full articles. Then I understood that PDFs can’t also go as full books into my SuperMemo, because it (currently) makes no sense to include any information unless I’m certain it will be read (must-read lectures) or it has already been read (extracts from books, and from now on PDF extracts). It took me this amount of importing to acknowledge that simple is better, one of PW’s rules.

That being said, I don’t think we need to refrain from collecting any article we’d like to read in the future in some systematic way that lets us, if time is available, import it, or read it and import its extracts into SuperMemo later on. But we must deal with it outside SuperMemo to keep the collection from running into a knowledge overweight issue, which is so difficult to fix without wasting a lot of time pruning articles no longer needed or indefinitely postponing low-priority articles while doing our repetitions, taking time away from the repetitions themselves (aka learning and reviewing).

Scrapbook, a Firefox add-on, is great for reading outside SuperMemo and later importing into any SRS (my preference of course is still SuperMemo). I’m using a simple IFFRS (Incremental Folder Filtering Repetitions System) on Scrapbook for all my would-like-to-read articles. I had previously overlooked this add-on but the current version is much better; thanks to Marcin Rybacki for reminding me about this great add-on.

I’m not far from achieving a systematic way of reading PDFs into SuperMemo, but I’m also not close enough to a simple, nice solution that would deserve posting in full.

On a last note, some visitors have been searching for cognywiki on the blog; certainly this must be a result of my mentioning it before. Well, a couple of days ago I got in touch with Oliver Geordon, cognywiki’s creator, and he was kind enough to let me post his method here, so that should make a new post later on. I think he has a lot of nice principles on how to deal with note-taking, at least worth the time to try out, especially if your objective is not only lifelong learning but also lifelong note-taking à la Thomas Edison.

Hope you guys are doing great. Sorry for such a wordy post; sometimes I feel like I’m on Adderall, my brain keeps going and going. If only all this thinking were full of great ideas. Until next time.

9 Comments
  1. March 22, 2010 16:11

    I also had issues with reading PDFs. No solution so far – I just import text where it is enough…

    I recently bought some ebooks which are DRM-encrypted PDFs – I couldn’t copy, so I had to import paragraphs through ABBYY Screenshot Reader. It seems that the reading has to be done in Adobe Reader, and extracting paragraphs is where SuperMemo can be used…

    • gersapa
      March 25, 2010 09:37

      Thanks, Robert, for the comment. I forgot to mention the DRM issues, to which I have no solution. Using ABBYY Screenshot Reader is in fact my only way to import DRM PDFs into SuperMemo; I also extract the window title and the page number via AHK in order to have a source for the extract in SuperMemo. I keep thinking there must be a way…

  2. Alex
    March 22, 2010 20:03

    I had forgotten about cognywiki until I read the keyword on this blogpost. Great. Unfortunately, I could not refresh my memory on it because it’s no longer online. I found it on the wayback machine, though:

    Looking forward to further discussion on note-taking.

    • gersapa
      March 25, 2010 09:44

      I got a copy of cognywiki through the Wayback Machine too.

      Oliver told me they took his site down as he is no longer affiliated with the university hosting his cognywiki. He sent me another copy of cognywiki that is the same as the last version available through the Wayback Machine. I asked him mainly because I got an error on a macro with FF3.6.2, but if used on FF3.0 it works fine.

      Regarding these note-taking principles, I’m mostly interested in the fact that most notes need never be reviewed after you take them down, and when you do a review you’ll find you didn’t need them in the first place; hence you shouldn’t waste time organizing them. On the other hand, if you don’t take down the notes you think of, you’ll probably end up losing great ideas or important appointments. I’m currently drafting a post on the issue.

  3. mndfll
    April 4, 2010 11:17

    I have struggled with pdfs, usually either extracting a small amount of text at a time or taking screenshots of small sections with Foxit Reader.

    I liked this review of 10 pdf conversion programs/sites:

    but when I used several of the top rated programs (PDFonline, Nuance, and Nitro), they failed me on a number theory text with poor conversion of fractions and some special symbols. Is there a decent conversion program for math texts? It doesn’t need to handle pictures well, just math text.

  4. Pradery
    April 14, 2010 01:18

    This is my approach:

    I have abandoned Supermemo. At least, for some time….

    I need to read a lot, I can’t spend my time converting docs and managing daily supermemo reps.

    Now I’m dividing each book I read into blocks, typically chapters or subchapters, usually something between 5 and 20 pages. All my reading tasks are included as part of my to-do system in “My Life Organized” (with its excellent flexibility to deal with recurring tasks). Every time I finish reading a chapter or block I decide when it will appear again in my to-do list. Usually, one or more weeks, depending on the importance and my degree of knowledge.

    No more recalls. Just reading and reading with adaptive recurrence.

    • gersapa
      April 14, 2010 07:37

      This is indeed a great approach, thanks for sharing. I made a folder structure on Scrapbook a couple of months ago for reading only; it includes a structure similar to a tickler folder. However, relying only on MLO, which I also use, is a nice way to deal not only with PDFs but with any type of reading (even paper).

      For me the difficult part would be deciding when it will appear again. Do you have some predetermined times for each reading (say, a first pass after a week and a second pass after a month), or do you just go by your intuition?

      Several studies say we are not capable of judging how much we will remember after reading or studying something. I would like to know how effective you find your method, so please post a follow-up on it.

      • Pradery
        April 15, 2010 02:17

        Yes, that’s what I do with paper too, not only ebooks.

        The key is to read everything carefully, enjoying the activity. At this point I don’t care about anything but living the present moment. Just intrinsic motivations take place in this phase.

        When I finish my reading, the thing is different and now my motivations are extrinsic, connected to goals. I take my time to reflect upon the materials I’ve read and how they relate to my objectives. In this process I decide when they will be read again.

        The magic of it this is that, when you are not concerned about memorizing at all and just enjoying the pleasure of reading, you find yourself memorizing much more than you may think.

        So relax, enjoy and don’t care about memorizing, everything will be re-read again. When? Let your true and honest intuition, not a program, answer that question.

        • gersapa
          June 14, 2010 15:22

          Pradery: We are all different with regard to almost any imaginable trait, so I’m thinking that the statement:

          “The magic of it this is that, when you are not concerned about memorizing at all and just enjoying the pleasure of reading, you find yourself memorizing much more than you may think.”

          indeed works great for you. For me it’s not the same: I forget, and forget a big portion of what I read, if I don’t process it or at least take some notes. Sometimes I recall having read some book before; other times, in cases where I have tracked previously read articles, I can’t even remember having read them. Perhaps this is just interference. I’m an avid reader, some say voracious. My intention here is not to brag about my reading habits, but in my case I just forget; I need structure, and tools like this help me get it.

          Mind mapping of course helps a lot, and where information is complex I use mind maps to understand and relate concepts.

          I hope you keep sharing your reading and study methods with us.

