"Gutenberg Shuffle" by Isaac Karth #63

Open
ikarth opened this Issue · 7 comments

ikarth

I have no idea what the result is going to look like, but I'm in.

So far I've just parsed out all of the non-dialog bits of Pride & Prejudice. It makes for a very short pamphlet.

ikarth

Progress report: After wading through the bog of text parsing, I've emerged on temporarily dry ground, having encountered a couple of artifacts along the way:
Pride and Prejudice: Action Edition (53K words)
Take Jane Austen, leave out all of that talking stuff. Produced as a side effect of something else, though cleaning it up for output took a bit of learning on my part. This is the exact output from the program.
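For readers curious what "leave out all of that talking stuff" might look like in practice, here is a minimal sketch in Python (not the project's Clojure, and not the author's code). It assumes dialog is marked with matched double quotes inside a single paragraph and that paragraphs are separated by blank lines. The names `strip_dialog` and `action_edition` are made up for illustration.

```python
import re

# Matches text enclosed in straight or curly double quotes (assumed dialog markers).
QUOTED = re.compile(r'"[^"]*"|“[^”]*”')


def strip_dialog(paragraph: str) -> str:
    """Remove quoted speech from one paragraph and tidy the leftovers."""
    text = QUOTED.sub("", paragraph)
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)  # drop spaces left before punctuation
    text = re.sub(r"\s{2,}", " ", text)           # collapse runs of whitespace
    return text.strip()


def action_edition(text: str) -> str:
    """Strip dialog from every paragraph; drop paragraphs with no letters left."""
    paragraphs = (strip_dialog(p) for p in text.split("\n\n"))
    return "\n\n".join(p for p in paragraphs if re.search(r"[A-Za-z]", p))
```

A speech that runs across several paragraphs (an opening quote with no close in the same paragraph) is not handled here, which is probably part of why the cleanup took some learning.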

Gutenberg Shuffle (473K words)

Take a collection of texts from Gutenberg, generate a list of all of the names you find. Swap the names around. Shuffle the sentences. Shuffle and redeal for every paragraph.
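The recipe above can be sketched roughly as follows, in Python rather than the project's Clojure. This is one reading of "shuffle and redeal for every paragraph", not the author's implementation. It assumes the sentences and the list of names have already been extracted (the thread says OpenNLP is used for that), and `swap_names` and `shuffle_book` are hypothetical helpers.

```python
import random
import re


def swap_names(sentences, names, rng):
    """Replace each known name with another one, using a random permutation."""
    originals = sorted(names)
    if not originals:
        return list(sentences)
    shuffled = originals[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(originals, shuffled))
    # Longest names first, so a full name wins over a shorter name it contains.
    alternatives = "|".join(
        re.escape(n) for n in sorted(originals, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    # One pass per sentence, so a swapped-in name is never swapped again.
    return [pattern.sub(lambda m: mapping[m.group(1)], s) for s in sentences]


def shuffle_book(sentences, names, n_paragraphs, sentences_per_paragraph, seed=0):
    """Build paragraphs by re-dealing names and reshuffling sentences each time."""
    rng = random.Random(seed)
    paragraphs = []
    for _ in range(n_paragraphs):
        renamed = swap_names(sentences, names, rng)
        rng.shuffle(renamed)
        paragraphs.append(" ".join(renamed[:sentences_per_paragraph]))
    return "\n\n".join(paragraphs)
```

Because the name list is treated as an opaque set of strings, anything the name finder mistakes for a name gets dealt out like any other character.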

I'd like to push this further, and look at ways to get more coherence and systems behind the text selection, but I do like the opening. It's also a bit odd about what it considers to be a name; hence there are characters called Journey, London, and Mizzen-mast Hill's adopted daughter, Mizzen-mast Hill.

The program is far from anything I'd consider finished (or good) but it's producing output, which is a nice start.

ikarth

The repository, if you really want to look at the code.
Written in Clojure/Java and using the OpenNLP library.

Michael Paulukonis

@ikarth "cleaning it up for output took a bit of learning on my part." - that's half the reason to be involved in NaNoGenMo!

Neat stuff.

ikarth

I've got another novel in the works, but the generator is still running.

Michael Paulukonis

How long does it take to run?

And... is that really 473 THOUSAND words?

ikarth

Well, I haven't counted them by hand. "Gutenberg Shuffle" came from generating 5000 paragraphs, which turned out to be overkill. I don't think it's structurally interesting over all 5000 (I haven't read all 5000) but there's some interesting imagery in there. Tempted to use it as a source of prompts for creative writing exercises.

Unfortunately, I have no idea how long it will take, because I haven't run the updated generator with this much source text before. It could probably stand a little optimization. And I forgot to add a progress indicator and I don't want to fiddle with the running generator at the moment.

ikarth

Well, it took 8 hours... and crashed with a NullPointerException. After tracing the crash to a function that was trying to treat a non-string tag as a string, I rewrote the sentence generator to fix the bug. I also added some progress indicators and rearranged the algorithm so it only does the heavy parsing on the sentences that are actually being used. Together with some parallelization, this brings the generation time down to about 15 minutes; the run is now mostly bottlenecked by the length of the book rather than the length of the source texts.
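The restructuring described here (pick the sentences first, then do the expensive parsing only on those, in parallel) might look something like this sketch. It is in Python rather than Clojure and is not the author's code. `heavy_parse` is a hypothetical stand-in for the real NLP step, and the tag names are assumptions.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def heavy_parse(sentence):
    """Hypothetical stand-in for the expensive tagging / name-finding step."""
    # A token gets the tag "NNP" if capitalized, otherwise no tag at all (None).
    return [(tok, "NNP" if tok[:1].isupper() else None) for tok in sentence.split()]


def render(tagged):
    """Rebuild a sentence; tags may be None, so compare instead of calling str methods."""
    return " ".join(tok.upper() if tag == "NNP" else tok for tok, tag in tagged)


def generate(sentences, n_needed, seed=0, workers=4):
    """Select sentences first, then parse only the selected ones concurrently."""
    rng = random.Random(seed)
    chosen = rng.sample(sentences, min(n_needed, len(sentences)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed = list(pool.map(heavy_parse, chosen))  # map keeps input order
    return [render(p) for p in parsed]
```

With this shape the parsing cost scales with `n_needed`, the length of the output, rather than with the size of the corpus. In CPython a thread pool will not speed up CPU-bound parsing; `ProcessPoolExecutor` has the same `map` interface and would be the usual substitute.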

I therefore present:
POMERANIAN-HAMBLETONIAN-RED-IRISH-COCHIN-CHINA-STOKE-POGIS: A Novel (76K Words)
I'm particularly fond of the last line in the second paragraph.

I have a bunch of things I'd like to improve, particularly a better random sentence selection that sticks similar sentences together. Maybe a biased shuffle, maybe something more elaborate.
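One possible shape for a "biased shuffle" is sketched below in Python. This is purely an illustration of the idea, not something from the repository. It assumes similarity is measured as word overlap (Jaccard) between sentences; the `bias` parameter and the function names are made up here.

```python
import random
import re


def words(sentence):
    """Lowercased set of the words in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two word sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def biased_shuffle(sentences, bias=4.0, seed=0):
    """Random ordering in which similar sentences tend to follow each other."""
    rng = random.Random(seed)
    remaining = list(sentences)
    if not remaining:
        return []
    current = remaining.pop(rng.randrange(len(remaining)))
    ordered = [current]
    while remaining:
        current_words = words(current)
        # Every candidate keeps a base weight of 1.0, so the result stays a shuffle.
        weights = [1.0 + bias * jaccard(current_words, words(s)) for s in remaining]
        index = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        current = remaining.pop(index)
        ordered.append(current)
    return ordered
```

Setting `bias=0` reduces this to a plain uniform shuffle, and larger values pull sentences with shared vocabulary closer together. Recomputing the weights at every step makes it quadratic in the number of sentences, so it would want some pruning before being run over a whole corpus.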
