Pull content from an "old school" tabular design webpage

dnguyen411 · September 2007

I've been searching the web for the best way to do this but haven't found anything. Basically, I'm redesigning an internal web site for my company to comply with HTML 4.01 standards and CSS. I need to grab the content and links from these old pages, but the guy who designed them used tables for layout. Some of the pages are long and I don't have time to retype and re-link all the content.

Is there a way to just grab the content and links without all the stupid table layout crap? I would prefer a script or a program to do this.

Thanks,

Apreche · September 2007

Sounds like a job for regular expressions.
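For a rough idea of what that could look like, here is a minimal Python sketch that deletes the table layout tags and leaves the text and the <a> links in place. The function name strip_table_tags and the sample markup are made up for illustration, and a pattern like this will likely trip on messy pages (tags split across odd places, a ">" inside an attribute value), so treat it as a starting point only.

```python
import re

# Matches opening and closing table-layout tags, with or without attributes.
TABLE_TAG = re.compile(
    r"</?(?:table|tbody|thead|tfoot|tr|td|th)\b[^>]*>", re.IGNORECASE
)

def strip_table_tags(html):
    """Remove table markup but leave the text and <a> links inside it alone."""
    # Replace each layout tag with a newline so cell contents stay separated.
    without_tables = TABLE_TAG.sub("\n", html)
    # Drop the blank lines left behind by the removed tags.
    lines = [line.strip() for line in without_tables.splitlines()]
    return "\n".join(line for line in lines if line)

# Hypothetical snippet standing in for one of the old pages.
sample = (
    '<TABLE border="0"><tr><td><a href="hr.html">HR forms</a></td>'
    "<td>Updated weekly</td></tr></TABLE>"
)
print(strip_table_tags(sample))
```

Each table cell ends up on its own line, which should make it easier to paste into a new layout.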

dnguyen411 · September 2007

I looked at the Wikipedia article for Regular Expression. It sounds like it will do the job but I have very little (i.e. none) experience in writing regular expressions. Any hints on how to get started?

Apreche · September 2007

It's hard. Start doing research on "HTML scraping". You are going to need to write programs.
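As one possible starting point, here is a small sketch using the HTMLParser class from Python's standard library. It walks the page, ignores the layout tags, and collects the visible text plus each link's href and label. The class name ContentExtractor, the helper extract, and the sample page are all hypothetical, and real pages will probably need extra handling (skipping script and style blocks, for example).

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collects visible text and (href, link text) pairs, ignoring layout tags."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []   # every non-empty piece of text, in page order
        self.links = []         # (href, link text) tuples
        self._href = None       # href of the <a> currently open, if any
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        # Only <a> tags matter here; table, tr, td and the rest are skipped.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._link_text = []

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.text_chunks.append(text)
        if self._href is not None:
            self._link_text.append(text)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._link_text)))
            self._href = None

def extract(html):
    """Return (list of text chunks, list of (href, text) links) for a page."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.text_chunks, parser.links

# Hypothetical snippet standing in for one of the old pages.
sample = (
    "<table><tr><td><a href='policies.html'>Policies</a></td>"
    "<td>Read before Friday</td></tr></table>"
)
text, links = extract(sample)
print(text)
print(links)
```

To run it over a real page, read the file into a string and pass that to extract instead of the sample.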

dnguyen411 · September 2007

Ahhh.... I hate programming.

I found a website with your suggestion (http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html)

I guess once I get this program down, the actual extraction will be much easier.

Thanks.
