Better Python Scraping - Installing lxml and Beautiful Soup (wesleyzhao.com)
36 points by wesleyzhao on June 28, 2011 | hide | past | favorite | 38 comments


I've scraped dozens of sites. Beautiful Soup is great, but if you want the job done as quickly and cleanly as possible, PyQuery is the best.

It's the same as jQuery, but in Python.

Working with Beautiful Soup quickly becomes long, messy, and tedious.

With PyQuery, you get what you want with just a couple of CSS3 selectors. Simple and nice.

Wow, Android 2.2 is terrible for inputting text.


Off-topic: inputting text with Swype on Android is a breeze once you get used to it.


Hm interesting thought. Any good tutorials/sites I can use to get going on this? Or is it so simple I won't even need that. I find myself scraping a lot, so finding the best lib for that is a priority.


Just tried this out last week. The docs are pretty sparse, but it seems to mirror the jQuery interface fairly closely so if you're familiar with one, you'll have a fair idea of the other: http://packages.python.org/pyquery/api.html

That's the key advantage as I see it. If I'm scraping something it's often in a hurry and I just want it done. Not having to internalise a new API is a significant win in that respect.


I second that, PyQuery is excellent. I've even used it for manipulating an RSS feed: http://pastebin.com/QnF0Li3m


I'm learning python on the fly, but I tend to ask a lot of question on freenode's #python. Installing lxml wasn't so bad. I just did "pip install lxml" (easy_install lxml should work too) on my Debian VPS and home server. Seemed to work for me.

I am sticking with lxml for my scraping and html5lib for my rich-text parsing.


Glad to hear it was so easy for you! For some reason, before I found a great tutorial, I kept running into errors on my installations. First it was because I didn't even have setuptools installed, then I didn't have some other dev dependencies, then I just got plain stuck. But after I got it up and running it was smooth sailing RE scraping!


Installing development headers is an essential sysadmin skill. Any failed Python extension install should make it clear which headers are missing, and you will want to search for the matching package, usually suffixed with "-dev" on Debian systems or "-devel" on Red Hat systems. So if you see an issue with libjpeg, you would look for libjpeg-dev/libjpeg-devel, zlib -> zlib-dev, etc.


I much prefer Scrapy (http://scrapy.org/). BeautifulSoup is pretty outdated.


I concur, but technically Scrapy is an entire web scraping/crawling framework for writing crawlers, not just XML/HTML parsing like BeautifulSoup or lxml. You don't even have to use Scrapy's built-in processor; you can use BeautifulSoup (or whatever else) if you want. What Scrapy gets you is all the logic for crawling the web pages (requesting pages, reacting to errors, etc.). You basically just tell it what URLs to crawl, what to parse from the pages, and what to do with the parsed data. It handles all the rest. I used Scrapy just recently on an online movie site (shameless plug: www.qwink.com).


XPath/CSS path selectors for scraping are definitely the "now" way to do it. I recommend Nokogiri (http://nokogiri.org/) for Ruby users; it does pretty much the same thing in a nutshell.

What it really comes down to is 2x-3x smaller code, and it's much faster to write since you can test your CSS selector in your web browser (e.g. with Firebug) before sticking it into your code.


If I'm not mistaken LXML uses XPath and CSS path selectors right?
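For reference, lxml does support both: XPath natively, and CSS selectors via its cssselect integration. A quick sketch with invented markup:

```python
# lxml supports XPath out of the box; CSS selectors are available too
# via doc.cssselect('li.hit'), which needs the cssselect package on
# recent lxml versions
from lxml import html

doc = html.fromstring('<ul><li class="hit">one</li><li>two</li></ul>')
print(doc.xpath('//li[@class="hit"]/text()'))  # ['one']
```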


Interesting - will have to play with it when I wake up. Any reason in particular you think it is better than lxml? Beautiful Soup is only used with lxml to parse broken HTML into a DOM; lxml does most of the heavy lifting.


Scrapy is a django-like framework. It has more structure and a lot of built-in features compared to using a library like lxml.

lxml and Scrapy use the same library on the backend, libxml2.

http://doc.scrapy.org/faq.html#how-does-scrapy-compare-to-be... http://doc.scrapy.org/topics/selectors.html


If it's a framework, does it take a little longer to set up and get used to? If so, is the learning curve worth it?


I can't make a fair comparison as far as learning curve, but if you're new to scraping Scrapy might in theory be easier. You get a bunch of tools built-in that you don't have to come up with yourself, and you get a pre-defined way of doing things. However, last time I checked the example project in the docs was a little lacking and there's not many good external examples either.

I personally really like having the structure of a framework. It lets me churn out simple projects from boilerplate very quickly, and it helps keep larger projects organized.


How is BeautifulSoup "outdated"? It's meant for making it easy to scrape non-properly-formatted sites, and works very well at that, as I've found out in practice.
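Broken markup is indeed the whole point. A tiny sketch of the kind of thing it shrugs off (bs4 shown here, which postdates this thread; the markup is invented):

```python
# Beautiful Soup happily parses truncated, never-closed markup
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><a href='/story'>Read more", "html.parser")
print(soup.a["href"], soup.a.get_text())  # /story Read more
```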

What's the advantage of Scrapy?


http://www.crummy.com/software/BeautifulSoup/3.1-problems.ht...

I guess it looks like BeautifulSoup finally got a 4.0 alpha release out which supposedly works, but that took several years. The codebase is aged and releases are incredibly slow.

"I no longer enjoy working on Beautiful Soup..."

"Parsing is no longer a competitive advantage for Beautiful Soup, and I'd be happier if I could get out of the parser business altogether."


Very interesting... I typically think you should not use a product that its founder/developer no longer believes in.


Why not "sudo apt-get install python-lxml python-beautifulsoup"? Difficult to make the "olde" argument when you're installing dependencies from apt.


That doesn't work with virtualenv, does it? The advantage of pip and easy_install is that you can install it per project without fucking up global dependencies, etc.


It should be fine as long as you don't use --no-site-packages


Does that work? If I remember correctly, I tried that and it was one of the things that failed - something would not compile, I believe. I am still a noob, so if that works and I just messed it up, I wouldn't be surprised. Do you know from experience that that is all you need?


... of course it will work. And there's no compilation step involved either; I've never encountered a Debian package that was not a binary distribution (maybe there are some that require build-essential, but this package surely does not depend on it).


I'll run this on a clean Ubuntu instance and if this works I'll update my post! Thanks!!


Yes, that's really all you need. I've just tried it on a couple of different versions of Ubuntu, and it worked on them all.

If you find a version of ubuntu or debian where that doesn't work, file a bug!


Worked for me when I did it on Debian Squeeze. I later ended up uninstalling the apt-get version and installing the newer version from PyPI via pip.


Interesting - so would this not have worked for my Ubuntu ami instance or my Ubuntu home machine?


I can't speak for Ubuntu, but it's supposed to be reasonably stable and there is absolutely no reason to believe that package installation for this sort of software would differ between machines.


Works just fine on Ubuntu.


Just tried that on a new Ubuntu instance...it worked!!! Wow...


Is mechanize (http://wwwsearch.sourceforge.net/mechanize/) considered outdated or convoluted? It's what I've used for my scrapings.

Also, how well do these other scrapers handle Javascript? I've had to abandon some scrapes from ASP pages because they wouldn't properly handle it.


It's a different tool. Mechanize concentrates on navigating and downloading pages, following links, handling cookies, etc. BeautifulSoup and lxml parse information out of the HTML.

There's some overlap, but not much. I have tended to use BeautifulSoup and mechanize together. As mentioned above, BeautifulSoup is no longer being actively maintained, and I'd recommend starting with lxml in most cases. I'm still using BeautifulSoup mainly because I have most of the package memorized.


Thanks, I use the same combination, but haven't needed to make any new parses in a while. I'll read up on lxml for the next time.


Is there any other tool that easily helps with JS parsing? This would actually be extremely useful, since there are so many JavaScript-generated pages.


Maybe I missed it: why aren't you using pip? As I recall, the setup is as simple as: sudo pip install lxml or sudo pip install BeautifulSoup. If you're learning Python, definitely learn pip. Pip will make your life easier! :)


Just to throw in my own "I like X better": I recently had some pages that Beautiful Soup just choked on and couldn't parse.

I like html5lib, which will even spit out a Beautiful Soup parse tree if that's your thing.
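A sketch of the html5lib side, using its current API (older versions could also hand back a Beautiful Soup tree via a treebuilder argument, as mentioned above; the markup is invented):

```python
# html5lib recovers from broken markup the same way a browser would
import html5lib

doc = html5lib.parse("<p>unclosed <b>tags <p>everywhere",
                     namespaceHTMLElements=False)  # plain ElementTree element
print(len(doc.findall(".//p")))  # 2
```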


Does this page kill the chrome tab for anyone else?



