Better Python Scraping - Installing lxml and Beautiful Soup (wesleyzhao.com)
36 points by wesleyzhao on June 28, 2011 | hide | past | favorite | 38 comments


I've scraped dozens of sites. Beautiful Soup is great, but if you want the job done as quickly and cleanly as possible, PyQuery is the best.

It's the same as jQuery, but in Python.

Working with Beautiful Soup quickly becomes long, messy, and tedious.

With PyQuery, you get what you want with just a couple of CSS3 selectors. Simple and nice.

Wow, Android 2.2 is terrible for inputting text.


Off-topic: inputting text with Swype on Android is a breeze once you get used to it.


Hm interesting thought. Any good tutorials/sites I can use to get going on this? Or is it so simple I won't even need that. I find myself scraping a lot, so finding the best lib for that is a priority.


Just tried this out last week. The docs are pretty sparse, but it seems to mirror the jQuery interface fairly closely so if you're familiar with one, you'll have a fair idea of the other: http://packages.python.org/pyquery/api.html

That's the key advantage as I see it. If I'm scraping something it's often in a hurry and I just want it done. Not having to internalise a new API is a significant win in that respect.


I second that, PyQuery is excellent. I've even used it for manipulating an RSS feed: http://pastebin.com/QnF0Li3m


I'm learning python on the fly, but I tend to ask a lot of question on freenode's #python. Installing lxml wasn't so bad. I just did "pip install lxml" (easy_install lxml should work too) on my Debian VPS and home server. Seemed to work for me.

I am sticking with lxml for my scraping and html5lib for my rich-text parsing.


Glad to hear it was so easy for you! For some reason, before I found a great tutorial, I kept running into errors on my installations. First it was because I didn't even have setuptools installed, then I didn't have some other dev dependencies, then I just got plain stuck. But after I got it up and running it was smooth sailing RE scraping!


Installing development headers is an essential sysadmin skill. Any failed Python extension install should make it clear which headers are missing, and you will want to search for the matching package, usually suffixed with "-dev" on Debian systems or "-devel" on Red Hat systems. So if you see an issue with libjpeg, you would look for libjpeg-dev/libjpeg-devel, zlib -> zlib-dev, etc.


I much prefer Scrapy (http://scrapy.org/). BeautifulSoup is pretty outdated.


I concur, but technically Scrapy is an entire web scraping/crawling framework for writing crawlers, not just XML/HTML parsing like BeautifulSoup or lxml. You don't even have to use Scrapy's built-in processor; you can use BeautifulSoup (or whatever else) if you want. What Scrapy gets you is all the logic for crawling the web pages (requesting pages, reacting to errors, etc.). You basically just tell it what URLs to crawl, what to parse from the pages, and what to do with the parsed data. It handles all the rest. I used Scrapy just recently on an online movie site (shameless plug: www.qwink.com).


XPath/CSS path selectors for scraping are definitely the "now" way to do it. I recommend Nokogiri (http://nokogiri.org/) for Ruby users; it does pretty much the same thing in a nutshell.

What it really comes down to is 2x-3x smaller code, and it's much faster to write since you can test your CSS selector in your web browser (e.g. with Firebug) before sticking it into your code.


If I'm not mistaken LXML uses XPath and CSS path selectors right?
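For reference, lxml does support both: XPath natively, and CSS selectors via its cssselect integration. A quick sketch with invented markup:

```python
# lxml supports XPath out of the box; CSS selectors are available too
# via doc.cssselect('li.hit'), which needs the cssselect package on
# recent lxml versions
from lxml import html

doc = html.fromstring('<ul><li class="hit">one</li><li>two</li></ul>')
print(doc.xpath('//li[@class="hit"]/text()'))  # ['one']
```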


Interesting - will have to play with it when I wake up. Any reason in particular you think it is better than lxml? Beautiful Soup is only used with lxml to parse broken HTML into a DOM; lxml does most of the heavy lifting.


Scrapy is a django-like framework. It has more structure and a lot of built-in features compared to using a library like lxml.

lxml and Scrapy use the same library on the backend, libxml2.

http://doc.scrapy.org/faq.html#how-does-scrapy-compare-to-be... http://doc.scrapy.org/topics/selectors.html


If it's a framework, does it take a little longer to set up and get used to? If so, is the learning curve worth it?


I can't make a fair comparison as far as learning curve, but if you're new to scraping Scrapy might in theory be easier. You get a bunch of tools built-in that you don't have to come up with yourself, and you get a pre-defined way of doing things. However, last time I checked the example project in the docs was a little lacking and there's not many good external examples either.

I personally really like having the structure of a framework. It lets me churn out simple projects from boilerplate very quickly, and it helps keep larger projects organized.


How is BeautifulSoup "outdated"? It's meant for making it easy to scrape non-properly-formatted sites, and works very well at that, as I've found out in practice.
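Broken markup is indeed the whole point. A tiny sketch of the kind of thing it shrugs off (bs4 shown here, which postdates this thread; the markup is invented):

```python
# Beautiful Soup happily parses truncated, never-closed markup
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><a href='/story'>Read more", "html.parser")
print(soup.a["href"], soup.a.get_text())  # /story Read more
```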

What's the advantage of Scrapy?


http://www.crummy.com/software/BeautifulSoup/3.1-problems.ht...

I guess it looks like BeautifulSoup finally got a 4.0 alpha release out which supposedly works, but that took several years. The codebase is aged and releases are incredibly slow.

"I no longer enjoy working on Beautiful Soup..."

"Parsing is no longer a competitive advantage for Beautiful Soup, and I'd be happier if I could get out of the parser business altogether."


Very interesting... I typically think you should not use a product that its founder/developer no longer believes in.


Why not "sudo apt-get install python-lxml python-beautifulsoup"? Difficult to make the "olde" argument when you're installing dependencies from apt.


That doesn't work with virtualenv, does it? The advantage of pip and easy_install is that you can install it per project without fucking up global dependencies, etc.


It should be fine as long as you don't use --no-site-packages


Does that work? If I remember correctly, I tried that and it was one of the things that failed - something would not compile, I believe. I am still a noob, so if that works and I just messed it up, I wouldn't be surprised. Do you know from experience that that is all you need?


... of course it will work. And there's no compilation step involved either; I've never encountered a Debian package that was not a binary distribution (maybe there are some that require build-essential, but this package surely does not depend on it).


I'll run this on a clean Ubuntu instance and if this works I'll update my post! Thanks!!


Yes, that's really all you need. I've just tried it on a couple of different versions of Ubuntu, and it worked on them all.

If you find a version of ubuntu or debian where that doesn't work, file a bug!


Worked for me when I did it on Debian Squeeze. I later ended up uninstalling the apt-get version and installing the newer version from PyPI via pip.


Interesting - so would this not have worked for my Ubuntu ami instance or my Ubuntu home machine?


I can't speak for Ubuntu, but it's supposed to be reasonably stable and there is absolutely no reason to believe that package installation for this sort of software would differ between machines.


Works just fine on Ubuntu.


Just tried that on a new Ubuntu instance...it worked!!! Wow...


Is mechanize (http://wwwsearch.sourceforge.net/mechanize/) considered outdated or convoluted? It's what I've used for my scrapings.

Also, how well do these other scrapers handle Javascript? I've had to abandon some scrapes from ASP pages because they wouldn't properly handle it.


It's a different tool. Mechanize concentrates on navigating and downloading pages, following links, handling cookies, etc. BeautifulSoup and lxml parse information out of the HTML.

There's some overlap, but not much. I have tended to use BeautifulSoup and mechanize together. As mentioned above, BeautifulSoup is no longer being actively maintained, and I'd recommend starting with lxml in most cases. I'm still using BeautifulSoup mainly because I have most of the package memorized.


Thanks, I use the same combination, but haven't needed to make any new parses in a while. I'll read up on lxml for the next time.


Is there any other tool that easily helps with JS parsing? This would actually be extremely useful, since there are so many JavaScript-generated pages.


Maybe I missed it: why aren't you using pip? As I recall, the setup is as simple as: sudo pip install lxml or sudo pip install BeautifulSoup. If you're learning Python, definitely learn pip. Pip will make your life easier! :)


Just to throw in my own "I like X better": I recently had some pages that Beautiful Soup just choked on and couldn't parse.

I like html5lib, which will even spit out a Beautiful Soup parse tree if that's your thing.
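A sketch of the html5lib side, using its current API (older versions could also hand back a Beautiful Soup tree via a treebuilder argument, as mentioned above; the markup is invented):

```python
# html5lib recovers from broken markup the same way a browser would
import html5lib

doc = html5lib.parse("<p>unclosed <b>tags <p>everywhere",
                     namespaceHTMLElements=False)  # plain ElementTree element
print(len(doc.findall(".//p")))  # 2
```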


Does this page kill the chrome tab for anyone else?



