Hm interesting thought. Any good tutorials/sites I can use to get going on this? Or is it so simple I won't even need that. I find myself scraping a lot, so finding the best lib for that is a priority.
Just tried this out last week. The docs are pretty sparse, but it seems to mirror the jQuery interface fairly closely so if you're familiar with one, you'll have a fair idea of the other: http://packages.python.org/pyquery/api.html
That's the key advantage as I see it. If I'm scraping something it's often in a hurry and I just want it done. Not having to internalise a new API is a significant win in that respect.
I'm learning Python on the fly, but I tend to ask a lot of questions on freenode's #python. Installing lxml wasn't so bad: I just did "pip install lxml" (easy_install lxml should work too) on my Debian VPS and home server, and it seemed to work for me.
I'm sticking with lxml for my scraping and html5lib for my rich-text parsing.
Glad to hear it was so easy for you! For some reason, before I found a great tutorial, I kept running into errors on my installations. First it was because I didn't even have setuptools installed, then I didn't have some other dev dependencies, then I just got plain stuck. But after I got it up and running it was smooth sailing RE scraping!
Installing development headers is an essential sysadmin skill. Any failed Python extension install should make it clear which headers are missing, and you'll want to search for the matching package, usually suffixed with "-dev" on Debian or "-devel" on RedHat systems. So if you see an issue with libjpeg, you'd look for libjpeg-dev/libjpeg-devel, zlib -> zlib1g-dev/zlib-devel, etc.
I concur, but technically Scrapy is an entire web scraping/crawling framework for writing crawlers, not just an XML/HTML parser like BeautifulSoup or lxml. You don't even have to use Scrapy's built-in processor; you can use BeautifulSoup (or whatever else) if you want. What Scrapy gets you is all the logic for crawling the web pages (requesting pages, reacting to HTTP errors, etc.). You basically just tell it what URLs to crawl, what to parse from the pages, and what to do with the parsed data, and it handles all the rest. I used Scrapy just recently on an online movie site (shameless plug: www.qwink.com).
XPath/CSS path selectors for scraping are definitely the "now" way to do it. I recommend Nokogiri (http://nokogiri.org/) for the Ruby users, it does pretty much the same thing in a nutshell.
What it really comes down to is 2x-3x less code, plus it's much faster to write since you can just test your CSS selector in your web browser (e.g. with FireBug) before sticking it into your code.
Interesting - will have to play with it when I wake up. Any reason in particular you think it's better than lxml? Beautiful Soup is only used with lxml to parse broken HTML into a DOM; lxml does most of the heavy lifting.
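For what it's worth, lxml's own HTML parser already copes with a fair bit of broken markup by itself - a quick sketch (the snippet is invented):

```python
# lxml.html repairs malformed markup on its own; a fallback parser
# like BeautifulSoup is only needed for the really pathological cases.
import lxml.html

broken = '<p>unclosed paragraph <b>bold with no end tag'
doc = lxml.html.fromstring(broken)
text = doc.text_content()
```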
I can't make a fair comparison as far as learning curve, but if you're new to scraping, Scrapy might in theory be easier. You get a bunch of tools built in that you don't have to come up with yourself, and you get a pre-defined way of doing things. However, last time I checked the example project in the docs was a little lacking, and there aren't many good external examples either.
I personally really like having the structure of a framework. It lets me churn out simple projects from boilerplate very quickly, and it helps keep larger projects organized.
How is BeautifulSoup "outdated"? It's meant for making it easy to scrape non-properly-formatted sites, and works very well at that, as I've found out in practice.
I guess it looks like BeautifulSoup finally got a 4.0 alpha release out which supposedly works, but that took several years. The codebase is aged and releases are incredibly slow.
"I no longer enjoy working on Beautiful Soup..."
"Parsing is no longer a competitive advantage for Beautiful Soup, and I'd be happier if I could get out of the parser business altogether."
That doesn't work with virtualenv, does it? The advantage of pip and easy_install is that you can install it per project without fucking up global dependencies, etc.
Does that work? If I remember correctly, I may have tried that and it was one of the things that failed - something wouldn't compile, I believe. I'm still a noob, so if that works and I just messed it up I wouldn't be surprised. Do you know from experience that that's all you need?
... of course it will work. And there's no compilation step involved, either, as I've never encountered a Debian package that was not a binary distribution (but maybe there are some, requiring build-essential? This package surely does not depend on build-essential).
I can't speak for Ubuntu, but it's supposed to be reasonably stable and there is absolutely no reason to believe that package installation for this sort of software would differ between machines.
It's a different tool. Mechanize concentrates on navigating pages: downloading pages, following links, handling cookies, etc. BeautifulSoup and lxml parse information out of the HTML.
There's some overlap, but not much. I have tended to use BeautifulSoup and mechanize together. As mentioned above, BeautifulSoup is no longer being actively maintained, and I'd recommend starting with lxml in most cases. I'm still using BeautifulSoup mainly because I have most of the package memorized.
Maybe I missed it: why aren't you using pip? As I recall, the setup is as simple as: sudo pip install lxml or sudo pip install BeautifulSoup. If you're learning Python, definitely learn pip. Pip will make your life easier! :)
It's the same as jQuery, but in Python.
Working with Beautiful Soup quickly becomes long and messy and tedious.
With pyquery, you get what you want with just a couple of CSS 3 selectors. Simple and nice.
wow android 2.2 is terrible for inputting text