- Oct. 4th, 2011
- 7 comments
After SMX, everyone is scrambling to support Schema.org's embodiment of Microformats. Our customers are asking us to implement it, but I don't think everyone is thinking this through. What do you think? This is what I think—
1. It's Sex for Spammers
Take a Mozilla Instance, fire up jQuery, walk the internet as you please, and select the same things over and over again. It changes a whole lot for spammers because you can write exactly one commodity spammers' toolkit to handle exactly 1 Microformat standard (Schema.org) and get reliably clean data. It hurts you. It hurts Google. It hurts the internet. There is no way to avoid this.
Vendors will be ripping off other vendors trivially for their painstakingly massaged data. But Google doesn't care about that. To Google, information is free. Information is free, but facilitating plagiarism on a massive scale of refined embodiments of that information is a big deal. Furthermore, the synthesis of the information is copyrighted. We all know vendors scrape other vendors' product PDFs, descriptions, etc., but up until now it required writing some custom code per site. Tools like Boilerpipe and DiffBot make this easier, but with Microformats this becomes trivial for even amateur spammers with 1 silly little toolkit.
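To make the "one toolkit for the whole internet" worry concrete, here is a minimal sketch of how little code a generic extractor needs once every site uses the same `itemprop` attributes. It is an illustration, not anyone's actual tool: the names `ItempropCollector`, `extract_itemprops`, and the sample product markup are hypothetical, and it simplifies the microdata rules (it reads `content`, `src`, or `href` for `meta`, `img`, and `link`, and plain text for everything else). It uses only Python's standard `html.parser`.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Void elements whose microdata value lives in an attribute, not in text.
VALUE_ATTR = {"meta": "content", "img": "src", "link": "href"}


@dataclass(eq=False)
class _OpenProp:
    tag: str                 # element name that carries the itemprop
    prop: str                # the itemprop value, e.g. "name"
    depth: int = 0           # same-named elements currently nested inside
    chunks: list = field(default_factory=list)


class ItempropCollector(HTMLParser):
    """Collect (itemprop, value) pairs from any page, with no per-site code."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (property, value) pairs
        self._open = []      # itemprop elements still being read

    def handle_starttag(self, tag, attrs):
        # Track nesting so we can find the right closing tag later.
        for entry in self._open:
            if entry.tag == tag:
                entry.depth += 1
        attr_map = dict(attrs)
        prop = attr_map.get("itemprop")
        if prop is None:
            return
        if tag in VALUE_ATTR:
            self.pairs.append((prop, attr_map.get(VALUE_ATTR[tag]) or ""))
        else:
            self._open.append(_OpenProp(tag, prop))

    def handle_data(self, data):
        for entry in self._open:
            entry.chunks.append(data)

    def handle_endtag(self, tag):
        matching = [e for e in self._open if e.tag == tag]
        if not matching:
            return
        innermost = matching[-1]
        if innermost.depth == 0:
            # This closing tag ends the itemprop element itself.
            matching.pop()
            self._open.remove(innermost)
            text = " ".join("".join(innermost.chunks).split())
            self.pairs.append((innermost.prop, text))
        # Any remaining same-named entries just lost one nested element.
        for entry in matching:
            entry.depth -= 1


def extract_itemprops(html):
    """Return every (itemprop, value) pair found in the given HTML string."""
    parser = ItempropCollector()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Hypothetical product page fragment, invented for this example.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Blue <b>Skirt</b></span>
  <meta itemprop="sku" content="SK-1">
  <span itemprop="price">19.99</span>
</div>
"""

if __name__ == "__main__":
    print(extract_itemprops(SAMPLE))
```

The point of the sketch is that nothing in it is specific to one vendor's templates; the same few dozen lines would run against any site that adopts the shared vocabulary, where previously a regex or selector set per site was needed.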
2. It's Broken
For example, the markup for breadcrumbs is idiotic. This was obviously a rush job, and it's not a community effort. Nope. Google owns it, much like Sitemaps. See #4.
The test tool doesn't work, ironically, mostly for the Schema.org format. The preview rarely, if ever, works, and it doesn't even understand the (broken) breadcrumb microformat.
I would link directly to the part of the page that describes the specification for breadcrumbs, but their markup is ridiculously bad. I'm not saying I'm a saint, but if you're going to preach about the semantic web, at least use semantically-meaningful markup.
3. It's a Cop Out
Google is admitting its natural language processing just isn't there yet. Much like rel=canonical and friends, this is basically an admission by Google that it can't really figure this stuff out, so you'll just have to do it. It's like asking your friends to diagram their sentences.
Buzz (VB.) off (ADV.)?
The next step is that they get everything force fed to you, use it for Google Products, and don't even want or need your navigation. Trust me, that's next.
On a truly semantic web, Google can take your ItemPages, but ignore all of your CollectionPages. After all, they have your SKU, product name, description, and price. They don't need you. Once everyone realizes how toxic this can be, it will be too late. Much like those people who vow they won't shop at Walmart, the follow-through never really materializes. They're not going to remove the data.
4. It's a Power Grab
See #2. This is a rush job. RDFa-based approaches are much cleaner, but we have to do what Google says. Google also laid down the law on Sitemaps, if we recall. Google, after all, wants to organize the world's information, so it's natural that they'd also want to control the underlying format. Some others have noticed this. See:
http://www.readwriteweb.com/archives/is_schemaorg_really_a_google_land_grab.php
5. It's Yet Another Thing To Do & Maintain
When you re-template, you're going to have to do it all over again. No, you can't just throw all the data in a hidden DIV. GodGoogle said you'll be turned into a pillar of salt if you try to make your life easier. Whenever you modify your document, you're just going to have to make sure you don't mess up the microformat data.
See: http://www.google.com/support/webmasters/bin/answer.py?answer=1093493#hidden
I'm not in love with this. Not at all, but on some level I'm also disagreeing with Tim Berners-Lee, so take this with a grain of salt. Joost de Valk's blog has only positive thoughts on this topic. That's not to say there aren't positive aspects. It's just pretty clear that there are things to worry about.
"7 Wise Comments Banged Out Somewhere On The Internet ..."
A note, at the risk of being pedantic, that schema.org is a vocabulary for microdata, not microformats. The former relies on markup using HTML5-specific attributes, the latter relies on class attributes: in this and in many other respects, they are quite different animals. I think you're spot on in your criticism of the rich snippets tool, though I would extend that criticism to all structured data formats that the tool supposedly supports. The tool has never worked well, and as such is of limited use to webmasters. I take exception, though, to your assertion that schema.org support is "an admission by Google that it can't really figure this stuff out." This pits structured markup against natural language processing as an either-or situation, whereas in fact one augments the other. Mechanisms of the semantic web, of which microdata is one, separate the presentation layer (what humans see) from the data layer (what machines see). This facilitates more precise classification of resources because of the exactitude that is possible in that data layer. It doesn't matter how "good" Google gets at "figuring out" what a resource is about: structured markup provides richer information about resources that can't reliably be inferred from flat content or its context. The reason product feeds are required for Google Product Search isn't because Google can't "figure out" ecommerce sites, but because it allows Google to reliably offer more granular results and refinements based on the data that appears in them. Conceptually, this is the same impetus behind the support of structured data. In turn, information in the data layer can be compared against the content and context (links, site topicality, etc.) in the presentation layer to assess the veracity of both. Spamming using structured data is not as simple as marking up your code: Google and Bing don't inherently "trust" microdata any more than they do microformats, RDFa, an alt attribute, or the content of a tag.
I too am none too happy that the decision to support microdata and the specifics of the schema.org vocabulary were not a community effort (though it was not, as you suggest, solely a Google initiative, but was jointly supported by the two major search engines … and Yahoo:). However, the recent workshop has started to forge bonds with the semantic web community, including involving W3 in the effort: let's hope this spirit of cooperation continues, and bears fruit in the form of improvements.
Not to mention, making it possible for Search Engines to eventually display your rich data right in the SERPs, possibly denying you a click. One can envision entire recipes being displayed at some point for instance. Great post and great points all around.
@Aaron I'll correct the terminology. Thanks. I'm usually precise, but to me, sometimes data is data. I have some of the same reactions to buzzwords surrounding XML. It's all just data to me. Terminology aside, I think we're substantially in agreement on many points. Even if you're right that this stuff augments search technology (and that it's not mutually exclusive), I think it's still a shift. Google has historically had a very holier-than-thou attitude about doing anything via human assistance. After all, a human can figure out the semantics of a document based on appearance. Humans don't need this sort of augmentation. I guess there's a difference as far as scalability goes if they're making us do it vs. doing it themselves, but it's still human assistance. I think you're misunderstanding my spam concern. I'm not concerned that this will create the opportunity for on-site spam. They avoid that by requiring that the markup describe actual data (see my last point). Rather, I'm concerned that it allows script-kiddie spammers to trivially rip you off. At least before, you needed a regex per site or something. Now it's just a bunch of jQuery selectors that will work abstractly for the entire internet! What's really going on? I think Google wants to augment unstructured search with faceted search, much like eCommerce has done (and with much success!). Bag-of-words and proximity only get you so far. Faceted search is great for data exploration and clarifying polysemy. Once we mark up our documents more, and indicate attributes within, say, our product pages, they can do all this, and we work for them. A further hint of this is here: "They seem like new pages (the set of items are different from all other pages), but there is actually no new content on them, since all the blue skirts were already included in the original three pages. There’s no need to crawl URLs that narrow the content by color, since the content served on those URLs was already crawled."
Really? Because nobody is searching for "red dresses?" I think this is a veiled admission, or at least a sentiment, that your navigation is irrelevant. Google wants to be the ultimate arbiter-of-worth and organizer of the world's data. They've never been unclear about that. They want to accelerate the augmentation of unstructured search with faceted search, but they want clean data and can't figure out how to extract it reliably. I was going to write about that next. Am I wrong?
Thanks for your response Jamie. I would only add the additional note that whatever Google's aspirations to be the organizer of the world's data, they neither invented microdata nor are the only ones capable of parsing it.
@Aaron I agree, but they're the ones that will make it actually happen. None of our clients would ever pay us to do it (and again on any redesigns), if Google didn't bless its use at SMX. I'm not going to say nobody cared, but it's suddenly a much more popular item.
@Jamie: Can't disagree with you there! It's the same reason many members of the RDFa community were understandably peeved when the announcement of support for schema.org was made.
My biggest problem with a lot of the semantic web stuff is that it's strictly for search engines, which kind of steps back from "Make content for people." The only one I've liked has been microdata, which, though a little span and div-happy, doesn't add the kind of extra, non-intuitive markup that RDFa, Schema, and other metadata like Opengraph require, instead relying on what most designers use for semantic separation anyway: classes and ids. But I don't think it's a failing of natural-language processing to seek out more semantic markup. It's not enough, at least to me, for say Recipes to have a search engine assume a ul is a list of ingredients and an ol is the steps. Why not make it easier on everyone and semi-mandate "ul id=ingredients" and "ol id=steps" or however the syntax is set? Glad to see, though, that I'm not the only one who's a bit skeptical about Schema and other Semantic Web stuff…