Tuesday, December 27, 2005

Adweek: 20% of Blogs are Spam

According to a blog research company, Adweek reports, about 20% of the reported 80,000 blogs created every day are fake. Sometimes it's hard to tell which 20%.

Friday, December 16, 2005

Next step in text mining

Earlier this week I spoke to a group of knowledge workers, called KM Chicago, about the subject of using text mining and visualization as a way to manage the vast amounts of content that business searchers need to wade through. President Jack Vinson kindly just posted about my presentation.

KM Chicago is "Factiva-friendly" but not necessarily made up exclusively of Factiva customers. I'd like to thank one of the directors of KM Chicago, Ann Lee, for inviting me to speak. Ann is a colleague of ours in Factiva's Chicago office.

Blogging and your Corporate Reputation

Factiva has published two white papers this week about how public relations and marketing professionals use blogs and how companies can contribute to this new global "conversation."

PRWeek UK Looks at the Best and Worst of Corporate Reputation in 2005

PR Week UK's annual piece on corporate reputation, titled: The Reputation Rollercoaster, written by Adam Hill, is a good overview of the year's winners and losers in corporate reputation.

Mr. Hill made use of some data our team in London extracted from Factiva Insight. Specifically he shows how Sainsbury's media coverage and its stock price "show remarkable symmetry" during 2005.

Tuesday, December 13, 2005

Text Mining Becomes Sexy

As if things were boring around here, I think Amazon just shook up the world of information retrieval. Its mostly quiet Web-search division, Alexa, is opening the doors to its huge trove of Web-crawled content, allowing text-mining access to the archive. It would seem to me this is text-mining for the little guys -- an affordable way to build applications without having to host billions of documents.

The company plans to make it available at a low price point so that just about any developer who wants to can "search and process billions of documents -- even create their own search engines -- using Alexa's search and publication tools."

Oh my, GYM just got a wake-up call.

Can little Alexa (with big parent Amazon) do what Google hasn't gotten around to yet or that IBM's WebFountain project has been trying to do for years -- make the Internet one big text-minable database that's easy to use and can produce commercial-grade business information tools? It's too early to tell, but it's all very exciting and should be great watching it unfold.

Tuesday, December 06, 2005

IM2005? Maybe Next Year.

Last week, several of my colleagues from Factiva and I attended the IM2005 awards along with 1,300 of our closest business friends. The intimate affair, held at The Grosvenor House Hotel in London, was one of the many awards ceremonies in the IT industry. If you've never been to such a black-tie industry event, picture the Oscars -- but without the celebrities, or the press, or the sexual energy, intrigue, anticipation, dynamism, humor, well, you get the picture.

The IMmies (I just made that up) didn't have Billy Crystal, but it did have Barry Cryer, who was said to be back by popular demand. Uh, huh. The IM Web site says he's quite the go-to guy in British comedy, having written for "practically every top U.K. comedian". Maybe his humor just doesn't track well to an American's ears, but I'd have to say some of it sounded more like Borscht Belt, circa 1955, than something written for hip dotcom professionals.

And hip it was. When a project called "Mapping Access Land in England" sweeps the night, winning two awards, you know you're with the "in" crowd.

Oh, did our sleek, hip product, Factiva Insight: Reputation Intelligence win in its category "Product of the Year"? No, but we were up against 29 other products (none of which I'd ever heard of). Hats off to Njini. (Who?)

Well, it is an honor just to be nominated. And the spiced confit of Norfolk duck was good. And I'm usually more of a Long Island duck guy. We missed out on the dancing afterward, as we did split right after the awards portion (16 awards in 30 minutes! definitely not the Oscars) and high-tailed it to Factiva's holiday party. We arrived 4 hours into Factiva's 9-hour fete without a trophy but were welcomed back with glasses raised. We're all winners.

Wednesday, November 23, 2005

Wanted: Corporate Blog Writer

So is the idea of a company hiring someone in the role of "corporate blogger" a passing fad or something many well-established companies will be looking to in the coming years? Seems to me the idea is both bizarre and fascinating. Why would a company hire someone and give them supposed free rein to talk about the goings-on behind the cubicles? Ten years ago the idea would have been heresy, but in the fast-paced "new" business world, it seems perfectly logical.

I think this position really works well if the conditions are right: 1) Blogging has to fit the corporate culture (think: a company which embraces business casual 5 days a week). 2) The company has to be innovating in ways that its potential customers find interesting (who wants to read about chewing gum?). 3) The blog does not look, feel or sound like it's written by a committee or by the Marketing Department.

I applaud companies who have tried it as an official position (Stonyfield Farm, GM) or who have allowed it to go on in a more unofficial way (Scoble at Microsoft). I'd like to see more companies take the plunge.

If this is all in place, it sets the company up in a position of thought leadership, drives traffic to its Web site, supports its initiatives and gets the attention of the media. Hey, wait, that sounds like Marketing's job. Maybe so, but at the same time, this could be marketing that works.

Tuesday, November 22, 2005

Score Another one for Joe Q Blogger

Score another one for Joe Q. Blogger working hard to keep another company honest.

This time it was security researcher Mark Russinovich who first reported earlier this month that some of Sony's new music CDs were monkeying around with a Windows "rootkit", helping Sony to prevent copying of the music -- and opening up users' PCs to potential security leaks.

Information Week has a good piece on the news, including this stinging summary:

"Sony made an unpopular product decision and got its reputation incinerated by
waves of flaming bloggers. That's a lesson for other companies."

Sony indeed made a mistake when they tried to brush this off. Why not just take your lumps on day one? "We screwed up, we're pulling the CDs. And we'll make this right." That's the way to go. But during a Morning Edition interview on NPR on Nov. 4, the Sony exec interviewed tried to say that this problem was an esoteric technology thingy that the average person wouldn't care about. "Most people don't even know what a rootkit is so why should they care about it?" he said. That arrogance hasn't played very well. Hundreds of bloggers have ripped Sony for this. Sony's users might not care about the details of a rootkit but they do care about privacy and their computer's security.

You'd think by now a company with the smarts of Sony would see the potential downside of trying to sneak one by. But oh, the bloggers are watching you...

Marti Hearst: Why are you always No. 1?

I have nothing against Marti Hearst. I met with her once a few years back on the campus of UC Berkeley where she teaches in the School of Information Management Systems. And while this document is a good overview of text mining, I can't figure out why Google has ranked it as the No. 1 hit for the search "text mining" for at least the past two years.

Wednesday, November 16, 2005

Google Gives Nod to Taxonomies with Launch of "Base"

For years, Factiva has preached that document-level metadata rolled up into a taxonomy is a vital piece of a content management solution. For years, Google has not embraced metadata for organizing content with such vigor. Today, with the launch of Google Base, however, the Internet startup took a small step in that direction. The new offering includes "tags," which are author-applied metadata.
ZDNet's take | C|Net's take

Data Mining Helps Find Needle in Corn Fields

I always love it when I'm driving to work listening to NPR and something I'm not expecting grabs my attention. In this case, a piece on how data mining is helping catch farm insurance cheats.

Wednesday, November 09, 2005

Even More Text Mining Options with Factiva

Factiva announced another partnership with IBM this week and another way our clients can monitor their media exposure and reputation. This shows me that we weren't kidding when we said our relationship with Big Blue would continue even though both sides decided to part ways around the WebFountain platform a while back.

The new announcement talks about reaching customers who are already IBM shops and who need behind-the-firewall, customized solutions. So in this way, it fits nicely as another piece in Factiva's overall text-mining picture.

Down Under, Corporate Blogs Might be Seen as Liability

Ok, so a few days ago a survey of U.S. CEOs said blogs are useful tools.

But here's an opinion from Down Under (Factiva subscription req'd) in the Australian version of Computerworld saying: "Corporate blogging in Australia has stalled because of a perceived security threat and a belief by employers that an active blogger is a liability." According to Hydrasight analyst John Brand, "less than 5 percent of organizations in Australia actively use blogs as a corporate tool, with some blogs creating an IT security risk," Computerworld reported.

This article doesn't say where that number comes from, but it could be interesting if different parts of the world view corporate blogs differently.

Media Monitor Plus Relaunches

Factiva recently re-released one of its media monitoring products -- Factiva Insight: Media Monitor Plus. The release took the product off the 2B legacy system (which we've been running after Factiva's purchase of 2B Media Intelligence this year) and put it on the Factiva Insight Text Mining Platform.

This allows us to offer a broader content set (blogs, boards, Web and mainstream media from Factiva's archive) along with the speed and intraday update features built by our colleagues from 2B.

I haven't worked directly on MMP to this point as most of my time has been focused on Reputation Intelligence, but I really like this product, too. It's taken several months, but the former 2B products are starting to blend more with the Factiva look and feel, which is great.

I'm also starting to learn more about how clients are viewing it. Some of them see it as we expected -- a tool to more efficiently monitor their media coverage. Others, interestingly, see MMP as a way to help their teams get up to speed on a sector or an industry. For example, say Acme Corp. is a consultancy that works with 10 different industries. It's vital for the Acme consultants working in the auto industry to stay abreast of the news and trends in that industry so they can speak intelligently and understand the issues of their clients. With Media Monitor Plus, Acme can set up 10 dashboards, one for each industry, so each group of consultants can stay on top of the big picture in their industry.

Tuesday, November 08, 2005

This is really how we text mine

The Text Mining Platform explained.

Study: CEOs find blogs useful

Here's a good item supporting the Blogosphere's future:

About 59% of CEOs said blogs are useful for internal comms and 47% see them as useful for external communications, reports CNet on a study by PRWeek and Burson-Marsteller.

Monday, November 07, 2005

Factiva CEO Among Finalists

Many of us are proud to work for a company that gets recognized for its products -- and its people. Our boss, Clare Hart, was just named a finalist for the New York Ten Awards, for significantly impacting New York business innovation through technology.

MSM lagging behind the corporate blog story

The mainstream media is still in "hey look at blogs" mode. I'm disappointed. I'm looking for more insight and anecdotes about how corporations are leveraging blogs, but I'm not finding them. It seems to be the same stories every day about Kryptonite locks, Microsoft's Robert Scoble, etc.

The local newspapers in the U.S. are still slowly rolling out their "what is a blog?" articles. Even the Financial Times felt obliged last week to state "...Weblogs, or blogs, ..." And it had the trite lead: "To blog, or not to blog? That is the question vexing marketing managers ...."

The Wall Street Journal* had a good piece on the value of blogs, but they, too, felt obliged to define them in the lead:

"IT USED TO BE rare for an established, mainstream company to buy an
individual's personal blog. Blogs are frequently updated online journals,
written by pretty much anybody -- professionals, hobbyists or regular Joes
reaching out to share their thoughts, information and photographs with

The New York Times for the most part is hip to the blog story. They don't feel the need to define blogs in every article, when the context makes it obvious.
*Full disclosure, my company, Factiva, is half owned by Dow Jones, the publisher of the WSJ.

Thursday, November 03, 2005

Visualizing Complex Data Relationships

I came across visualcomplexity.com, a very nicely done compilation of data visualization samples, put together by a New Yorker named Manuel Lima, a self-described interaction designer and information architect.

Wednesday, October 26, 2005

Meeting Media Monitoring Users

I've had a chance to meet with some clients over the past few weeks to talk about how they are using media monitoring tools (both ours and those from other companies) and how they're working the metrics they gather into their information-employees' workflows. It's always interesting to get the perspective of those who are buying the tools Factiva and others are touting. It helps focus us on delivering value, not just building tools.

I really feel strongly that market validation is a vital piece of product development. Typically when you talk to users you'll hear feedback that's not unexpected. But what keeps it interesting is that you always get some comment, some new perspective, some advice you weren't expecting.

It seems that for every client, there's a unique use case.

Friday, October 21, 2005

Who Are These Sploggers, Anyway?

An interesting post over at the Intelliseek blog about the growing problem of blog spam, written by Intelliseek employee Robert Stockton, who describes common behaviors and methods of this growing menace.

I'm not sure we need another word to describe them though. "Sploggers"? Ugh. But from a linguistic perspective, it's fascinating how words form so quickly in cyberspace.
web log > weblog > blog > blogging > blogger > spam + blog = splog > sploggers

Tuesday, October 18, 2005

BlogOn: The Oft-Mentioned Long Tail

"The long tail" was mentioned at BlogOn 2005 several times by presenters and overheard in the hallway as well. One of the first to talk about this concept as applied to the Blogosphere was Clay Shirky.

Basically, this is the idea that most of the traffic in the blogosphere is coming from a very small number of authors and a very large number of authors (the long tail) are creating on average a small number of posts each.

This idea is tied to Zipf's law, named after George Kingsley Zipf, a Harvard linguistics professor. Jakob Nielsen also recently wrote about Zipf curves.

However, it was pointed out by presenter David Weinberger that the area under the long tail is larger than the area under the large head, as it were. Which means... what exactly?
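One way to get a feel for Weinberger's point is to run the numbers on an idealized Zipf curve. This is only a rough sketch: the number of bloggers, the cutoff for the "head" and the exponent of 1 are all assumptions of mine, not measurements of the real Blogosphere.

```python
def zipf_shares(num_authors: int, head_size: int, exponent: float = 1.0):
    """Return (head_share, tail_share) of total output under a Zipf curve.

    The author ranked r is assumed to contribute in proportion to 1 / r**exponent.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_authors + 1)]
    total = sum(weights)
    head = sum(weights[:head_size])
    return head / total, (total - head) / total


# Hypothetical numbers: one million bloggers, with the top 100 as the "head".
head_share, tail_share = zipf_shares(num_authors=1_000_000, head_size=100)
print(f"head: {head_share:.0%}, tail: {tail_share:.0%}")
```

Under those assumptions the top 100 authors account for roughly a third of the total and everyone else for the rest, so anyone monitoring only the big names would be missing most of the conversation.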

BlogOn: Podcasting to Text

The question of whether there is value in having podcasts converted to transcripts came up during a panel discussion at the BlogOn 2005 conference. The panelist, Michael Geoghegan, of Willnick Productions, when asked about the value of speech-to-text for podcasts, said he saw no value in it because his podcasts are meant to be heard and the emotion that comes across in his voice would be lost in transcript form.

I think he's missing the point. Once podcasts are transcribed they can be searched and text mined. This adds additional use to the podcasts that otherwise have a limited distribution. Without being able to search or mine podcasts most of their usage is going to come from browsing and category searching. For example, if I search for "Pinot Gris" in a podcast search engine I will likely miss the podcast that mentions "pinot gris" because the podcast description might not mention specific grapes and wines.
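Here's a toy illustration of the difference, assuming a tiny made-up catalog in which each podcast has a short description and a transcript. The `Podcast` class and the `search` function are names I invented for this sketch; no real podcast search engine is being described.

```python
from dataclasses import dataclass


@dataclass
class Podcast:
    title: str
    description: str
    transcript: str


def search(podcasts, query, use_transcripts):
    """Return titles of podcasts whose searchable text contains the query."""
    needle = query.lower()
    hits = []
    for p in podcasts:
        text = p.description
        if use_transcripts:
            text += " " + p.transcript
        if needle in text.lower():
            hits.append(p.title)
    return hits


# Made-up catalog for illustration only.
catalog = [
    Podcast("Wine Talk #12", "We chat about summer whites.",
            "Tonight we open a pinot gris from Oregon and compare it..."),
    Podcast("Grill Hour", "Barbecue tips for the weekend.",
            "Start the coals early and keep the lid closed..."),
]

print(search(catalog, "Pinot Gris", use_transcripts=False))  # []
print(search(catalog, "Pinot Gris", use_transcripts=True))   # ['Wine Talk #12']
```

With descriptions alone the wine podcast is invisible to the query; once the transcript is searchable it turns up.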

BlogOn 2005: A Diverse Attendee List

I'm at day two of BlogOn 2005 in New York City. Presentations and panels have leaned toward discussing how the Blogosphere and the business world are coming together.

The attendees are quite diverse. There seem to be a mix of Blog geeks and newcomers to the space ("what's podcasting?" one attendee asked a panel). One woman I met has never posted to a blog before but was here because her boss instructed her to find out more about the industry.

Many people here seem to be vendors, industry analysts and journalists. And there are a surprising (to me) number of PR, marketing and advertising professionals here. It seems those industries are trying to get up to speed quickly on what this growing internet-based conversation is all about.

I think the conference will have to evolve next year to be more useful. The topic of "blogs" is too vague to support a well-focused show.

Thursday, October 13, 2005

Google and Information Extraction

Google "information extraction" and what's the No. 1 sponsored link?

Work at Google

Google is hiring expert computer scientists and software

I've never really thought about Google being a player in the information extraction sector (aka entity extraction, text analytics, text mining). Sure there's lots of talk about what's the next big thing for them -- free wireless access, video search, indexing the world's libraries. It's fun to think about that stuff.

But when it comes to improving their bread and butter -- search -- mostly I picture their focus being on refining their ranking algorithms and optimizing their crawling strategies. But on your way to being the one-stop shop for all information, I guess it should have been obvious to me that text mining would be a station on that route.

One place we see TM showing up clearly is in Google News, with the "In the News" list of oft-used phrases of the day. I'm sure there are many more examples.
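I have no idea how Google actually builds that list, but the general idea of surfacing oft-used phrases can be sketched in a few lines. Everything below is my own assumption: the headlines are invented, the stopword list is tiny and `top_phrases` is a hypothetical helper, not anything from Google.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "as", "at", "for", "in", "of", "on", "the", "to"}


def top_phrases(headlines, n=2, limit=3):
    """Count n-word phrases across headlines, skipping any that start or end with a stopword."""
    counts = Counter()
    for headline in headlines:
        words = re.findall(r"[a-z']+", headline.lower())
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return counts.most_common(limit)


# Invented headlines, for illustration only.
headlines = [
    "Acme Corp shares jump on merger talk",
    "Merger talk lifts Acme Corp",
    "Fed signals higher interest rates",
    "Interest rates weigh on housing",
    "Acme Corp names new chief",
]
print(top_phrases(headlines))
```

A production system would presumably go much further (entity recognition, comparing today's counts against a baseline), but even simple counting pulls the recurring names to the top.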

Monday, October 10, 2005

Lies, damn lies and text mining statistics

LA Times writer Brendan Buhler took TV newswriters to task Sunday for their overuse of the phrase "get a handle". OK. I hadn't noticed such a growing phenomenon, but no matter.

He proved his claim by running a LexisNexis search on the phrase over five years and said its use is rising every year. ("It was in 3,504 stories in 2004, nearly 700 more than 2000.")

I found this to be creating truth where none exists through a hasty use of text mining. I have two concerns:

1) What was the context of these references? I searched "get a handle" in Factiva and found several mentions of that phrase in an oft-cited direct quote ("Firefighters were able to get a handle on this early on," said Capt. Jason Neuman of the California Department of Forestry and Fire Protection.) Does that make the phrase more common or is it just a function of the phrase being replicated by the distribution of AP wire copy?

2) Did Mr. Buhler account for any changes in the universe of publications and/or documents over that time period? The number of mentions in one year versus another needs to be compared to the total documents in each year. When I ran the "get a handle" search in Factiva's top 50 U.S. Newspapers (a more controlled group) and then compared it to all documents each year in that group, I found the rate of mentions of the phrase rather flat year on year.
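The normalization in point 2 is simple arithmetic, and a small sketch makes it concrete. The counts below are hypothetical round numbers I made up to show the effect; they are NOT the real LexisNexis or Factiva figures.

```python
def mention_rate(mentions: int, total_docs: int, per: int = 10_000) -> float:
    """Mentions per `per` documents, so years with different archive sizes are comparable."""
    return mentions / total_docs * per


# Hypothetical counts, NOT real LexisNexis or Factiva figures.
yearly = {
    2000: {"mentions": 2_800, "total_docs": 4_000_000},
    2004: {"mentions": 3_500, "total_docs": 5_000_000},
}

for year, row in sorted(yearly.items()):
    rate = mention_rate(row["mentions"], row["total_docs"])
    print(year, row["mentions"], f"{rate:.1f} per 10,000 docs")
```

In this made-up case the raw count climbs by 700 while the rate per 10,000 documents doesn't move at all, because the archive grew by the same proportion.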

Ah. Lies, damn lies and statistics.

Saturday, October 08, 2005

Google News: A study in text mining

You've got to wonder how much the large media companies like CNN and BBC who each have their own firm presence on the Web hate Google News. Sure GN is pointing readers back to their sites so they get the eyeballs but it's just as likely that a user will click through to kentucky.com as CNN.com, (though there's evidence of late that GN is weighting the top sources higher now and kentucky.com is less likely to be the lead link) when previously that same reader would have just typed cnn.com into their browser and got their news from one source.

You get the feeling the relationship might be one of smiling through gritted teeth.

Google News is using the power of text mining to leverage the editorial might of many editors and news rooms. CNN, AP, NY Times, FT are all making decisions of which news item is most important and arranging their landing pages accordingly. They're paying human editors to make these value judgments. GN comes along and in the aggregate scoops up all this knowledge (text mining!) and creates a viable competitor for the best news sites out there out of whole cloth.

The irony is that GN needs its news providers for the knowledge of what's most important. So it needs the CNNs of the world to remain successful so it can keep feeding off them. CNN's ad-supported model needs the clicks. Symbiotic? Perhaps, but I think Google is benefiting more.

Tuesday, October 04, 2005

More Forum Follow-Up

Here's perspective from Christopher Kenton, SVP of Strategic Planning at GlobalFluency, on his participation in a panel discussion at Factiva Forum last month. The session was called Blogs and RSS: Friend or Foe, Fad or Future. Also on the panel were Sandy Hamilton, EVP of Sales and Marketing at NewsGator; James Brancheau, Managing VP at Gartner; and David Scott, Author of "Cashing in With Content". Chris's summary: the panel presented a "broad and optimistic view of the power and utility of unstructured and unfiltered content, even in the face of significant challenges."

Friday, September 30, 2005

Blog Readership, RSS Increasing, Forrester Reports

Ten percent of consumers read blogs once a week or more, said Forrester Research at the opening of its annual Consumer Forum. (You and I are still in the minority, but the tent is getting bigger.)

Automated Sentiment Detection

I've been doing some research to understand the state-of-the-art of automated sentiment detection. This natural language processing technology is still pretty young and from what I'm sensing no one's really commercialized anything that delivers high-quality results yet. There are a few players out there (you know who you are) who have products in this area, but it's tricky stuff.

Sarcasm, irony, double negatives all wreak havoc with automated detection.
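To see why, consider about the simplest approach one could take: count positive and negative words and let a negator flip the next sentiment word. This is a deliberately naive sketch with tiny word lists I made up, not how any commercial product works.

```python
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "awful", "hate", "terrible"}
NEGATORS = {"not", "never", "no"}


def naive_sentiment(text: str) -> int:
    """Positive words count +1, negative words -1; a negator flips the next sentiment word."""
    score = 0
    flip = False
    for raw in text.lower().split():
        word = raw.strip(".,!?\"'")
        if word in NEGATORS:
            flip = True
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity:
            score += -polarity if flip else polarity
            flip = False
    return score


print(naive_sentiment("The service was good"))                      # 1
print(naive_sentiment("Oh great, another outage. Just excellent."))  # 2
print(naive_sentiment("I would not say it was not good"))           # -1
```

The first sentence comes out right. The sarcastic one scores as glowing praise, and the double negative scores as negative even though a human would read it as grudgingly positive.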

Much of this work is still at the university level and the papers published in the area focus on trying to detect the opinions of authors in movie reviews, hotel ratings, etc. The recent interest in monitoring blogs is spurring more discussion in the commercial space.

So it's not too surprising that the industry hasn't settled on terminology yet either. I've seen a host of words being used to describe this process of assigning a positive or negative score to an article -- tone, tonality, polarity, affect, sentiment, favorability (or favourability, if you're in the UK), opinion, mood. I generally use the term sentiment because it's had the most pickup.

There are also different types of sentiment assignments. We can talk about it from the perspective of the author or the perspective of the consumer of the information. For example, a hurricane can be written about as a negative event. However, to the construction industry it's a positive event because it means the beginning of a rebuilding boom. It's not clear to me which terms should be used to describe these different perspectives. Is "sentiment" the view of the author and "favorability" the view of the reader? Not sure.

Tuesday, September 27, 2005

European Blogosphere

Well-known French blogger Loic Le Meur has on his wiki an attempt to attach some metrics to the oft-disputed size of the Blogosphere. He's focusing on the European Blogosphere, gathering numbers per country, listing top blog hosting sites and key bloggers per country. This is a very important resource as we try to get our hands around the growth of this social phenomenon around the world.

Monday, September 26, 2005

Causes of Content Chaos: Everyone's a Publisher

I talked at Factiva Forum about a few causes of the content chaos we all find ourselves awash in: (Everyone's a Publisher, "Markets are Conversations", The continued movement toward more dynamic news cycles).

I think those of us out here in the Blogosphere are pretty comfortable with the idea that the growth of blogs has created a world where "everyone's a publisher." I'm not a journalist but I can publish my thoughts just as my colleagues upstairs at The Wall Street Journal can. (Factiva is owned by Dow Jones and Reuters and we share some real estate with our parents.) Certainly, not nearly as many people are going to read what I have to say about technology as, say, Walt Mossberg, but in the collective bloggers are becoming a force. Our opinions are being read by others and in the 500-channel world each channel takes up equal space on the dial.

Studies have shown (that's a great phrase when you don't have real metrics) that people are more likely to trust their peers for an opinion than someone in authority (government, corporations, the media.)

So as bloggers become more of a driver of opinion and more of us become bloggers, corporations and governments had better keep watch.

Blog Numbers Likely Inflated by Spam

We keep hearing about the size of the Blogosphere increasing like crazy. Technorati talks about it doubling every five months. However, what's not being said by all the bloggers who are blogging about blogging is that a growing number of these blogs are spam.

We've heard from a company which processes a large number of blog posts that about a quarter of the posts they see are spam.

Have you ever clicked the "next blog" link at the top of this page (and most other Blogger blogs)? How many of the blogs you come to are for online casinos or inkjet cartridges? It's really quite amazing how much spam has moved into the Blogosphere in the past few months (my observation, not based on any particular metrics).

I think it would be great if the main players in the industry -- GYM, Technorati, Intelliseek, Six Apart, etc. -- put their virtual heads together to find ways to systematically slow down the wave of spam before we're up to our eyeballs in it. Blogger's flagging (to allow individuals to notify Blogger about objectionable content) seems like a good idea. But it remains to be seen how well it works.

I Think Factiva Matters, Too

Wonderful to read today that Barry Graubart, EVP & Chief Marketing Officer for Leadership Directories, has added Factiva to his list of the 50 content companies that matter -- joining the likes of Google, Yahoo! and Wikipedia.

I must agree, (but that's why I come to work every day).

Friday, September 23, 2005

Seth Godin

The keynote speaker at Forum was Seth Godin, an author of several books on the "new" marketing, including Purple Cow. He spoke about one of the main reasons people buy -- because it makes them feel good. This isn't revolutionary. We all know Hummers are expensive because people are willing to pay a lot to show off their status. And people buy iPods not because they make your digital music sound better, but because they're so darn cool (which they are).

So, it's hard to disagree with him and he's very compelling because he's such a wonderful public speaker. But I think a bit of what he said is overstated. I'm not sure B2B is the same as B2C in this regard, though he said he thinks it is.

Sure, no one ever got fired for buying IBM. And sure, your CEO will listen to you more if you have a consulting report from McKinsey & Co. than from Mike McKinsey LLC (even if both recommend the same thing) so I understand his point there. But I still think that emotional buying is MOSTLY in the realm of consumer products, not multimillion dollar server farms or jet engine parts.

Factiva Forum Part 2

In my Content Chaos presentation at Factiva Forum, I talked about what I see as the causes of the swirling mess of content we find ourselves in. I agree with something Bill Gates said recently: we're no longer faced with the long-discussed problem of "information overload" (having too much information that's too difficult to manage) but we're really in a position of not having an easy way to get to the RIGHT information when we need it. Gates goes so far as to say we don't have enough information.

GYM (Google, Yahoo! and Microsoft) has gone a long way in providing us with pretty relevant documents on the first page, but better Web search is only part of the answer. There is plenty of information buried on pages 2 through N that we never see. We have to find ways to get that information into the path of our research.

So companies in the information industry are trying to help their clients get to the answers, not provide them with documents. I mean, no one is really running a search so they can find documents. They're running a search, so they can find answers. Moving technology forward so it can get people closer to the answers is our focus.

We see text mining as a big part of that solution.

I also talked about a few causes of this content chaos.
•Blogs Mean Everyone’s a Publisher
•‘Markets are Conversations’
•More dynamic news cycles

I'll write more on this soon.

A Funny Thing Happened on the Way to Factiva Forum

(Actually, nothing particularly funny happened, but I really wanted to use that pun.)

I was honored to be asked to speak yesterday at Factiva Forum, an executive conference event held high over Times Square, in the Reuters building. This year's NY version of Forum focused on the ever-quickening pace of news and business information and focused on some of the drivers of it -- the growth of business-oriented blogs, RSS feeds, etc.

I spoke on a panel discussing reputation management. I was asked to talk about one of Factiva's products, Factiva Insight: Reputation Intelligence.

I've been involved in the development of this product over the past few years (starting with our relationship with IBM's WebFountain and through our purchase of a company called 2B) and so I'm a bit biased when I say Factiva's is one of the most comprehensive approaches available today for companies to follow their reputational issues -- how they're being covered in the MSM, what people are saying about them on blogs and boards and the differences between the two.

Also on the panel with me were John Neeson, the co-founder of Sirius Decisions and Judi Frost Mackey, the Director of U.S. Corporate & Financial Practice at Hill & Knowlton.

John talked about the results of some studies they've been conducting in this space. Judi presented a real-world example of how a client of theirs (some retailer from Bentonville, Arkansas, I believe) has attempted to revive their sagging public image through a recent media blitz.

I also presented on the subject of "Content Chaos" (see my next post.)

Thursday, September 15, 2005

Text Mining is a Service with Many Uses

We've been talking around Factiva about how text mining (i.e. extracting meaning from unstructured text) can benefit varied business use-cases. There are lots of great ideas swirling about and it's not entirely clear where they will land. But this much is clear to me: text mining has to be part of the future of information retrieval.

But text mining is a service or a process, not a product, so we aren't marketing "Factiva Text Mining" but we are going to look for ways to fold it into products.

Factiva has developed text mining capabilities and built products for media measuring and reputation management. But searching and alerting can benefit too from the philosophy of text mining.

(An aside: Wikipedia needs a better entry for text mining. Volunteers?)

Corporate Blogging Survey

For those interested in corporate blogging, the results of a survey have been posted by a Boston-based, internet-marketing company called Backbone Media. Here's some of what it found.

"The survey respondents indicated that they believe there is a broad array of benefits to starting a blog including: quick publishing, thought leadership, building community, sales and online PR."

"The biggest concern about starting a blog was the time needed to devote to the blog; the next concern was legal liability. A slight majority of bloggers took less than 1-2 months to start their blog after initial management review. ... Once they started, bloggers saw immediate results from publishing content & ideas quickly. Search engine rankings & links results appeared before sales. Overall, thought leadership and idea sharing were the biggest benefits for bloggers."

Wednesday, September 14, 2005

Blogsearch from Google -- Finally

So, Google's new Blog Search beta is pretty good. I've seen some spam on the first page of results, but not too bad. And the "related blogs" are not quite there yet, though that will, no doubt, improve.

But the surprising thing about Google's blog search, to me, is that it's taken them so long to launch it. Those who like to search blogs have been waiting for it, but Google, until now, has offered no real way to search them because a blog's update cycle is so much shorter than general search engines are tuned for.

For now, Technorati and Intelliseek still have an edge in blog search, but they've got to be looking in their rear-view mirror at the speeding bullet coming up behind them wondering how they're going to keep that lead. Innovate and specialize, guys!

Good to see Dave Sifry, Technorati CEO, is taking a "bring it on" attitude.

(BTW, should I read anything into the fact that Blogger's spell-check suggestion for "technorati" is "degenerate"?)

Tuesday, September 13, 2005

Factiva Forum

I will be speaking at the next Factiva Forum, Sept. 22, in New York on what the information professionals need to know about text mining and visualization. It will be a primer but will also go into how these technologies may be impacting their roles in the enterprise. I'll post more as we get closer to the date.

Monday, September 12, 2005

Scoble on Corporate Blogging

Here's an interesting interview with Microsoft's Robert Scoble on corporate blogging. I came across it when I was researching growth of French blogs. He was interviewed by Loic Le Meur, who runs a pretty well-read French blog. This is his first interview in English.

Scoble talks about why blogging is very important for those in product development and design -- including how it allows clients to interact directly with product managers, not indirectly with customer service. He also says that it's vital for companies to be involved in the conversation and to be monitoring it, because otherwise the story's going to pass you by.

Scoble might be seen as one of the most influential people at Microsoft (if not one of the 10 most well known). Yet he comments in this interview that he's seven levels down from Bill Gates in the corporate hierarchy. Now that's a statement of how empowering blogs can be.

Business Blogs as Next Growth Area

Business blogs -- whether published by an individual talking about her profession or a company talking about its products -- continue to grow in importance as more people start thinking about blogs for things other than politics, culture and journals.

Publishing has always been a way for academics to distinguish themselves among their colleagues. Maybe now we're seeing an easy way for the business world to do something similar. Of course there are many differences between the two. In academia, your reputation is probably more closely tied to your published works. Your career might depend on whether your papers are seen as well researched or hogwash. Blogs are going to foster the publishing of quick commentary, not the reasoned research. But nonetheless, blogs do offer the ability for those in business to establish themselves as thought leaders.

David Scott talks about how the corporate blog is emerging in 2005 as a growth area. I see corporate blogs currently as a small segment of business blogs. Corporate blogs are likely extensions of the PR department or the CEO's office. They can be useful ways for a company to get its messages out in a folksy manner. But they can also be seen as shameless shilling. A corporate blog will work well if it addresses some esoteric interests of a company's products or if it furthers a certain image the company is trying to portray, like Stonyfield Farms.

But beware -- bloggers and (likely) readers of the Blogosphere are savvy. A marketing web site that tries to masquerade as a corporate blog will turn people off quickly. Remember the McDonald's Lincoln Fry blog? It was a Super Bowl ad campaign that McDonald's tried to foster with a fake blog purported to be written by someone who found a french fry in the shape of Abe Lincoln. Uh huh. McDonald's said the blog helped the campaign last a little longer in the minds of the public. I doubt it did. I think it just made the company lose cred in the Blogosphere. Did they really think they were going to pull some sort of Blair Witch?

Thursday, September 08, 2005

Travel Boards Should Monitor the Blogosphere

I got a call last week at home from a marketing company doing research. My guess, after going through the questions, is that it was commissioned by the Maine Department of Travel and Tourism and that they are working on a new slogan to attract people to vacation in Maine. They asked me to describe Maine and the other states in the area with adjectives. Then they read various slogans to me to get my reaction. It seems they were looking for keywords that would attract people to Maine (rocky coast, forests, kayaking, stargazing? etc.)

It struck me that travel boards would benefit from tools that can monitor the Blogosphere to see what words people are using to describe their vacations.

(It also strikes me that Maine does need a new slogan. "It must be Maine"? What the heck does that mean?)

Spam, Spam, Spam

Well, you know the Blogosphere has established itself as something more than a playtoy for geeks (not that you needed convincing). Spam is here.

Spam blogs and fake blogs are starting to spread as rogue merchants try to boost their rankings in traditional search engines and in blog search engines. By creating lots of blogs that link to each other, they are trying to make their blogs seem influential.

Nothing new here. They're using the exact same tactics they used when they created farms of fake Web sites. And, let's face it, blogs are just Web sites.

What this means is that the leading aggregators in the Blogosphere -- Intelliseek, Technorati and others -- are starting to put measures in place to spot the fakes. Dave Sifry of Technorati wrote an oft-cited post.
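For the curious, here is a rough Python sketch of one way a link farm might be spotted. To be clear, this is my own guess at a heuristic, not anything Technorati or Intelliseek has described: the function name, the sample blogs and the thresholds are all made up for illustration. The idea is that in a farm nearly every inbound link is returned, while a real blog gets plenty of links it never links back to.

```python
from collections import defaultdict

def suspected_link_farm(links, min_inbound=3, threshold=0.9):
    """Flag blogs whose inbound links are almost all reciprocated.

    `links` maps each blog to the set of blogs it links to.
    The thresholds are illustrative guesses, not tuned values.
    """
    # Invert the link map: who links to each blog?
    inbound = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            if target != source:
                inbound[target].add(source)

    suspects = set()
    for blog, sources in inbound.items():
        if len(sources) < min_inbound:
            continue  # too few links to judge either way
        # Inbound links that the blog itself links back to
        mutual = sources & links.get(blog, set())
        if len(mutual) / len(sources) >= threshold:
            suspects.add(blog)
    return suspects

# Hypothetical data: four blogs that all link to each other and to a shop,
# plus a pundit with ordinary one-way readers.
links = {
    "spam-a": {"spam-b", "spam-c", "spam-d", "shop"},
    "spam-b": {"spam-a", "spam-c", "spam-d", "shop"},
    "spam-c": {"spam-a", "spam-b", "spam-d", "shop"},
    "spam-d": {"spam-a", "spam-b", "spam-c", "shop"},
    "pundit": {"news", "diary"},
    "diary": {"pundit"},
    "reader1": {"pundit"},
    "reader2": {"pundit"},
}

print(sorted(suspected_link_farm(links)))
```

Note that the shop the farm is promoting is not flagged here, since it never links back. A real system would presumably combine several signals rather than rely on one.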

As the aggregators start to catch the fakes, the fakers will try to out-fake them. The cat-and-mouse game begins.

Factiva Insight shows new discussions about Apple phone

I found out about big news afoot at Apple last week after I logged into the Factiva Insight: Reputation Intelligence product. The system discovered new words popping up around Apple -- such phrases as "cingular wireless, special event, mobile phone" -- as rumors about a coming iPod phone swept the Web. This is a good example of how the product can be used to find new discussions taking place.
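To give a feel for the general idea (and only the general idea; this is not how Factiva Insight works internally), here is a small Python sketch that compares two-word phrases in a recent batch of posts against an earlier baseline and reports the ones that are suddenly common. The sample posts, the stopword list, the function names and the thresholds are all invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "the", "of", "at", "with", "may"}

def phrases(text):
    """Yield adjacent word pairs, skipping any pair containing a stopword."""
    words = re.findall(r"[a-z']+", text.lower())
    for first, second in zip(words, words[1:]):
        if first not in STOPWORDS and second not in STOPWORDS:
            yield f"{first} {second}"

def count_phrases(docs):
    counts = Counter()
    for doc in docs:
        counts.update(phrases(doc))
    return counts

def emerging_phrases(baseline_docs, current_docs, min_count=2, min_lift=2.0):
    """Return (phrase, count) pairs that are much more common now than before."""
    before = count_phrases(baseline_docs)
    now = count_phrases(current_docs)
    rising = []
    for phrase, count in now.items():
        lift = count / (before[phrase] + 1)  # +1 avoids dividing by zero
        if count >= min_count and lift >= min_lift:
            rising.append((phrase, count))
    # Most frequent first, ties broken alphabetically
    return sorted(rising, key=lambda item: (-item[1], item[0]))

baseline = [
    "Apple ships a new iPod",
    "Apple iPod sales climb again",
]
current = [
    "Apple plans a special event with Cingular Wireless",
    "Rumors of a mobile phone at the Apple special event",
    "Cingular Wireless may carry the Apple mobile phone",
]

for phrase, count in emerging_phrases(baseline, current):
    print(phrase, count)
```

On this toy data the three phrases that surface are "cingular wireless", "mobile phone" and "special event", each seen twice in the newer posts and never before.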

Thursday, May 26, 2005

The Genre of Blogs

I've been spending some time studying the genres of blogs. Specifically, I've been looking at the most popular, most linked to, most prolific blogs. You can see lists of them on Technorati and Intelliseek.

I've found, in my un-scientific study, that most of these "top" blogs are what I'd call pundits (Instapundit, Eschaton, and Daily Dish for example). Often they are focused on politics -- American politics to be specific.

There are a few traditional journalists at the top of the list. And there is a smattering of humor blogs, consumer reviews, IT and music blogs.

But only two or three out of 100 are the classic blog genre -- diaries or journals. The most popular of these right now are Dooce and Baghdad Burning. The former is probably popular because it's a bit quirky and well written. The latter because it's a compelling look at the author's daily life in the center of a war.

However, journals are what most people create when they first create a blog. They talk about their lives. Not very interesting stuff to the average reader, unless the author is a very compelling writer or is living a very interesting life.

Monday, May 23, 2005

Feeling a Bit Alone

Maybe I'm not looking in the right places, but I've found that there are not too many blogs out there that are talking about text mining and unstructured data visualization.

Part of me thinks this has more to do with the quality of searching across blogs than it does with the availability of the content.

Technorati and Intelliseek's Blogpulse offer pretty comprehensive searchable indices of the Blogosphere, but I never walk away feeling the most relevant blogs are coming to the top. Bloglines is talking about a new offering in this space. But there is no Google of blogs yet.

Hmmm. Why isn't Google the Google of blogs? You certainly get blog hits back with your regular search, but blogs are real-time discussions and a search engine like Google is mixing them in with all the other sites it's updating. Not sure what the conventional wisdom is right now on how frequently Google reindexes "the entire" Web. Certainly some pages are intraday, but most are probably closer to weekly or monthly. Too slow to work for Blogs.

I'm keeping my eyes on Google Labs. Inevitably, there will be a blog search coming along. Won't there?

Friday, May 20, 2005

Text Mining Summit

I'm looking forward to attending the Text Mining Summit in Boston http://www.textminingnews.com/ on June 7 and 8. Most of the heavy hitters involved in text mining, like Factiva, will be there (oh and Cingular, Hewlett Packard, Pfizer, EDS, AOL, Abbott Laboratories, In-Q-Tel, IBM, Oracle, ClearForest, Inxight, Cognos, SPSS and SAS) talking about their approaches. Should be a great two days.

Extracting Sentiment

There have been many attempts in the industry to find a way to automatically extract the sentiment (tonality) from unstructured text. I think there will continue to be progress in this area as more and more businesses are trying to find out not just how much is being written about them, but what the tone of the writing is.

Our friends at Intelliseek have an approach, http://www.intelliseek.com/technology.asp#sentimentmining, and so do many others.

The challenges of automatically extracting sentiment are many. It is very difficult for computers to extract meaning from running text. Nuances of language, wit, sarcasm, irony are virtually impossible to detect. Even double negatives can make NLP software confused.
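Here's a toy Python illustration of the problem. It's a deliberately naive word-list scorer of my own invention (the word lists and the function name are made up, and no commercial product works this simply), but it shows how a rule that handles "not good" just fine still trips over a double negative and over sarcasm.

```python
# Tiny hand-made word lists, purely for illustration.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}
NEGATORS = {"not", "never", "no"}

def naive_sentiment(text):
    """Add +1 per positive word and -1 per negative word.

    A negator flips the sign of the word that immediately follows it.
    That is the only bit of 'grammar' this scorer knows.
    """
    score = 0
    flip = False
    for word in text.lower().replace(",", " ").replace(".", " ").split():
        if word in NEGATORS:
            flip = True
            continue
        value = (word in POSITIVE) - (word in NEGATIVE)
        if value:
            score += -value if flip else value
        flip = False  # the negator only reaches one word ahead
    return score

# Single negation works as hoped.
print(naive_sentiment("The service was good"))       # scores positive
print(naive_sentiment("The service was not good"))   # scores negative

# A double negative: a person reads this as lukewarm praise,
# but the scorer only sees the second "not" next to "good".
print(naive_sentiment("I would not say it was not good"))

# Sarcasm: clearly a complaint, yet "great" earns it a positive score.
print(naive_sentiment("Oh great, another recall."))
```

You can keep adding rules to patch each case, but language keeps finding new ways around them.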

Add to that the issue that sentiment is in the eye of the beholder. What's good for one company often by definition is bad for its competitors. Is the answer to extract the sentiment of the author? Perhaps.

All of this leads me to believe the only way you can build a commercial software package to extract sentiment is by integrating a human in the loop -- either as a final check or as a trainer of the system.

I'm not quite sure of the details of how that should be done, but I'm pretty certain that the answer lies in there somewhere.
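As a thought experiment only, here is one shape the "final check" flavor might take, sketched in Python. It assumes the software attaches a confidence score to each judgment, which is my assumption rather than a description of any real product; the function names, the 0.8 cut-off and the sample items are all hypothetical. Confident calls go straight through, shaky ones go to a person, and the person's corrections are saved as future training examples.

```python
def route(scored_items, threshold=0.8):
    """Split machine-labeled items into auto-accepted and needs-review piles.

    Each item is (text, label, confidence); the threshold is a made-up value.
    """
    accepted, review_queue = [], []
    for text, label, confidence in scored_items:
        if confidence >= threshold:
            accepted.append((text, label))
        else:
            review_queue.append((text, label))
    return accepted, review_queue

def apply_review(review_queue, human_labels):
    """Replace machine labels with a reviewer's labels where one was given.

    Returns the final labels plus the corrections, which could later be
    fed back to the system as training examples.
    """
    final, corrections = [], []
    for text, machine_label in review_queue:
        label = human_labels.get(text, machine_label)
        final.append((text, label))
        if label != machine_label:
            corrections.append((text, label))
    return final, corrections

# Hypothetical machine output: (text, label, confidence)
scored = [
    ("Sales rose sharply this quarter", "positive", 0.95),
    ("Oh great, another recall", "positive", 0.55),
    ("Not a bad year, all told", "negative", 0.60),
]

accepted, queue = route(scored)
reviewed, corrections = apply_review(queue, {
    "Oh great, another recall": "negative",
    "Not a bad year, all told": "positive",
})
print(len(accepted), "auto-accepted;", len(corrections), "corrected by a person")
```

The interesting design question is where to set that threshold: too high and the human reads everything, too low and the sarcasm slips through.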

Thursday, May 19, 2005

Wendy's Reputation Struggle

When a Wendy's customer said she found a piece of a human finger in the fast food chain's chili, the company reacted quickly, conducting an internal investigation. It quickly proved to itself it hadn't done anything wrong and offered a reward to find the perpetrator of what we now know was a fraud.

While the company handled the situation as well as it could have and is recovering, it will take a while for its sales to return to where they were.

Even though I know there never was a finger in the chili at Wendy's, the image will stick in my head for a while -- and that might make me (at a subconscious level) avoid eating there. And that's why corporate reputation is so delicate. Wendy's did nothing wrong -- and everything right in the aftermath, but is still suffering. Not fair, but the hard truth.

A well-stated commentary -- See the article in Factiva (subscription req'd) -- by Jack Schuessler, CEO, Wendy's appears in the Wall Street Journal this week. He stated: "The disturbing truth for everyone in the business community is that a devastating fraud can be perpetrated by a single individual. And the ramifications to a company's reputation are frightening."

This is another example of why it's vital for companies to be ever on the lookout for incidents. The faster they react (as long as they react in a way that shows they have nothing to hide) the better it will be for them in the end.

Software tools that allow monitoring of public discussions in the blogosphere and the local media are vital in the process.

Wednesday, May 18, 2005

Blogosphere Timeline

Just something I put together, inspired by Gartner's approach to analysing new technologies, to try to see where blogs are on the continuum of growth and acceptance. Using this model, there will likely be a drop-off in popularity in the next couple of years, followed by firm acceptance as part of the mainstream communication technology landscape.

A few good quotes:

“Blogs are the best thing that's ever happened to journalism. Or they're going to kill it. One or the other.”

-- San Jose Mercury News, April 18, 2005

“…you cannot afford to close your eyes to [blogs], because they’re simply the most explosive outbreak in the information world since the Internet itself.”

-- Business Week, May 2, 2005

Monitoring Blogs

With the volume of unstructured content continuing to grow, text mining becomes an obvious way for companies to manage the information being generated about their subjects and their issues.

The growth of the Blogosphere alone creates a new stream of data that many companies are naively ignoring (partially because they're not sure how to monitor it). With stakeholders and journalists reading and writing about your company, it seems to me that you would be foolish not to monitor what's being said.

Monitoring blogs in a comprehensive way allows companies to find dramatic changes in their landscapes.

Monday, January 17, 2005

Blogs and Categorization

I find it very interesting that the blog world is looking at categorization as a way to start putting some order around this growing corpus of data.
Technorati: Tags

Companies like Factiva discovered more than a decade ago that comprehensive, universal metadata on content is vital for organizing and managing large bodies of content.

It seems that the blogosphere is no different than any other corpus of unstructured data in this way. It will be fascinating to see it grow as the Web did.

Remember when Web directories were all the rage in the mid to late 90s because this growing thing (the Internet) needed some order put around it? -- I was involved in Dow Jones's take on that (Dow Jones Business Directory). We did it because we knew our clients were looking at the Internet and needed advice as to how best to use it.

Now, directories have become largely obsolete because Google has made searching for information much more efficient than Yahoo and Excite and the others were in the mid 90s.

It seems obvious to me that companies will be stepping up now to put more order around the Blogosphere. Categorization is one of those things.