<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Rittman Mead Consulting &#187; Data Mining</title>
	<atom:link href="http://www.rittmanmead.com/category/data-mining/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.rittmanmead.com</link>
	<description>Delivering Oracle Business Intelligence</description>
	<lastBuildDate>Mon, 06 Feb 2012 21:18:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Oracle Sales Prospector</title>
		<link>http://www.rittmanmead.com/2008/07/oracle-sales-prospector/</link>
		<comments>http://www.rittmanmead.com/2008/07/oracle-sales-prospector/#comments</comments>
		<pubDate>Mon, 28 Jul 2008 20:53:44 +0000</pubDate>
		<dc:creator>Mark Rittman</dc:creator>
				<category><![CDATA[Data Mining]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2008/07/28/oracle-sales-prospector/</guid>
		<description><![CDATA[I&#8217;m just sitting in my hotel room in Galway, listening to the Galway Races going on down the road and working through an online demo of Oracle&#8217;s new Sales Prospector application. It&#8217;s actually quite an interesting application of Oracle&#8217;s in-database data mining technology, something that&#8217;s relevant to me as I&#8217;m putting together a section on [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m just sitting in my hotel room in Galway, listening to the <a href="http://en.wikipedia.org/wiki/Galway_Races">Galway Races</a> going on down the road and working through an online demo of Oracle&#8217;s new <a href="http://sales.oracle.com/en-us/">Sales Prospector</a> application. It&#8217;s actually quite an interesting application of Oracle&#8217;s in-database data mining technology, something that&#8217;s relevant to me as I&#8217;m putting together a section on Oracle Data Mining for the <a href="http://www.rittmanmead.com/2008/07/15/oracle-11g-data-warehousing-seminar-denmark/">DW seminar I&#8217;m running in Denmark</a> in September. What Sales Prospector does is take several of Oracle&#8217;s data mining algorithms and apply them to the job of identifying which products a set of customers might be interested in.</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp-content/uploads/2008/07/sales-prospector-1.jpg" height="444" width="500" border="0" hspace="4" vspace="4" alt="Sales Prospector 1" /></p>
<p>The application itself is one of the first Fusion applications and I initially saw a demo of it at Open World last year. I think it&#8217;s actually available for licensing now and is typically used on a hosted basis, with customers&#8217; sales people uploading data to the application and the back-end system then analyzing it. Where Oracle Data Mining comes in is that it uses its algorithms to take each customer and predict which (if any) products each of them is likely to buy, generating predicted income figures and time to close based on its calculations. It then clusters the customers and their predicted product interest and value and places this on a Flash-based bubble-graph for users to interact with.</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp-content/uploads/2008/07/sales-prospector-2.jpg" height="633" width="494" border="0" hspace="4" vspace="4" alt="Sales Prospector 2" /></p>
<p>The other thing that the application does with the ODM clustering feature is to work out which other customer might provide a reference for the product that&#8217;s being predicted, based on similar attributes and predicted needs. All of this is put into a single Flash-based application that&#8217;s aimed at salespeople and account managers.</p>
<p style="text-align:center;"><img src="http://www.rittmanmead.com/wp-content/uploads/2008/07/sales-prospector-3.jpg" height="633" width="406" border="0" hspace="4" vspace="4" alt="Sales Prospector 3" /></p>
<p>Obviously this is a fairly sales-focused application but it&#8217;s an interesting application of the Oracle Data Mining technologies. My understanding is that ODM is also going to be used at Oracle Open World to recommend sessions for attendees based on their profile and existing selection of sessions; as I&#8217;ll be over there, it&#8217;ll be interesting to see how effective this is.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2008/07/oracle-sales-prospector/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Looking at the unstructured</title>
		<link>http://www.rittmanmead.com/2007/10/looking-at-the-unstructured/</link>
		<comments>http://www.rittmanmead.com/2007/10/looking-at-the-unstructured/#comments</comments>
		<pubDate>Sat, 06 Oct 2007 21:43:30 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[BI (General)]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/10/06/looking-at-the-unstructured/</guid>
		<description><![CDATA[Curt Monash has been talking about text mining a lot this week. He also notes that, from a text point of view, the four preeminent database vendors (data store not query tool) are Oracle, Microsoft, Teradata and Netezza; this seems to reflect my experiences of what is going on in the BI space as [...]]]></description>
			<content:encoded><![CDATA[<p>Curt Monash has been talking about <a href="http://www.dbms2.com/2007/10/05/the-four-horsemen-of-data-warehousing/" target="_blank">text mining</a> a lot this week. He also <a href="http://www.dbms2.com/2007/10/05/the-four-horsemen-of-data-warehousing/" target="_blank">notes</a> that, from a text point of view, the four preeminent database vendors (data store not query tool) are Oracle, Microsoft, Teradata and Netezza; this seems to reflect my experiences of what is going on in the BI space as a whole.</p>
<p>I first started out in the intelligence space (deliberately omitting the word <em>business</em>) working on high performance text stores. In reality this was all about storing streams of text in a highly searchable form; in effect the text was held as an index. Here I came across such then-esoteric concepts as tokenisation, tagging, synonym dictionaries, and stop-words; however, I did not use stop-words as every word stored (even if misspelt) was potentially significant. Since those days analysis of free-text has become more mainstream. Linguistics academics use it to unravel language semantics (my wife&#8217;s dissertation was on computational linguistics). Others I know have worked on voice to text systems capable of identifying speakers and what they say. But although text is the most accessible (or understandable) form of unstructured data, increasingly people need to search binary data, whether it be biometrics, images, DNA sequences, spatial data or whatever.</p>
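<p>To make the idea of holding text as an index a little more concrete, here is a minimal sketch of an inverted index with tokenisation and a synonym dictionary, and deliberately no stop-word list. It is only an illustration, not the system described above: the function names, the sample documents and the one-entry synonym dictionary are all assumptions made up for the example.</p>

```python
import re
from collections import defaultdict

# Hypothetical synonym dictionary: maps a variant to a canonical token.
SYNONYMS = {"cookie": "biscuit"}

def tokenise(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(documents):
    """Map each canonical token to the set of document ids containing it.

    No stop-word list is applied, so every word (even a misspelt one)
    stays searchable.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenise(text):
            index[SYNONYMS.get(token, token)].add(doc_id)
    return index

def search(index, word):
    """Return the sorted ids of documents that contain the word or a synonym."""
    token = word.lower()
    return sorted(index.get(SYNONYMS.get(token, token), set()))

# Hypothetical sample documents; document 3 contains a misspelling.
documents = {
    1: "The cookie was left on the table",
    2: "A biscuit and a cup of tea",
    3: "Teh report was late",
}
index = build_index(documents)
print(search(index, "biscuit"))
print(search(index, "teh"))
```

<p>Because nothing is discarded as a stop-word, the misspelt &#8220;teh&#8221; can still be found, and the synonym mapping is applied both when indexing and when searching so that either spelling reaches the same entry.</p>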
<p>Finding single records that contain text fragments, topics or DNA patterns can be an interesting indexing challenge, but when we start to move into the space where we are looking for patterns that occur in multiple records we start to move well away from the normal data mining domain. Determining that a particular DNA sequence correlates with increased risk of disease is a complex task; there is little reason (apart from the enormity of the computational task) that DNA samples could not be used to predict facial appearance. Computational problems exist in text analysis, especially where rich vocabularies make semantic matching difficult &#8211; so how do you get cookie to match biscuit in some contexts and biscuit to match brown in others?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/10/looking-at-the-unstructured/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Data-centric BI trends</title>
		<link>http://www.rittmanmead.com/2007/10/data-centric-bi-trends/</link>
		<comments>http://www.rittmanmead.com/2007/10/data-centric-bi-trends/#comments</comments>
		<pubDate>Sat, 06 Oct 2007 14:59:09 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[BI (General)]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Data Warehousing]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/10/06/data-centric-bi-trends/</guid>
		<description><![CDATA[I managed to miss Mark Rittman&#8217;s recent keynote addresses in two locations this week, but I did take a look at the slide deck and the blog post. Mark is making five very sound points on the Oracle BI stack. As most regular readers of this blog might expect, I am probably more data driven [...]]]></description>
			<content:encoded><![CDATA[<p>I managed to miss Mark Rittman&#8217;s recent keynote addresses in two locations this week, but I did take a look at the slide deck and the <a href="http://www.rittmanmead.com/2007/10/05/five-oracle-bi-trends-for-the-future/" target="_blank">blog</a> post. Mark is making five very sound points on the Oracle BI stack. As most regular readers of this blog might expect, I am probably more data driven than Mark (not that Mark isn&#8217;t focused on data, it&#8217;s just that we have skills that complement each other&#8217;s); I am truly happy modelling data or making a physical database layout fly. So this piece is a slightly data-centric take on current BI trends.</p>
<h4>Why is there a &#8216;B&#8217; in BI?</h4>
<p>Well there needn&#8217;t be. A lot of what we do is for businesses, so the <em><strong>B</strong></em> makes some sense, but other organisations that are not <em>businesses</em> often use exactly the same techniques and tools to examine their data. The <em><strong>B</strong></em> could be some form of generic term to qualify the word <em>intelligence</em>, perhaps in a way to distinguish what we are doing from <em>Military</em> Intelligence, <em>Criminal</em> Intelligence, <em>Market</em> Intelligence or any other such qualifier. But this is a somewhat bogus division; BI (or any of the other intelligences) is not just a single discipline, and definitely not a discipline unique to a single division.</p>
<h4>BI is about finding information</h4>
<p>Storing information in a database is without point if you do not have a way to access it again, and access it in a form that suits your business purpose. But the way we need to get at that information varies with purpose:</p>
<ul>
<li>some people have a need to <em><strong>report</strong></em> on the historical; this often requires finding and reading a large amount of data, sorting it and aggregating it</li>
<li>others look at historical data in varying amounts of detail, that is drill up, down and through the data and in so doing may exploit pre-built aggregations or OLAP cubes</li>
<li>then there are those with an interest in the here and now, the operational reports, perhaps from live (transactional) feeds.</li>
<li>finally, in the historical data camp, there are the <em>miners</em> looking for relationships and connections between events.</li>
<li>and others are using the past to predict the future, looking at current events and a knowledge of past patterns to apply probabilities of outcomes.</li>
</ul>
<p>But if BI is all of the above, how can we build a single physical model that encompasses it all? The answer is that we probably can&#8217;t. To an extent, the need for co-located data to minimise bulk data reads runs counter to the needs of data mining; partitioning historic data flies in the face of live transactional feeds (it can be done though); and how do we do light-weight speedy predictions in a way that can be used in real-time by an agent using a CRM system?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/10/data-centric-bi-trends/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Brighton Werewolves</title>
		<link>http://www.rittmanmead.com/2007/06/brighton-werewolves/</link>
		<comments>http://www.rittmanmead.com/2007/06/brighton-werewolves/#comments</comments>
		<pubDate>Tue, 05 Jun 2007 17:26:35 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Data Warehousing]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/06/05/brighton-werewolves/</guid>
		<description><![CDATA[Data miners at Sussex police have noticed a correlation between violent incidents and a full moon and based on this fact deploy more police on the streets of Brighton at times of peak lunar brightness. Well it makes a change from noticing that nappies (diapers) and beer go together. Talking of Brighton, Jon Mead writes [...]]]></description>
			<content:encoded><![CDATA[<p>Data miners at <a href="http://news.bbc.co.uk/1/hi/england/kent/6723911.stm" target="_blank">Sussex police</a> have noticed a correlation between violent incidents and a full moon and based on this fact deploy more police on the streets of Brighton at times of peak lunar brightness. Well it makes a change from noticing that nappies (diapers) and beer go together.</p>
<p>Talking of Brighton, Jon Mead <a href="http://www.rittmanmead.com/2007/06/05/data-warehouse-healthchecks-and-owb-training/" target="_blank">writes</a> a few notes on data warehouse healthchecks. I completely agree with his approach &#8211; it is the same as mine :-) The article is worth a look. As Jon points out, the exact steps taken in any healthcheck vary &#8211; there is such a variety of design goals and techniques in building warehouses that there can be no such thing as the universal data warehouse diagnostic script. It needs technical experience and an eye for business data analysis.</p>
<p>But at the end of the day we are all trying to answer these four questions:</p>
<ol>
<li>Does the data load quickly enough?</li>
<li>Does the data load frequently enough?</li>
<li>Can the data be read quickly enough?</li>
<li>and, can you trust the quality of the data?</li>
</ol>
<p>Sounds simple but often needs an expert eye.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/06/brighton-werewolves/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Writing it large and reading it big</title>
		<link>http://www.rittmanmead.com/2007/03/writing-it-large-and-reading-it-big/</link>
		<comments>http://www.rittmanmead.com/2007/03/writing-it-large-and-reading-it-big/#comments</comments>
		<pubDate>Sat, 10 Mar 2007 20:19:11 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[BI (General)]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Data Warehousing]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/03/10/writing-it-large-and-reading-it-big/</guid>
		<description><![CDATA[Yesterday I ungraciously forgot to mention Nuno Souto&#8217;s blog piece from January where he was talking about just the same sort of issues with access to massive databases. Please go and read Noons&#8217; work, it is well worth it. I mentioned two challenges yesterday, putting the data into the database and finding it again. Systems [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I ungraciously forgot to mention Nuno Souto&#8217;s blog <a href="http://dbasrus.blogspot.com/2007/01/no-moore-part-2.html" title="Noons">piece</a> from January where he was talking about just the same sort of issues with access to massive databases. Please go and read Noons&#8217; work, it is well worth it.</p>
<p>I mentioned two challenges yesterday: putting the data into the database and finding it again. Systems that can do one of these steps very well often falter with the other. If you decide to use a very sophisticated tokenized column-based database to give you blistering data read access, you may find the computing required to work out where to insert the data is too expensive (in cost or time) to make the process viable; conversely, writing data directly to disk may give us the best insert rate but something less than optimal for reading. If you index the raw data, how much of an overhead is there in maintaining the index in the batch, and does that process interfere with concurrent user activity? After all, trickle-feed is now a viable data load strategy. If you don&#8217;t index, then how do you find what you are looking for? Not that indexes are always a help; take the example of looking for exception values: it is just not common to index fact measures.</p>
<p>By necessity our data load process is inserting one (or hopefully, many) data records into our data warehouse; that is, we insert complete records. Whether they come from flat files, XML files, database links, or whatever, they are converted into database rows, maybe cleaned up first, but at the end of the day, inserted into the DW &#8211; we may be doing fancy stuff on the way with transportable tablespaces, partition exchange or change data capture, but the net effect is the same. All of the data passes through our DW and ends up written to disk.</p>
<p>But getting the stuff back is not the same: we are generally interested in a subset of the information stored in the DW; we may be filtering on a group of dimensional attributes and, further, only looking at one or two of the fact measures stored. We may also aggregate this data down to create a yet smaller result set. But the problem with conventional DW technology is that the whole result set is brought back to the CPU on the data warehouse to be manipulated and, as we know, this can be slow. The database appliance vendors push out pre-processing of the results to the disk arrays; some vendors use clustered PCs serving single disks, and Netezza uses a field programmable gate array between the disk and backplane. This approach is particularly interesting in that some very sophisticated query logic can be executed next to the disk. The downside to processing &#8220;at the disk&#8221; is the case where the data interaction is <em>between</em> rows stored on different drives &#8211; as is the case in data mining and other pattern finding applications.</p>
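<p>As a rough sketch of what processing next to the disk buys you, the Python below filters and pre-aggregates each disk&#8217;s rows separately and then merges only the small partial results centrally. This is not how any particular appliance is implemented; the row layout, the function names and the sample data are assumptions made up for the example.</p>

```python
from collections import Counter

def scan_disk(rows, region):
    """Work done next to one disk: filter on region, then pre-aggregate one measure."""
    partial = Counter()
    for row in rows:
        if row["region"] == region:
            partial[row["product"]] += row["quantity"]
    return partial

def query(disks, region):
    """Work done centrally: merge the small partial results from every disk."""
    total = Counter()
    for rows in disks:
        total.update(scan_disk(rows, region))
    return dict(total)

# Hypothetical fact rows spread across two disks.
disks = [
    [
        {"region": "north", "product": "widget", "quantity": 3},
        {"region": "south", "product": "widget", "quantity": 7},
    ],
    [
        {"region": "north", "product": "widget", "quantity": 2},
        {"region": "north", "product": "gadget", "quantity": 1},
    ],
]
print(query(disks, "north"))
```

<p>This works because a sum can be computed per disk and the partial sums simply added together. A data mining question about relationships between rows held on different disks cannot be split up so neatly, which is exactly the downside noted above.</p>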
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/03/writing-it-large-and-reading-it-big/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Thoughts on extremely large databases and searching the unstructured</title>
		<link>http://www.rittmanmead.com/2007/01/thoughts-on-extremely-large-databases-and-searching-the-unstructured/</link>
		<comments>http://www.rittmanmead.com/2007/01/thoughts-on-extremely-large-databases-and-searching-the-unstructured/#comments</comments>
		<pubDate>Mon, 15 Jan 2007 20:37:18 +0000</pubDate>
		<dc:creator>Peter Scott</dc:creator>
				<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Data Warehousing]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.rittmanmead.com/2007/01/15/thoughts-on-extremely-large-databases-and-searching-the-unstructured/</guid>
		<description><![CDATA[Nuno Souto posts an interesting set of thoughts on Extremely Large Databases. As usual, it is a well thought through post from someone who is probably scarred for life from actually working with large databases. In a data warehouse (or even a very large transactional system) context the reader is led to an inevitability of [...]]]></description>
			<content:encoded><![CDATA[<p>Nuno Souto posts an interesting set of thoughts on <a href="http://dbasrus.blogspot.com/2007/01/no-moore-part-2.html" target="_blank">Extremely Large Databases</a>. As usual, it is a well thought through post from someone who is probably scarred for life from actually working with large databases. In a data warehouse (or even a very large transactional system) context the reader is led to an inevitability of partitioning (distributing) the data so that techniques such as indexing or table scans can succeed in a usably short time frame.</p>
<p>It can be argued that, coupled with sensible partitioning schemes, we can minimise IO in a data warehouse (of any large size) by building pre-aggregated summary tables; the pain of the build query happens just once instead of each time the aggregate is needed, and by getting the partition granularity right we can restrict aggregation operations to just the partitions that need rebuilding.</p>
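<p>The partition-wise rebuild can be sketched in a few lines of Python. This is only an illustration of the principle, not database code: the month-based partition keys, the function names and the sample sales rows are all assumptions made up for the example.</p>

```python
from collections import defaultdict

def aggregate_partition(rows):
    """Sum sales per product for the rows of one partition."""
    totals = defaultdict(float)
    for product, amount in rows:
        totals[product] += amount
    return dict(totals)

def refresh_summary(summary, fact_partitions, changed):
    """Rebuild summary entries only for the partitions named in `changed`."""
    for key in changed:
        summary[key] = aggregate_partition(fact_partitions[key])
    return summary

# Hypothetical fact data, partitioned by month.
fact_partitions = {
    "2007-01": [("widget", 10.0), ("gadget", 5.0), ("widget", 2.5)],
    "2007-02": [("widget", 4.0)],
}
summary = refresh_summary({}, fact_partitions, changed=fact_partitions.keys())

# A late-arriving row touches February only, so only that partition is rebuilt.
fact_partitions["2007-02"].append(("gadget", 6.0))
refresh_summary(summary, fact_partitions, changed=["2007-02"])
print(summary["2007-02"])
```

<p>The January summary is computed once and never touched again; the cost of the refresh is proportional to the changed partition rather than to the whole fact table.</p>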
<p>But what when we stray from the domain of structured DSS type queries? That is, away from those queries where (even for ad-hoc queries) we remain able to use our wisely chosen aggregates or exploit the partitioning scheme of the base data. Suppose we look at data mining, where the statistical relationships between data items are being determined, or we search unstructured data for patterns, relationships or just index its content. And nowadays unstructured data is not just <em>text</em> (if it ever was): we have speech and speaker recognition, and image recognition from the (now) trivial OCR, through fingerprints and facial recognition, to the more complex ability to search libraries of image components to identify photographic locations. These require inventive indexing techniques but they also require fast access to the underlying data, and for big datasets is that going to be possible?</p>
<p>Maybe ELDBs are not the way to go for unstructured data or data that needs extensive analysis. Perhaps the way to go here is through database federation, keeping the computation close to the disk and coordinating the outputs to produce the end result.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.rittmanmead.com/2007/01/thoughts-on-extremely-large-databases-and-searching-the-unstructured/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

