I have a script that builds my hadoop classes. Try:
#!/bin/bash
program=`echo $1 | awk -F "." '{print $1}'`
if [ ! -d "${program}_classes" ]
then mkdir ${program}_classes/;
fi
javac -classpath /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.0.1.jar:/usr/lib/hadoop/client/h\
adoop-mapreduce-client-core-2.0.0-cdh4.0.1.jar -d ${program}_classes/ $1
jar -cvf ${program}.jar -C ${program}_classes/ .;
You were probably missing the key jars:
/usr/lib/hadoop/hadoop-common-2.0.0-cdh4.0.1.jar
and
/usr/lib/hadoop/client/hadoop-mapreduce-client-core-2.0.0-cdh4.0.1.jar
By this phrase:
With the combination of StartDateTime
and EndDateTime
, they would be unique.
you mean that they never overlap or that they satisfy the database UNIQUE
constraint?
If the former, then you can use the StartDateTime
in joins, but note that it may be inefficient, since it will use a "<="
condition instead of "="
.
If the latter, then just use a fake identity.
Databases in general do not allow an efficient algorithm for this query:
SELECT *
FROM TransactionState
WHERE @value BETWEEN StartDateTime AND EndDateTime
, unless you do arcane tricks with SPATIAL
data.
That's why you'll have to use this condition in a JOIN
:
SELECT *
FROM factTable
CROSS APPLY
(
SELECT TOP 1 *
FROM TransactionState
WHERE StartDateTime <= factDateTime
ORDER BY
StartDateTime DESC
)
, which will deprive the optimizer of possibility to use HASH JOIN
, which is most efficient for such queries in many cases.
See this article for more details on this approach:
Rewriting the query so that it can use HASH JOIN
resulted in 600%
times performance gain, though it's only possible if your datetimes have accuracy of a day or lower (or a hash table will grow very large).
Since your time component is stripped of your StartDateTime
and EndDateTime
, you can create a CTE
like this:
WITH cal AS
(
SELECT CAST('2009-01-01' AS DATE) AS cdate
UNION ALL
SELECT DATEADD(day, 1, cdate)
FROM cal
WHERE cdate <= '2009-03-01'
),
state AS
(
SELECT cdate, ts.*
FROM cal
CROSS APPLY
(
SELECT TOP 1 *
FROM TransactionState
WHERE StartDateTime <= cdate
ORDER BY
StartDateTime DESC
) ts
WHERE ts.EndDateTime >= cdate
)
SELECT *
FROM factTable
JOIN state
ON cdate = DATE(factDate)
If your date ranges span more than 100
dates, adjust MAXRECURSION
option on CTE
.
Best Answer
Add hadoop-annotations-2.0.0-cdh4.0.1.jar to the classpath