Big Data and Hadoop



What is Big Data?

Big data is a term used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. Every day we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

Why Big Data?

Data is growing at a huge rate, and all of that data is valuable for making critical decisions. Nowadays disks are cheap enough that we can afford to store it, but the amount of data is so large that it will not fit on a single computer, so it has to be distributed across many machines. With the data distributed, we can perform operations on it in parallel and therefore compute results faster. This is the trick behind Hadoop.
Big Data Challenges

  1. Velocity - Lots of data arriving at great speed.
  2. Volume - A large volume of data is collected, and it keeps growing exponentially.
  3. Variety - Data of many different varieties gets collected. It is not organized the way data in a relational database is; it may be audio, video, images, plain files, log files, and so on.

What is Hadoop?

Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. Hadoop is not a single piece of software; it is a framework of tools and is distributed under the Apache license. Essentially, it accomplishes two tasks: massive data storage and faster processing.

Traditional Data Storage Approach vs. Hadoop Storage

Traditionally, data is stored on a single computer and operations on that data are performed on the same machine. A single computer can only process data up to a certain threshold, and that is the limitation of the traditional storage approach. Hadoop takes a different approach: it breaks both the data and the computation into smaller pieces and spreads them across the cluster, and that is how it handles both the storage and the processing of big data.

Improve the Functionality of Exporting Data from OBIEE 11.1.1.7.1 into Excel



To improve the functionality of exporting data from analyses, dashboards, and other Oracle BI Presentation Catalog objects into Microsoft Excel, edit the settings of the OracleBIJavaHostComponent as follows:
  1. Navigate to the /instances/instance1/config/OracleBIJavaHostComponent folder.
  2. Perform the following actions:
    1. In the config.xml file, configure the XMLP tag for large data as follows:

<XMLP>
  <InputStreamLimitInKB>32768</InputStreamLimitInKB>
  <ReadRequestBeforeProcessing>false</ReadRequestBeforeProcessing>
</XMLP>

Note: Setting the InputStreamLimitInKB governor to zero (0), which means unlimited, should be done for testing purposes only. Setting the value too high allocates or consumes more resources than necessary for an individual request to the JavaHost, may cause the JavaHost to become unstable or crash, and should be considered in the context of all JavaHost requests (charts, graphs, exports). Set the value to something reasonable that works with your large data sets. The default is 8192 (8 MB), but you may need to increase it to 16384 (16 MB), 32768 (32 MB), and so on (1024 * X).

Note: If your organization uses the export feature in Oracle BI EE, it is recommended that you set ReadRequestBeforeProcessing to false. When set to false, data is streamed to the JavaHost gradually rather than saved to a file first and then processed, which improves export performance.

    2. In the xdo.cfg file, change the setting for xlsx-keep-values-in-same-column to true.

Note: If the entry does not exist, you can add it in the following format, placing it between the <properties> and </properties> tags:

<property name="xlsx-keep-values-in-same-column">true</property>
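For reference, a minimal xdo.cfg might look like the sketch below. The surrounding <config> element and namespace shown here are typical of BI Publisher configuration files, but verify them against your own xdo.cfg before editing:

<config version="1.0.0" xmlns="http://xmlns.oracle.com/oxp/config/">
   <properties>
      <!-- keep exported values in the same Excel column -->
      <property name="xlsx-keep-values-in-same-column">true</property>
   </properties>
</config>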

  3. Restart the WebLogic Administration Server, the Managed Server, and the Oracle BI system components.

CEF_23005 : Failed to create thread due to operating system error [Resource temporarily unavailable]



Sometimes when you trigger an ETL run via DAC, the ETL fails or hangs with one of the following error messages:

Informatica Session Log

CEF_23005 : Failed to create thread due to operating system error [Resource temporarily unavailable].

OR

DAC Log

ANOMALY INFO::: Error while executing : com.siebel.analytics.etl.etltask.ParallelTaskBatch:Create Index Batch
MESSAGE:::java.lang.OutOfMemoryError: unable to create new native thread
EXCEPTION CLASS::: java.lang.Exception


Solution

Increase the value of the parameter mentioned below on the Informatica server machine:

max user processes (-u)
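A minimal sketch of how this is typically checked and raised on Linux; the OS user name (informatica) and the limit value are examples only and should be tuned for your environment:

# 1. Check the current limit for the OS user that runs the Informatica services:
ulimit -u

# 2. Raise it persistently by adding lines like these to /etc/security/limits.conf
#    (user name "informatica" and the value 16384 are examples only):
#       informatica   soft   nproc   16384
#       informatica   hard   nproc   16384

# 3. Log out and back in (or restart the Informatica services) so the new limit takes effect.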

Scheduler tables in OBIEE 11g



Oracle BI Scheduler is a server that manages and schedules jobs. When a user creates and schedules an agent, Oracle BI Presentation Services gathers information about the agent such as its priority, the intended recipients, and the devices to which content should be delivered. Presentation Services packages this information and other characteristics into a job, then informs Oracle BI Scheduler when to execute the job.

Agents can run in parallel on different threads. The number of agents that can run in parallel depends on the size of the Scheduler thread pool (a configurable setting) and the number of threads used up by each agent. Queueing might occur if too many agents are triggered at the same time.

Oracle BI Scheduler uses a single back-end database to store pertinent information about a job, its instances, and its parameters.

Details of the Scheduler tables are given below:

S_NQ_JOB - This table is used by the Scheduler to store information about scheduled jobs. When you create a new agent from OBIEE, a corresponding entry gets created in the S_NQ_JOB table. The table is stored in the BI_PLATFORM schema, so to query it you need access to that schema.

Some of the columns in the table are described below:

Column Name           Description
JOB_ID                Unique identifier for each agent
NAME                  Name of the agent
NEXT_RUN_TIME_TS      Next scheduled run time of the agent
LAST_RUN_TIME_TS      Last run time of the agent
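As a quick illustration, assuming you are connected as a user with access to the BI_PLATFORM schema and that the columns exist as described above, a query like the following lists each agent and its schedule (a sketch, not a complete column list):

SELECT job_id,
       name,
       next_run_time_ts,
       last_run_time_ts
FROM   s_nq_job
ORDER  BY next_run_time_ts;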

S_NQ_INSTANCE - This table stores information about scheduled job instances. For each job in the S_NQ_JOB table there can be multiple entries in S_NQ_INSTANCE, one for each run of the agent.

Column Name           Description
JOB_ID                Identifier of the job, populated from S_NQ_JOB
INSTANCE_ID           Unique identifier for each instance
STATUS                Status of the agent run:
                      0 - Completed, 1 - Running, 2 - Failed,
                      3 - Cancelled, 5 - Timed out
BEGIN_TS              Start time of the instance
END_TS                End time of the instance
EXIT_CODE             Number of e-mails sent by the agent job after it completes

Relation with S_NQ_JOB:
S_NQ_JOB.JOB_ID = S_NQ_INSTANCE.JOB_ID
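Putting the two tables together, here is a sketch of a query that shows the run history of each agent with its status decoded using the codes listed above:

SELECT j.name,
       i.instance_id,
       i.begin_ts,
       i.end_ts,
       CASE i.status
         WHEN 0 THEN 'Completed'
         WHEN 1 THEN 'Running'
         WHEN 2 THEN 'Failed'
         WHEN 3 THEN 'Cancelled'
         WHEN 5 THEN 'Timed out'
       END AS status_desc
FROM   s_nq_job j
       JOIN s_nq_instance i ON j.job_id = i.job_id
ORDER  BY i.begin_ts DESC;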

S_NQ_ERR_MSG - This table stores error messages for Scheduler job instances that do not complete successfully. 

Column Name           Description
JOB_ID                Same as in S_NQ_JOB and S_NQ_INSTANCE
INSTANCE_ID           Same as in S_NQ_INSTANCE
ERROR_MSG_TXT         Error message for the agent failure; this is the same message that appears in the Agent.log file

Relation with S_NQ_INSTANCE:
S_NQ_INSTANCE.JOB_ID  =  S_NQ_ERR_MSG.JOB_ID
AND  S_NQ_INSTANCE.INSTANCE_ID  =  S_NQ_ERR_MSG.INSTANCE_ID
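Based on the join conditions above, a sketch of a troubleshooting query that pulls the error text for failed agent runs (status 2 per the codes listed earlier):

SELECT j.name,
       i.instance_id,
       i.begin_ts,
       e.error_msg_txt
FROM   s_nq_job j
       JOIN s_nq_instance i ON j.job_id = i.job_id
       JOIN s_nq_err_msg  e ON i.job_id = e.job_id
                           AND i.instance_id = e.instance_id
WHERE  i.status = 2   -- 2 = Failed
ORDER  BY i.begin_ts DESC;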
 
S_NQ_JOB_PARAM - This table holds information about Scheduler job parameters for scheduled jobs.
Relation with S_NQ_JOB:
S_NQ_JOB.JOB_ID = S_NQ_JOB_PARAM.JOB_ID