Cloud era - Hello, Pro Hadoop
1. MapReduce Model

2. Web Crawler Samples

4. What the MapReduce Framework Needs to Know

2. Web Crawler Samples
- Ingest the URLs and their associated metadata.
- Normalize the URLs.
- Eliminate duplicate URLs.
- Filter the URLs against a set of exclusion and inclusion filters.
- Filter the URLs against a do not fetch list.
- Filter the URLs against a recently seen set.
- Fetch the URLs.
- Fingerprint the content items.
- Update the recently seen set.
- Prepare the work list for the next application.
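
The steps above can be sketched as a small single-process pipeline. This is only an illustration of the data flow, not how the book runs it on Hadoop: the function names (normalize, select_urls, crawl), the regex-based filters, the SHA-256 fingerprint, and the injected fetch callable are all assumptions made for the example.

```python
import hashlib
import re
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def select_urls(records, include, exclude, do_not_fetch, recently_seen):
    """Ingest (url, metadata) pairs, then normalize, deduplicate, and filter."""
    unique = {}
    for url, meta in records:
        # Duplicates collapse onto one normalized key; the first metadata wins.
        unique.setdefault(normalize(url), meta)

    selected = {}
    for url, meta in unique.items():
        if not any(re.search(p, url) for p in include):
            continue  # matches no inclusion filter
        if any(re.search(p, url) for p in exclude):
            continue  # matches an exclusion filter
        if url in do_not_fetch or url in recently_seen:
            continue  # on the do-not-fetch list or fetched recently
        selected[url] = meta
    return selected


def crawl(records, fetch, include, exclude, do_not_fetch, recently_seen):
    """Fetch, fingerprint, update the seen set, and build the next work list.

    fetch is a caller-supplied function mapping a URL to bytes (hypothetical;
    a real crawler would also handle errors, politeness delays, and so on).
    """
    work_list = []
    selected = select_urls(records, include, exclude, do_not_fetch, recently_seen)
    for url, meta in selected.items():
        content = fetch(url)
        fingerprint = hashlib.sha256(content).hexdigest()
        recently_seen.add(url)  # mutates the caller's set
        work_list.append((url, fingerprint, meta))
    return work_list
```

Passing fetch in as a parameter keeps the sketch free of network access, so each stage can be tried on a handful of URLs before thinking about how to split the stages into map and reduce tasks.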

4. What the MapReduce Framework Needs to Know
- The location(s) in the distributed file system of the job input
- The location(s) in the distributed file system for the job output
- The input format
- The output format
- The class containing the map function
- Optionally, the class containing the reduce function
- The JAR file(s) containing the map and reduce functions and any support classes
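
One way to keep this checklist in mind is to write it down as a plain record. The JobSpec class below is hypothetical and is not part of Hadoop; in Hadoop's Java API the same information is set on a job configuration object (JobConf in the older org.apache.hadoop.mapred API). The field names and the missing/is_map_only helpers are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical record of what a MapReduce job submission must state."""
    input_paths: Sequence[str]    # location(s) of the job input
    output_paths: Sequence[str]   # location(s) for the job output
    input_format: str             # class name of the input format
    output_format: str            # class name of the output format
    mapper_class: str             # class containing the map function
    jar_files: Sequence[str]      # JAR file(s) with the job's classes
    reducer_class: Optional[str] = None  # optional: class with the reduce function

    def missing(self) -> List[str]:
        """Return the names of required items that are empty."""
        required = {
            "input_paths": self.input_paths,
            "output_paths": self.output_paths,
            "input_format": self.input_format,
            "output_format": self.output_format,
            "mapper_class": self.mapper_class,
            "jar_files": self.jar_files,
        }
        return [name for name, value in required.items() if not value]

    def is_map_only(self) -> bool:
        """A job with no reduce class runs the map phase only."""
        return self.reducer_class is None
```

For example, a map-only job could be described as follows (the paths and JAR name are made up):

```python
spec = JobSpec(
    input_paths=["/data/in"],
    output_paths=["/data/out"],
    input_format="org.apache.hadoop.mapred.KeyValueTextInputFormat",
    output_format="org.apache.hadoop.mapred.TextOutputFormat",
    mapper_class="org.apache.hadoop.mapred.lib.IdentityMapper",
    jar_files=["my-job.jar"],
)
```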

