Data Science (Reprint Edition)
Publication date: June 2014
Pages: 408
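Now that people have realized that data can make the difference in an election or a business model, data science is gaining ground as an occupation. But how do you get started in such a broad and intricate interdisciplinary field? This insightful book, compiled from the lecture notes of Columbia University's data science course, tells you everything you need to know.
In most of these chapter-length lectures, data scientists from companies such as Google, Microsoft, and eBay share new algorithms, methods, and models by presenting case studies and the code they use. If you are familiar with linear algebra, probability, and statistics and have programming experience, this book is an excellent introduction to data science.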
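Topics include:
· Statistical inference, exploratory data analysis, and the data science process
· Algorithms
· Spam filters, Naive Bayes, and data wrangling
· Logistic regression
· Financial modeling
· Recommendation engines and causality
· Data visualization
· Social networks and data journalism
· Data engineering, MapReduce, Pregel, and Hadoop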
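Rachel Schutt, Senior Vice President of Data Science at News Corp, is an adjunct professor of statistics at Columbia University and a founding member of the Education Committee for the Institute for Data Sciences and Engineering.
Cathy O'Neil, a senior data scientist at Johnson Research Labs, holds a PhD in mathematics from Harvard University, was a postdoc in the MIT mathematics department, and was formerly a professor at Barnard College.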
- preface
- 1. introduction: what is data science
- big data and data science hype
- getting past the hype
- why now
- datafication
- the current landscape (with a little history)
- data science jobs
- a data science profile
- thought experiment: meta-definition
- ok, so what is a data scientist, really
- in academia
- in industry
- 2. statistical inference, exploratory data analysis, and the data science process
- statistical thinking in the age of big data
- statistical inference
- populations and samples
- populations and samples of big data
- big data can mean big assumptions
- modeling
- exploratory data analysis
- philosophy of exploratory data analysis
- exercise: eda
- the data science process
- a data scientist's role in this process
- thought experiment: how would you simulate chaos
- case study: realdirect
- how does realdirect make money
- exercise: realdirect data strategy
- 3. algorithms
- machine learning algorithms
- three basic algorithms
- linear regression
- k-nearest neighbors (k-nn)
- k-means
- exercise: basic machine learning algorithms
- solutions
- summing it all up
- thought experiment: automated statistician
- 4. spam filters, naive bayes, and wrangling
- thought experiment: learning by example
- why won't linear regression work for filtering spam
- how about k-nearest neighbors
- naive bayes
- bayes law
- a spam filter for individual words
- a spam filter that combines words: naive bayes
- fancy it up: laplace smoothing
- comparing naive bayes to k-nn
- sample code in bash
- scraping the web: apis and other tools
- jake's exercise: naive bayes for article classification
- sample r code for dealing with the nyt api
- 5. logistic regression
- thought experiments
- classifiers
- runtime
- you
- interpretability
- scalability
- m6d logistic regression case study
- click models
- the underlying math
- 6. time stamps and financial modeling
- 7. extracting meaning from data
- 8. recommendation engines: building a user-facing data product at scale
- 9. data visualization and fraud detection
- 10. social networks and data journalism
- 11. causality
- 12. epidemiology
- 13. lessons learned from data competitions: data leakage and model evaluation
- 14. data engineering: mapreduce, pregel, and hadoop
- 15. the students speak
- 16. next-generation data scientists, hubris, and ethics
- index
Title: Data Science (Reprint Edition)
Chinese publisher: Southeast University Press
ISBN: 978-7-5641-4984-0
Original title: Doing Data Science
Original publisher: O'Reilly Media
Rachel Schutt
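Senior Vice President of Data Science at News Corp, adjunct professor in the Department of Statistics at Columbia University, and senior research scientist at Johnson Research Labs. She is also one of the founding members of the Education Committee for Columbia University's Institute for Data Sciences and Engineering. She previously spent several years at Google Research, where she designed algorithm prototypes and built models to understand user behavior.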
Cathy O'Neil