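I am looking for some large public datasets, in particular:
Large sample web server logs that have been anonymized.
Datasets used for database performance benchmarking.
Any other links to large public datasets would be appreciated. I already know about Amazon's public datasets: http://aws.amazon.com/publicdatasets/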
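1. Large sample web server logs that have been anonymized.
A good place to start:
UCI Machine Learning Repository
Anonymous Microsoft Web Data
MSNBC.com Anonymous Web Data
Syskill and Webert Web Page Ratings
There are more datasets than these (see the other answers for fuller lists), but these are the lowest-hanging fruit that match your original criteria. As a bonus, the repository has a contact link in case you have specific needs they might know about.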
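2. Datasets used for database performance benchmarking.
This sounds like a bit of a misnomer, because you are asking for empirical datasets to describe what are well-defined algorithmic problems. Specifically, it sounds like you are trying to find datasets of well-defined, normalized relational data that you can use to live-test and benchmark a variety of database systems, with the data serving as a set of test cases for determining the most efficient solution for your needs.
I disagree with this approach. Rather than finding a litany of database systems and their fixed implementations, it is better to make the algorithmic guarantees of these systems your first port of call. Once you have determined the algorithmic constraints that fit your needs, you can look at a set of fixed solutions and benchmark their efficiency for operations such as indexing, sorting, searching, insertion, deletion, and retrieval.
Wikipedia has a concise article on database testing concepts that you can use to determine and write test cases for benchmarking performance. For example, you can use an agnostic data access interface (such as JDBC, together with JDBC Benchmark) to determine the relative time of each operation. From there, you can home in on the right solution.
In summary, research the guarantees of the databases first. Once you have identified a set of candidate solutions, you can choose among them by testing (or otherwise determining) the constant-time performance of each operation you need.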
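As a rough illustration of timing individual operations through a database-agnostic interface, here is a minimal sketch in Python. It uses the standard DB-API with the built-in sqlite3 module as a stand-in (an assumption; the answer itself mentions JDBC). The table name `bench`, the helpers `timed` and `run_benchmark`, and the row count are all made up for this example.

```python
import sqlite3
import time


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_benchmark(conn, n_rows=10_000):
    """Time a few basic operations through the DB-API and return the timings.

    Assumption: the driver uses the "qmark" parameter style (?), as sqlite3
    does. Other DB-API drivers may need a different placeholder such as %s.
    """
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS bench")
    cur.execute("CREATE TABLE bench (id INTEGER PRIMARY KEY, payload TEXT)")
    rows = [(i, f"row-{i}") for i in range(n_rows)]

    def insert():
        # Bulk insertion of synthetic rows.
        cur.executemany("INSERT INTO bench (id, payload) VALUES (?, ?)", rows)
        conn.commit()

    def lookup():
        # Point lookup on the primary key (served by an index).
        cur.execute("SELECT payload FROM bench WHERE id = ?", (n_rows // 2,))
        return cur.fetchone()

    def scan():
        # Predicate on an unindexed column, which forces a full scan.
        cur.execute("SELECT COUNT(*) FROM bench WHERE payload LIKE '%7'")
        return cur.fetchone()[0]

    def delete():
        # Remove every row with an even id.
        cur.execute("DELETE FROM bench WHERE id % 2 = 0")
        conn.commit()

    timings = {}
    for name, fn in [("insert", insert), ("lookup", lookup),
                     ("scan", scan), ("delete", delete)]:
        _, timings[name] = timed(fn)
    return timings


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    for operation, seconds in run_benchmark(connection).items():
        print(f"{operation:>8}: {seconds * 1000:.2f} ms")
    connection.close()
```

To compare candidate systems, the same function could be pointed at connections from other DB-API drivers. A single run per operation is noisy, so repeated runs and realistic data sizes would be needed before drawing conclusions.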
Based on Quora answers and my own collection from my studies, an awesome-public-datasets repository was created on GitHub and is actively updated:
Below is a snapshot of this list. For the latest version, please visit GitHub:
This list of public data sources was collected and tidied from blogs, answers, and user responses. Most of the datasets listed below are free; however, some are not. The list comes from https://github.com/caesar0301/awesome-public-datasets.
Australian Weather: http://www.bom.gov.au/climate/dwo/
Climate data: http://www.cru.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/
Global climate data since 1929: http://www.tutiempo.net/en/Climate
NOAA Bering Sea Climate: http://www.beringclimate.noaa.gov/
NOAA Climate Datasets: http://ncdc.noaa.gov/data-access/quick-links
WU Historical Weather Worldwide: http://www.wunderground.com/history/index.html
American Economic Association (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete
EconData (UMD): http://inforumweb.umd.edu/econdata/econdata.html
Internet Product Code Database: http://www.upcdatabase.com/
World Bank: http://data.worldbank.org/indicator
CBOE Futures Exchange: http://cfe.cboe.com/Data/
Google Finance: https://www.google.com/finance
Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
NASDAQ: https://data.nasdaq.com/
OANDA: http://www.oanda.com/
OSU Financial data: http://fisher.osu.edu/fin/osudata.htm
Quandl: http://www.quandl.com/
St Louis Federal: http://research.stlouisfed.org/fred2/
Yahoo Finance: http://finance.yahoo.com/
CRCNS: http://crcns.org/data-sets
Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
Human Microbiome Project: http://www.hmpdacc.org/reference_genomes/reference_genomes.php
MIT Cancer Genomics Data: http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
NIH Microarray data: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE6532/
Protein structure: http://www.infobiotic.net/PSPbenchmarks/
Public Gene Data: http://www.pubgene.org/
Stanford Microarray Data: http://smd.stanford.edu/
UniGene: http://www.ncbi.nlm.nih.gov/unigene
NASA: http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html
EHDP Large Health Data Sets: http://www.ehdp.com/vitalnet/datasets.htm
Gapminder: http://www.gapminder.org/data/
Medicare Data File: http://go.cms.gov/19xxPN4
EOSDIS: http://sedac.ciesin.columbia.edu/data/sets/browse
Factual Global Location Data: http://www.factual.com/
Geo Spatial Data: http://geodacenter.asu.edu/datalist/
Airlines Data (2009 ASA Challenge): http://stat-computing.org/dataexpo/2009/the-data.html
Airports and their locations: http://www.infochimps.com/datasets/airports-and-their-locations
Bike Share Data Systems: https://github.com/BetaNYC/Bike-Share-Data-Best-Practices/wiki/Bike-Share-Data-Systems
Edge data for US domestic flights 1990 to 2009: http://data.memect.com/?p=229
Half a million Hubway rides: http://hubwaydatachallenge.org/trip-history-data/
NYC Taxi Trip Data 2013 (FOIA/FOIL): https://archive.org/details/nycTaxiTripData2013
OpenFlights (airport, airline and route data): http://openflights.org/data.html
RITA Airline On-Time Performance Data: http://www.transtats.bts.gov/Tables.asp?DB_ID=120
RITA transport data collection: http://www.transtats.bts.gov/DataIndex.asp
Transport for London: http://www.tfl.gov.uk/info-for/open-data-users/our-feeds
U.S. Freight Analysis Framework: http://ops.fhwa.dot.gov/freight/freight_analysis/faf/index.htm
Archive-it: https://www.archive-it.org/explore?show=Collections
Australia: http://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3301.02009?OpenDocument
Canada: http://www.data.gc.ca/default.asp?lang=En&n=5BCD274E-1
Chicago: https://data.cityofchicago.org/
FDA: https://open.fda.gov/index.html
Fed Stats: http://www.fedstats.gov/cgi-bin/A2Z.cgi
Guardian world governments: http://www.guardian.co.uk/world-government-data
HUD: http://www.huduser.org/portal/datasets/pdrdatas.html
London Datastore, U.K: http://data.london.gov.uk/dataset
New Zealand: http://www.stats.govt.nz/browse_for_stats.aspx
NYC betanyc: http://betanyc.us/
NYC Open Data: http://nycplatform.socrata.com/
OECD: http://www.oecd.org/document/0,3746,en_2649_201185_46462759_1_1_1_1,00.html
RITA: http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
San Francisco Data sets: http://datasf.org/
The World Bank: http://wdronline.worldbank.org/
U.K. Government Data: http://data.gov.uk/data
U.S. Census Bureau: http://www.census.gov/data.html
U.S. Federal Government Agencies: http://www.data.gov/metric
U.S. Federal Government Data Catalog: http://catalog.data.gov/dataset
U.S. Open Government: http://www.data.gov/open-gov/
UK 2011 Census Open Atlas Project: http://www.alex-singleton.com/2011-census-open-atlas-project/
United Nations: http://data.un.org/
US CDC Public Health datasets: http://www.cdc.gov/nchs/data_access/ftp_data.htm
Challenges in Machine Learning: http://www.chalearn.org/
ICWSM Data Challenge (since 2009): http://icwsm.cs.umbc.edu/
Kaggle Competition Data: http://www.kaggle.com/
KDD Cup by Tencent 2012: https://www.kddcup2012.org/
Netflix Prize: http://www.netflixprize.com/leaderboard
Yelp Dataset Challenge: http://www.yelp.com/dataset_challenge
eBay Online Auctions: http://www.modelingonlineauctions.com/datasets
IMDb database: http://www.imdb.com/interfaces
Keel Repository: http://sci2s.ugr.es/keel/datasets.php
Lending Club Loan Data: https://www.lendingclub.com/info/download-data.action
Machine Learning Data Set Repository: http://mldata.org/
Million Song Dataset: http://blog.echonest.com/post/3639160982/million-song-dataset
More Song Datasets: http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
MovieLens Data Sets: http://datahub.io/dataset/movielens
RDataMining R and Data Mining ebook data: http://www.rdatamining.com/data
Registered meteorites on Earth: http://www.analyticbridge.com/profiles/blogs/registered-meteorites-that-has-impacted-on-earth-visualized
SF restaurants dataset: http://missionlocal.org/san-francisco-restaurant-health-inspections/
UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/
University of Toronto Delve Datasets: http://www.cs.toronto.edu/~delve/data/datasets.html
Yahoo Ratings and Classification Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=r
40 Million Entities in Context: https://code.google.com/p/wiki-links/downloads/list
ClueWeb09 FACC: http://lemurproject.org/clueweb09/FACC1/
ClueWeb12 FACC: http://lemurproject.org/clueweb12/FACC1/
Flickr personal taxonomies: http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
Google Books Ngrams: http://aws.amazon.com/datasets/8172056142375670
Google Web 5gram, 2006 (1T): https://catalog.ldc.upenn.edu/LDC2006T13
Gutenberg eBooks List: http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
Hansards: http://www.isi.edu/natural-language/download/hansard/
Machine Translation: http://statmt.org/wmt11/translation-task.html#download
SMS Spam Collection: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
USENET corpus: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
WordNet: http://wordnet.princeton.edu/wordnet/download/
2GB of photos of cats: http://bit.do/UJZZ
Face Recognition Benchmark: http://www.face-rec.org/databases/
ImageNet: http://www.image-net.org/
Time Series data Library: https://datamarket.com/data/list/?q=provider:tsdl
UC Riverside Time Series: http://www.cs.ucr.edu/~eamonn/time_series_data/
China Hotel Checkin/out data: http://www.360doc.com/content/13/1105/13/7863900_326788919.shtml
CMU Enron Email: http://www.cs.cmu.edu/~enron/
Facebook Social Networks (since 2007): http://law.di.unimi.it/datasets.php
Facebook100 (2005): https://archive.org/details/oxford-2005-facebook-matrix
Foursquare (2010,2011): http://www.public.asu.edu/~hgao16/dataset.html
Foursquare (UMN/Sarwat, 2013): https://archive.org/details/201309_foursquare_dataset_umn
General Social Survey (GSS): http://www3.norc.org/GSS+Website/
GetGlue (users rating TV shows): http://getglue-data.s3.amazonaws.com/getglue_sample.tar.gz
GitHub Archive: http://www.githubarchive.org/
ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp
Mobile Social Networks (UMASS): https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks
PewResearch Internet Project: http://www.pewinternet.org/datasets/pages/2/
Social Networking: http://www.cs.cmu.edu/~jelsas/data/ancestry.com/
SourceForge Graph: http://www.nd.edu/~oss/Data/data.html
Titanic Survival Data Set: https://github.com/caesar0301/awesome-public-datasets/blob/master/Datasets/titanic.csv.zip
Twitter Graph: http://an.kaist.ac.kr/traces/WWW2010.html
UC Berkeley's D-Lab Archive: http://ucdata.berkeley.edu/
UCLA Social Sciences Data Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm
UNIMI Social Network Datasets: http://law.di.unimi.it/datasets.php
Universities Worldwide: http://univ.cc/
UPJOHN for Employment Research: http://www.upjohn.org/erdc/erdc.html
Yahoo Graph and Social Data: http://webscope.sandbox.yahoo.com/catalog.php?datatype=g
Youtube Graph (2007,2008): http://netsg.cs.sfu.ca/youtubedata/
CrossRef DOI URLs: https://archive.org/details/doi-urls
DBLP Citation dataset: https://kdl.cs.umass.edu/display/public/DBLP
NBER Patent Citations: http://nber.org/patents/
NIST complex networks data collection: http://math.nist.gov/~RPozo/complex_datasets.html
Protein-protein interaction network: http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
PyPI and Maven Dependency Network: http://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/
Gene De Lisa:
Here are a few. Have fun.
http://archive.ics.uci.edu/ml/
http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1
http://crawdad.org/
http://data.austintexas.gov
http://data.cityofchicago.org
http://data.govloop.com
http://data.gov.uk/
http://data.medicare.gov
http://data.seattle.gov
http://data.sfgov.org
http://data.sunlightlabs.com
https://datamarket.azure.com/
http://ftp.ncbi.nih.gov/
http://gettingpastgo.socrata.com
http://books.google.com/ngrams/
http://linkeddata.org/
http://medihal.archives-ouvertes.fr
http://public.resource.org/
http://rechercheisidore.fr
http://reddit.com/r/datasets
http://timetric.com/public-data/
http://www2.jpl.nasa.gov/srtm
http://www.bls.gov/
http://www.crunchbase.com/
http://www.dartmouthatlas.org/
http://www.data.gov/
http://www.datakc.org
http://www.factual.com/
http://www.freebase.com/
http://www.infochimps.com
http://www.kaggle.com/
http://build.kiva.org/
http://www.imdb.com/interfaces
http://dbpedia.org
Just a thought:
The USGS geographic names database
The USDA plant list
Any one of the many state GIS repositories, such as NH's GRANIT