11 Oct 2024
The code presents data sharing as a matter of norms: researchers have an ethical obligation to allow colleagues to verify results and, if possible, to make their dataset broadly available after they have finished with it
In 2006, a group of Dutch psychologists sought to obtain data for all empirical studies published in two issues of four major psychology journals
Data are confidential when researchers have information about the identity of participants but agree to keep identities private
Data are anonymous when researchers themselves do not know respondents’ identities
Direct identifiers are variables that obviously identify individuals, including names, email addresses, and social security numbers. Of course, these must be removed in order for the resulting dataset to be anonymous. But just because direct identifiers have been removed does not mean the data are truly anonymous
For instance, Netflix once sponsored a contest in which data scientists were given information about how users had rated some films and asked to predict how those users would rate others. The dataset included only seemingly minimal information: a random ID number for individuals, the film, the date of the rating, and the number of stars given. Narayanan and Shmatikov (2008) demonstrated that one could use correspondences in the dates of reviews between the Netflix data and the IMDB site to match some IDs in the Netflix data to IMDB usernames, and some IMDB usernames could then be easily connected to people’s real identities
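The linkage idea behind this example can be sketched with toy data. Everything below is hypothetical (the IDs, usernames, films, and dates are made up for illustration), and the exact join on film and date is a simplification of the published attack rather than a reproduction of it:

```r
# Hypothetical "anonymised" ratings: random IDs, no names (toy data only)
anon_ratings <- data.frame(
  user_id = c(101, 101, 101, 202, 202),
  film = c("Film A", "Film B", "Film C", "Film A", "Film D"),
  rating_date = as.Date(c("2005-03-01", "2005-03-04", "2005-04-10",
                          "2005-03-02", "2005-05-20")),
  stars = c(4, 2, 5, 3, 4),
  stringsAsFactors = FALSE
)

# Hypothetical public reviews posted under a visible username
public_reviews <- data.frame(
  username = c("moviefan", "moviefan", "moviefan"),
  film = c("Film A", "Film B", "Film C"),
  review_date = as.Date(c("2005-03-01", "2005-03-04", "2005-04-10")),
  stringsAsFactors = FALSE
)

# Join the two sources on film and date
matches <- merge(
  anon_ratings, public_reviews,
  by.x = c("film", "rating_date"),
  by.y = c("film", "review_date")
)

# Count how many film/date pairs each (ID, username) pair shares
match_counts <- aggregate(film ~ user_id + username, data = matches, FUN = length)
names(match_counts)[names(match_counts) == "film"] <- "n_matches"
print(match_counts)
```

An ID that shares several film and date pairs with one username is a candidate re-identification, even though no direct identifier was ever released.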
Researchers have often made data public by simply posting a file on their personal websites, but this is far from best practice nowadays. Personal websites come and go, whereas data archivists’ jobs require them to think in terms of posterity
A clear and thoughtful description of what should go in a data replication package, for instance, has been provided by the American Journal of Political Science (AJPS; Jacoby and Lupton 2016). Their guidelines break the work of assembling a replication package into four components: README file; analysis datasets; software commands; and information to reconstruct the analysis dataset
The README file walks others through the contents of the replication package and how they relate to the contents of the project. We think this is most effectively organised by presenting the data files to be used first, and then referring to each table, figure, or result in the paper and indicating the command file (or header within a larger command file) used to generate each one
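One way to keep such a README in step with the project is to generate it from a small lookup table. The sketch below is only an illustration: the data file name, the result labels, and the pairing of results with command files are all hypothetical.

```r
# Hypothetical data file and description
data_files <- c(
  "data/analysis-data.csv" = "Analysis dataset used for all tables and figures"
)

# Hypothetical mapping from each reported result to the command file behind it
results_map <- data.frame(
  result = c("Table 1", "Figure 1"),
  command_file = c("02-exploratory-analysis.R", "03-model-approach-1.R"),
  stringsAsFactors = FALSE
)

# Data files first, then each result with its command file
readme_lines <- c(
  "Replication package",
  "",
  "Data files",
  paste0("- ", names(data_files), ": ", data_files),
  "",
  "Results and the command files that generate them",
  paste0("- ", results_map$result, ": ", results_map$command_file)
)

# Written to a temporary directory here; in practice, the package root
readme_path <- file.path(tempdir(), "README.txt")
writeLines(readme_lines, readme_path)
cat(readme_lines, sep = "\n")
```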
A README is often the first item a visitor will see when visiting your repository. README files typically include information on:
You can add a README file to a repository to communicate important information about your project. A README, along with a repository license, citation file, contribution guidelines, and a code of conduct, communicates expectations for your project and helps you manage contributions
For more information about providing guidelines for your project, see “Adding a code of conduct to your project” and “Setting up your project for healthy contributions”
If a repository contains more than one README file, then the file shown is chosen from locations in the following order: the .github directory, then the repository’s root directory, and finally the docs directory
Start your script with the packages you need. That way, if you share your code with others, they can easily see which packages they need to install
Note that you should never include install.packages() in a script you share
If you have multiple packages to install, then please consider using the pacman package
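A minimal sketch of a script header that follows this advice. Base R packages stand in for your own so that the sketch runs anywhere, and stopping with a message when something is missing is one possible design, not the only one. Note that pacman::p_load(), shown in a comment, installs missing packages for you, so whether to use it in shared code is a judgement call.

```r
# Packages used by this script, declared at the top.
# Base R packages are placeholders here; list your own instead.
required_pkgs <- c("stats", "utils")

# Tell the reader what is missing instead of installing it for them
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]
if (length(missing_pkgs) > 0) {
  stop("Please install these packages first: ",
       paste(missing_pkgs, collapse = ", "))
}

library(stats)
library(utils)

# Alternative with pacman (assumes pacman itself is already installed):
# pacman::p_load(stats, utils)
```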
Replace library(pkg) with groundhog.library(pkg, date)
groundhog.library() loads packages and their dependencies as they were available on CRAN on the chosen date
Computational reproducibility = code + data + environment + distribution
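The groundhog substitution described above might look like the following sketch. The package name and the date are chosen purely for illustration, and the guard around the call is an addition so that the sketch still runs on a machine without groundhog.

```r
# Illustrative snapshot date; choose one close to when the analysis was run
snapshot_date <- "2024-10-01"

# Instead of: library(dplyr)
if (requireNamespace("groundhog", quietly = TRUE)) {
  library(groundhog)
  # Loads dplyr and its dependencies as they were on CRAN on snapshot_date
  groundhog.library("dplyr", snapshot_date)
} else {
  message("groundhog is not installed; install it once from the console.")
}
```

Keeping the date in a single variable at the top of the script means every package in the project is pinned to the same snapshot.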
Roger Peng’s checklist
File names should be machine readable: avoid spaces, symbols, and special characters. Don’t rely on case sensitivity to distinguish files
File names should be human readable: use file names to describe what’s in the file
File names should play well with default ordering: start file names with numbers so that alphabetical sorting puts them in the order they get used
01-load-data.R
02-exploratory-analysis.R
03-model-approach-1.R
04-model-approach-2.R
fig-01.png
fig-02.png
report-2022-03-20.qmd
report-2022-04-02.qmd
report-draft-notes.txt
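The three naming rules can be checked mechanically. In this sketch the allowed character set (letters, digits, dots, hyphens, underscores) is an assumed reading of "machine readable", and the helper names are hypothetical:

```r
files <- c(
  "01-load-data.R", "02-exploratory-analysis.R",
  "03-model-approach-1.R", "04-model-approach-2.R",
  "fig-01.png", "fig-02.png",
  "report-2022-03-20.qmd", "report-2022-04-02.qmd",
  "report-draft-notes.txt"
)

# Machine readable: no spaces, symbols, or special characters (assumed rule)
is_machine_readable <- function(x) grepl("^[A-Za-z0-9._-]+$", x)

# Case clash: two names that differ only in upper/lower case
has_case_clash <- function(x) any(duplicated(tolower(x)))

bad_name <- "My Report (final).qmd"  # hypothetical counter-example

all_ok <- all(is_machine_readable(files))
bad_ok <- is_machine_readable(bad_name)
clash <- has_case_clash(files)

# Locale-independent alphabetical sort, to check the default ordering
sorted <- sort(files, method = "radix")
print(sorted)
```

Numbered prefixes keep the scripts in the order they are run, and dates written as YYYY-MM-DD sort chronologically under plain alphabetical ordering.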
Step 1: Environment is part of your project. If someone cannot reproduce your environment, they won’t be able to run your code
Step 2: For each project, create a project directory named after the project
Step 3: Launch RStudio. Choose File > New Project > Browse existing directories > Create Project. This gives each project its own workspace
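Steps 2 and 3 can be complemented by a consistent folder layout inside the project directory. The sketch below builds one in a temporary location; the project name and subfolder names are hypothetical choices, not a required convention.

```r
# Hypothetical project name; the directory is named after the project
project_name <- "my-project"
project_dir <- file.path(tempdir(), project_name)

# Assumed layout: one folder each for data, scripts, and figures
subdirs <- c("data", "scripts", "figures")
for (d in subdirs) {
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# Inside the project, refer to files relative to the project root
data_path <- file.path("data", "raw.csv")  # hypothetical file

print(list.dirs(project_dir, full.names = FALSE, recursive = FALSE))
```

When the directory is opened as an RStudio project, the working directory is set to the project root, so relative paths like the one above work on someone else's machine without editing.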
Imagine you did some work, committed the changes, and pushed them to the remote repo. But you’d like to undo those changes
Say you added some plain text by mistake to README.md. Running git revert will do the opposite of what you just did (i.e., remove the plain text) and create a new commit. You can then git push this to the remote.
SOCS0100 – Computational Social Science