Wednesday, December 15, 2010
Documentation on its way
Saturday, November 20, 2010
Paper harvester and metadata extractor now available on github
Monday, November 15, 2010
Experimental code distribution now available
[Update: I should say that this is really not meant for production use yet, and the doc is basically missing.]
Tuesday, November 9, 2010
How to make a web site re-usable by third parties
Wednesday, August 18, 2010
Bulk ingestion protocols, which is best?
Tuesday, August 17, 2010
Selecting only English-language material when harvesting OAI metadata
We're now harvesting thousands of archives for PhilPapers, as described in my earlier post. But we've stumbled on a new problem which I thought I should report on here.
We only want English-language material on PhilPapers, but a lot of archives won't return language data, or will say that an item is in English when it's not (presumably because English is the default and users don't bother to change it). This is a serious obstacle to the automatic aggregation of metadata from OAI archives if you don't want your aggregation to be swamped by material your average user will consider pure noise.
Our solution to this problem has three components. First, we weed out archives which don't declare that they have English-language content on OpenDOAR. So we attempt to monitor an archive only if it says that it has material in English among other languages.
Second, we've found that language attributes tend to be truthful at least when they say that an item is not in English, so we weed out anything that is declared as not being in English.
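The first two filters might be sketched as follows. This is only an illustration, not the PhilPapers code: the record layout (a `languages` array on each archive and item) and the accepted spellings of "English" are assumptions made for the example.

```perl
use strict;
use warnings;

# Does a declared language value look like English? The accepted spellings
# ("en", "eng", "english", plus region suffixes) are an assumption.
sub is_english_label {
    my ($label) = @_;
    return lc($label) =~ /\A(?:en|eng|english)(?:[-_][a-z]+)?\z/ ? 1 : 0;
}

# Step 1: monitor an archive only if English is among its declared languages.
sub should_monitor {
    my ($archive) = @_;
    my $english = grep { is_english_label($_) } @{ $archive->{languages} || [] };
    return $english ? 1 : 0;
}

# Step 2: reject an item only when it declares at least one language and none
# of them is English. Items without language data go on to automatic detection.
sub declared_non_english {
    my ($item) = @_;
    my @langs = grep { defined($_) && length($_) } @{ $item->{languages} || [] };
    return 0 unless @langs;
    my $english = grep { is_english_label($_) } @langs;
    return $english ? 0 : 1;
}
```

Note the asymmetry: a declaration of English is not trusted (the item still goes through detection), while a declaration of another language is.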
Finally, we apply an automatic language detection test to the rest of the material. This is where it gets tricky.
We originally tried the Language::Guess module on CPAN, but it's not reliable enough.
We then tried simply checking what percentage of the words in an item's title and description are in the standard English dictionary that comes with aspell (the Unix program), but there are so many neologisms in philosophy that this excluded many English-language papers.
The final solution is to use aspell in this way, but with an enriched dictionary we compute based on our existing content. Currently we add a word to our dictionary of 'neologisms' if and only if it occurs in 10 or more PhilPapers entries that pass a strict English-only test. The strict test is that fewer than 7% of an entry's words are missing from the standard English dictionary. We need this test because a number of non-English papers have already made it into PhilPapers.
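A minimal sketch of the strict test and the neologism count is below. It is not our production code: a plain hash stands in for the aspell dictionary so that the example is self-contained, and the function names and the tiny word list are made up for illustration.

```perl
use strict;
use warnings;

# Stand-in for the standard aspell English dictionary (assumption: the real
# check asks aspell; a hash keeps this sketch self-contained).
my %english = map { $_ => 1 } qw(the of and a in on is are theory mind problem);

# Split text into lowercase words (letters and apostrophes only).
sub words_of {
    my ($text) = @_;
    return map { lc $_ } $text =~ /([A-Za-z']+)/g;
}

# Fraction of the words in $text that are absent from the dictionary.
sub unknown_ratio {
    my ($text, $dict) = @_;
    my @words = words_of($text);
    return 1 unless @words;    # no words at all: treat as not English
    my $unknown = grep { !$dict->{$_} } @words;
    return $unknown / scalar(@words);
}

# Strict test: fewer than 7% of words outside the standard dictionary.
sub passes_strict_test {
    my ($text, $dict) = @_;
    return unknown_ratio($text, $dict) < 0.07 ? 1 : 0;
}

# Collect words missing from the standard dictionary that occur in at least
# $min_entries entries passing the strict test (10 in the post above).
sub build_neologisms {
    my ($entries, $dict, $min_entries) = @_;
    my %count;
    for my $text (grep { passes_strict_test($_, $dict) } @$entries) {
        my %seen;    # count each word once per entry
        $count{$_}++ for grep { !$dict->{$_} && !$seen{$_}++ } words_of($text);
    }
    my %neologisms = map { $_ => 1 } grep { $count{$_} >= $min_entries } keys %count;
    return \%neologisms;
}
```

The enriched dictionary is then the standard one plus the returned words, and the final language test applies the same ratio check against that larger dictionary.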
We use aspell because it's supposed to be good at recognizing inflections and the like, and it also works well for providing spelling suggestions (more on this in a later post). However, a note of caution about aspell: all characters in a custom dictionary have to be in the same Unicode block, which means a dictionary can't contain, say, both French and Polish words with special characters specific to these languages. (This seems like a bug, because the documentation only mentions a same-script limitation.) Our solution is to remove diacritics from everything we put in the dictionary. That works for our purposes but could obviously be a major limitation for others.
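One common way to remove diacritics in Perl is sketched below, using the core Unicode::Normalize module. This is an illustration of the technique rather than the exact code we run, and the function name is made up.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Remove diacritics: decompose each character into a base letter followed by
# combining marks (NFD), then delete the combining marks.
sub strip_diacritics {
    my ($word) = @_;
    my $decomposed = NFD($word);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}
```

For example, `strip_diacritics("caf\x{e9}")` gives "cafe". Letters that have no canonical decomposition, such as the Polish "ł" (U+0142), pass through unchanged, so they would need an explicit mapping if they cause trouble.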
Friday, June 25, 2010
Implementing file upload progress bar for the new PhilPapers
First of all, the common programming tools (like CGI.pm or Mason, which we use here) assume that the page handler receives the whole request as input, and the whole request is not available until the file has finished uploading. So, for example, 'my $q = CGI->new' will not return until it is too late to measure the upload progress. The solution is to have a separate page report the upload progress, and to call that page via Ajax from the JavaScript code that updates the progress bar. This would work well, except that the file is normally uploaded to a temporary file with a random name, which the other script has no way to guess. So we need to generate a random file name in the form page and pass it both to the form handler script, so that it saves the data to that file, and to the Ajax script, which checks the size of that file.
To save the data into a specified filename I used the CGI.pm callback feature:
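use CGI;
use IO::Handle;    # provides $fh->flush on older Perls

# $fh is a write handle on the agreed target file; CGI.pm passes it to the hook
# as the 4th argument. A false third argument turns off CGI.pm's own temp file.
my $q = CGI->new( \&hook, $fh, undef );
...
# Called by CGI.pm for each chunk of upload data as it arrives.
sub hook {
    my ($filename, $buffer, $bytes_read, $fh) = @_;
    print {$fh} substr($buffer, 0, $bytes_read);
    $fh->flush();    # so the progress script sees the current file size
}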
It is described in the subsection called "Progress bars for file uploads and avoiding temp files" of the CGI.pm documentation, but it is a stretch to say that it supports progress bars: you cannot use it directly to get the progress from the CGI object on the form landing page, and you still need a separate script measuring the progress. For my solution, all I needed was to pass the target file name to the code saving the data, which ought to be easier than writing the callback above. And the callback is not everything: I still need a way to pass the generated file name from the form page to the form handler script, and not via form parameters, which are not available at that stage. So how can that be done? Simple: as PATH_INFO, which is available in the %ENV hash even before the parameters are parsed by CGI.pm.
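The progress-reporting side could look roughly like the sketch below. It assumes that the generated name is a hex token appended to the script URL, so that it arrives as PATH_INFO, and that uploads are written to a directory known to both scripts. The token format, the directory, and the function names are all hypothetical.

```perl
use strict;
use warnings;

# Extract the upload token from PATH_INFO, e.g. "/3f9a27c41b0e55d2".
# Only hex characters are accepted, so the value can never point outside
# the upload directory. Returns undef for anything else.
sub token_from_path_info {
    my ($path_info) = @_;
    return undef unless defined $path_info;
    return $path_info =~ m{\A/([0-9a-f]{8,64})\z} ? $1 : undef;
}

# Bytes written so far for a token; 0 if the file is empty or not yet created.
sub bytes_uploaded {
    my ($dir, $token) = @_;
    return (-s "$dir/$token") || 0;
}

# Build the plain-text CGI response that the Ajax poller reads.
sub progress_response {
    my ($dir, $path_info) = @_;
    my $token = token_from_path_info($path_info);
    return "Status: 400 Bad Request\r\nContent-Type: text/plain\r\n\r\nbad token\n"
        unless defined $token;
    return "Content-Type: text/plain\r\n\r\n" . bytes_uploaded($dir, $token) . "\n";
}
```

The progress script would end with something like `print progress_response('/tmp/uploads', $ENV{PATH_INFO});`. The form handler can use the same `token_from_path_info` on `$ENV{PATH_INFO}` before calling `CGI->new`, open the target file for writing, and pass that handle to the upload hook.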
This is the skeleton of the solution. The actual implementation has a few more details, but the code will soon be published as open source, so I hope everyone will be able to look them up there.
Thursday, April 8, 2010
Syndicating content from institutional repositories
- Google "dispositional properties".
- Search Google Scholar, Web of Science, or some other generic research index for "dispositional properties".
- Search a relevant subject repository for "dispositional properties". In this case PhilPapers would serve you well. Try searching for "dispositional properties" on PP. Not only do you get tons of highly relevant content, but you get a link to a bibliography on dispositional properties maintained by an expert on the subject.
- Ask content producers (academics) to submit their metadata to SRs.
- Harvest all content from all IRs, and filter out irrelevant content based on keywords or more advanced document classification techniques.
- Crowd-source a list of IRs and OAI sets relevant to your subject.
A definition of "subject repository"
A subject repository is a repository of research outputs (and possibly metadata about such outputs) whose primary mission is to give end users access to all and only the research content available in a given subject.
Thursday, April 1, 2010
Automatic categorization of citations using Perl
Sunday, March 28, 2010
Facilitating access to subscription-based resources -- Athens, Shibboleth, OpenURL, Reverse Proxy, etc
Friday, March 26, 2010
A new backup script with rsync, versioning and rotation
- Use rsync or equivalent for transfer to speed things up
- Keep rotating versions of backups
- Don't duplicate files unless needed on the backup host (using hard links)
- Can be configured from a text file
The script should run on any *nix machine with rsync. It should work with any *nix backup host as well, but you will need shell access to the backup host. You will also need to configure the user running the script for password-less login.
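The core idea (rotate the old versions, then let rsync hard-link unchanged files against the previous version) can be sketched as below. This is not the script itself: the `backup.N` directory naming and the function names are assumptions, and the paths are assumed to be absolute and free of spaces or shell metacharacters.

```perl
use strict;
use warnings;

# Shell commands to run on the backup host (e.g. over ssh) before a transfer:
# drop the oldest version, then shift backup.N to backup.N+1, oldest first.
sub rotation_commands {
    my ($root, $keep) = @_;
    my @cmds = ("rm -rf $root/backup." . ($keep - 1));
    for my $n (reverse 0 .. $keep - 2) {
        my $next = $n + 1;
        push @cmds, "[ -d $root/backup.$n ] && mv $root/backup.$n $root/backup.$next";
    }
    return @cmds;
}

# rsync invocation for the new version. --link-dest makes rsync hard-link
# files that are unchanged relative to the previous version (backup.1)
# instead of storing a second copy on the backup host.
sub rsync_command {
    my ($source, $host, $root) = @_;
    return ('rsync', '-a', '--delete',
            "--link-dest=$root/backup.1",
            "$source/", "$host:$root/backup.0/");
}
```

A driver would run each rotation command through `ssh $host` and then call `system(rsync_command(...))`, which is why shell access and password-less login on the backup host are needed.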