E-Science can help social scientists extract relevant and timely conclusions from masses of disparate data sources. Judy Redfearn outlines the process.
Social scientists often need data from several sources to answer complex questions. Pulling out the relevant attributes from each dataset and uniting them for analysis is laborious and time-consuming using traditional methods. The data may have been collected by disparate organisations for separate purposes and stored in different formats in far-flung locations. Now, however, two projects funded by the Economic and Social Research Council have demonstrated the potential of e-Science techniques to speed the process and offer insights into vital social science research questions.
One project has illuminated the long-standing debate over the extent to which neighbourhood influences an individual's likelihood of committing a crime compared with other factors such as family, school and personal choice. Neighbourhood accounts for 80 per cent of the risk, the project found. Another discovery was that proximity to a criminal's home accounts for 70 per cent of an individual's risk of becoming a victim of crime.
The research team at Sheffield University obtained data from a number of sources about the location of crimes, the location of offenders' and victims' homes and a variety of social and economic measures over postcode-sized areas in South Yorkshire. From these data, they developed a model that predicted the distribution of offenders hectare by hectare. They then tested the model for all of England using the White Rose Grid, which pools computing resources at the universities of Leeds, Sheffield and York.
"We needed the Grid to do the sums for 35 million cells over England," says Max Craglia, who worked on the project at Sheffield. "It allowed us to run the model and analyse the patterns. We were able to do in one hour what would have taken 40 hours on one computer."
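The times Craglia quotes imply roughly a 40-fold speed-up, which is what one would expect when each cell can be scored independently and the work is shared out. The following is a minimal sketch of that idea, not the project's code: the scoring function `predict_offender_rate`, its coefficients and the toy cell attributes are all hypothetical, and a local thread pool merely stands in for the separate machines a Grid would provide.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def predict_offender_rate(cell):
    """Hypothetical stand-in for a per-hectare model (not the Sheffield model)."""
    deprivation, density = cell
    return 1.0 / (1.0 + math.exp(-(0.8 * deprivation + 0.3 * density - 2.0)))


def split_into_chunks(cells, n_chunks):
    """Split a list of cells into contiguous parts of near-equal size."""
    if not cells:
        return []
    size = math.ceil(len(cells) / n_chunks)
    return [cells[i:i + size] for i in range(0, len(cells), size)]


def score_chunk(chunk):
    """Score every cell in one chunk; chunks do not depend on each other."""
    return [predict_offender_rate(cell) for cell in chunk]


def score_cells_parallel(cells, n_workers=4):
    """Score all cells by handing one chunk to each worker."""
    chunks = split_into_chunks(cells, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = pool.map(score_chunk, chunks)
        return [score for chunk_scores in results for score in chunk_scores]


# Toy data: 1,000 made-up cells instead of 35 million real ones.
cells = [(i % 5, (i * 7) % 3) for i in range(1000)]
serial = [predict_offender_rate(cell) for cell in cells]
parallel = score_cells_parallel(cells, n_workers=8)
```

Because no cell's score depends on another's, the chunked run gives the same answers as the serial loop; only the elapsed time changes when the chunks really do run on separate processors.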
The other project also used Grid computing to explore an important social issue: the distribution of poverty in different ethnic minority groups.
Detailed surveys provide data on household income, but their samples are too small to provide meaningful information on ethnic minority incomes. Large-scale surveys such as the census record data on the geographical spread of ethnic minorities, but not on incomes. Simon Peters and colleagues at Manchester University used the British Household Panel Survey (BHPS), a detailed survey of a small sample of the UK population, to derive a model of individual income. They applied this to 1991 Census data to reveal insights into the distribution of poverty in different ethnic minority groups in different geographical areas. They were unable to use the 2001 Census because of confidentiality restrictions.
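The general approach of fitting an income model on a small, detailed survey and then applying it to census records that lack income can be sketched as follows. This is an illustration under stated assumptions, not the Manchester team's model: the predictor variables, the group-mean "model", the poverty line, the group labels "A" and "B" and every figure in the toy data are invented.

```python
from collections import defaultdict
from statistics import mean

# Invented survey rows: (employment_status, household_size, weekly_income).
survey = [
    ("employed", 2, 320.0), ("employed", 2, 280.0),
    ("employed", 5, 210.0), ("employed", 5, 190.0),
    ("unemployed", 2, 120.0), ("unemployed", 2, 100.0),
    ("unemployed", 5, 70.0), ("unemployed", 5, 90.0),
]

# Invented census rows with no income field:
# (region, ethnic_group, employment_status, household_size).
census = [
    ("North", "A", "employed", 2),
    ("North", "A", "unemployed", 5),
    ("North", "B", "unemployed", 5),
    ("North", "B", "unemployed", 2),
    ("South", "A", "employed", 2),
    ("South", "B", "employed", 5),
]


def fit_income_model(rows):
    """Estimate mean income for each (status, household size) combination."""
    incomes = defaultdict(list)
    for status, size, income in rows:
        incomes[(status, size)].append(income)
    return {key: mean(values) for key, values in incomes.items()}


def poverty_rates(census_rows, model, poverty_line):
    """Share of people per (region, group) whose predicted income is below the line."""
    counts = defaultdict(lambda: [0, 0])  # [number poor, number of people]
    for region, group, status, size in census_rows:
        predicted = model[(status, size)]
        counts[(region, group)][0] += predicted < poverty_line
        counts[(region, group)][1] += 1
    return {key: poor / total for key, (poor, total) in counts.items()}


model = fit_income_model(survey)
rates = poverty_rates(census, model, poverty_line=150.0)
```

The survey supplies the link between personal characteristics and income; the census supplies the numbers of people with those characteristics in each area and group. A real analysis would use a proper statistical model and account for its uncertainty rather than relying on simple group means.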
First, they "Grid-enabled" the BHPS and census data and stored them on a data node of the National Grid Service (NGS). This eased extraction and integration of the required data. Then they ran the computational analysis using resources on the NGS and displayed the results on maps of the UK that showed the geographical distribution of measures of poverty by ethnic group.
"Our findings were in line with other researchers' work since 1991. This showed that our methodology works," Peters says. Although poverty rates in ethnic minorities are higher than for the rest of the population, as expected, the project found considerable variation across geographical areas and different ethnic minority groups. Indians, for example, have poverty rates close to those of whites, whereas for other groups, such as Pakistanis and Bangladeshis, poverty rates are particularly high in certain regions such as the North of England.
Both projects showed that e-Science can speed complex analyses in social science research and yield interesting results. However, more development is needed before e-Science techniques can be applied routinely by social scientists with little computing experience.