Tag Archives: data mining

Who Controls People’s Data?

The McKinsey Global Institute estimates that cross-border flows of goods, services and data added 10 per cent to global gross domestic product in the decade to 2015, with data providing a third of that increase. That share of the contribution seems likely to rise: conventional trade has slowed sharply, while digital flows have surged. Yet as the whole economy becomes more information-intensive — even heavy industries such as oil and gas are becoming data-driven — the cost of blocking those flows increases…

Yet that is precisely what is happening. Governments have sharply increased “data localisation” measures requiring information to be held in servers inside individual countries. The European Centre for International Political Economy, a think-tank, calculates that in the decade to 2016, the number of significant data localisation measures in the world’s large economies nearly tripled from 31 to 84.

Even in advanced economies, exporting data on individuals is heavily restricted because of privacy concerns, which have been highlighted by the Facebook/Cambridge Analytica scandal. Many EU countries have curbs on moving personal data even to other member states. Studies for the Global Commission on Internet Governance, an independent research project, estimate that current constraints — such as restrictions on moving data on banking, gambling and tax records — reduce EU GDP by half a per cent.

In China, the champion data localiser, restrictions are even more severe. As well as long-established controls over technology transfer and state surveillance of the population, such measures form part of its interventionist “Made in China 2025” industrial strategy, designed to make it a world leader in tech-heavy sectors such as artificial intelligence and robotics.

China’s Great Firewall has long blocked most foreign web applications, and a cyber security law passed in 2016 also imposed rules against exporting personal information, forcing companies including Apple and LinkedIn to hold information on Chinese users on local servers. Beijing has also given itself a variety of powers to block the export of “important data” on grounds of reducing vaguely defined economic, scientific or technological risks to national security or the public interest. “The likelihood that any company operating in China will find itself in a legal blind spot where it can freely transfer commercial or business data outside the country is less than 1 per cent,” says ECIPE director Hosuk Lee-Makiyama…

Other emerging markets, such as Russia, India, Indonesia and Vietnam, are also leading data localisers. Russia has blocked LinkedIn from operating there after it refused to transfer data on Russian users to local servers.

Business organisations including the US Chamber of Commerce want rules to restrain what they call “digital protectionism”. But data trade experts point to a serious hole in global governance, with a coherent approach prevented by the differing philosophies of the big trading powers. Susan Aaronson, a trade academic at George Washington University in Washington, DC, says: “There are currently three powers — the EU, the US and China — in the process of creating separate data realms.”

The most obvious way to protect international flows of data is in trade deals — whether multilateral, regional or bilateral. Yet the only World Trade Organization laws governing data flows predate the internet and have not been thoroughly tested through litigation. The WTO recently recruited Alibaba co-founder Jack Ma to front an ecommerce initiative, but officials involved admit it is unlikely to produce anything concrete for a long time. In any case, Prof Aaronson says: “While data has traditionally been addressed in trade deals as an ecommerce issue, it goes far wider than that.”

The internet has always been regarded by pioneers and campaigners as a decentralised, self-regulating community. Activists have tended to regard government intervention with suspicion, except for its role in protecting personal data, and many are wary of legislation to enable data flows. “While we support the approach of preventing data localisation, we need to balance that against other rights such as data protection, cyber security and consumer rights,” says Jeremy Malcolm, senior global policy analyst at the Electronic Frontier Foundation, a campaign group for internet freedom…

Europe has traditionally had a very different philosophy towards data and privacy from the US. In Germany, for instance, public opinion tends to support strict privacy laws — usually attributed to lingering memories of surveillance by the Stasi secret police in East Germany. The EU’s new General Data Protection Regulation (GDPR), which comes into force on May 25, 2018, imposes a long list of requirements on companies processing personal data, on pain of fines that could total as much as 4 per cent of annual turnover… But trade experts warn that the GDPR is very cautiously written, with a blanket exemption for measures claiming to protect privacy. Mr Lee-Makiyama says: “The EU text will essentially provide no meaningful restriction on countries wanting to practice data localisation.”

Against this political backdrop, the prospects for broad and binding international rules on data flow are dim… In the battle for dominance over setting rules for commerce, the EU and US often adopt contrasting approaches. While the US often tries to export its product standards in trade diplomacy, the EU tends to write rules for itself and let the gravity of its huge market pull other economies into its regulatory orbit. Businesses faced with multiple regulatory regimes will tend to work to the highest standard, a phenomenon widely known as the “Brussels effect”. Companies such as Facebook have promised to follow the GDPR throughout their global operations as the price of operating in Europe.

Excerpts from Data protectionism: the growing menace to global business, Financial Times, May 13, 2018

Behavior Mining: your smartphone knows you better

Currently, understanding and assessing the readiness of the warfighter is complex, intrusive, done relatively infrequently, and relies heavily on self-reporting. Readiness is determined through medical intervention with the help of advanced equipment, such as electrocardiographs (EKGs) and other specialized medical devices that are too expensive and cumbersome to employ continuously without supervision in non-controlled environments. On the other hand, currently 92% of adults in the United States own a cell phone, which could be used as the basis for continuous, passive health and readiness assessment. The WASH program will use data collected from cellphone sensors to enable novel algorithms that conduct passive, continuous, real-time assessment of the warfighter.

DARPA’s WASH [Warfighter Analytics using Smartphones for Health] will extract physiological signals, which may be weak and noisy, that are embedded in the data obtained through existing mobile device sensors (e.g., accelerometer, screen, microphone). Such extraction and analysis, done on a continuous basis, will be used to determine current health status and identify latent or developing health disorders. WASH will develop algorithms and techniques for identifying both known indicators of physiological problems (such as disease, illness, and/or injury) and deviations from the warfighter’s micro-behaviors that could indicate such problems.
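The solicitation describes goals rather than algorithms, so purely as an illustration: below is a minimal sketch of pulling one weak physiological signal (walking cadence) out of a noisy accelerometer trace. All function and variable names are hypothetical, and a real system would segment activities, denoise, and model per-person baselines.

```python
import numpy as np

def estimate_gait_cadence(accel_magnitude: np.ndarray, sample_rate_hz: float) -> float:
    """Estimate walking cadence (steps per second) from an accelerometer trace.

    A toy stand-in for the kind of passive signal extraction WASH describes.
    """
    # Remove the gravity/DC component so spectral peaks reflect periodic motion.
    centered = accel_magnitude - accel_magnitude.mean()
    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(centered), d=1.0 / sample_rate_hz)
    # Human gait cadence typically falls between roughly 1.2 and 3 steps/sec.
    band = (freqs >= 1.2) & (freqs <= 3.0)
    return float(freqs[band][np.argmax(spectrum[band])])
```

A sustained drift in a signal like this, measured against the individual’s own history, is the sort of “deviation from micro-behaviors” the program aims to flag.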

Excerpt from Warfighter Analytics using Smartphones for Health (WASH)
Solicitation Number: DARPA-SN-17-4, May 2, 2018

See also Modeling and discovering human behavior from smartphone sensing life-log data for identification purpose

Supply Chains Live: combating deforestation

366 companies, worth $2.9 trillion, have committed to eliminating deforestation from their supply chains, according to the organization Supply Change. Groups such as the Tropical Forest Alliance 2020, the Consumer Goods Forum and Banking Environment Initiative aim to help them achieve these goals.  Around 70 percent of the world’s deforestation still occurs as a result of production of palm oil, soy, beef, cocoa and other agricultural commodities. These are complex supply chains.  A global company like Cargill, for example, sources tropical palm, soy and cocoa from almost 2,000 mills and silos, relying on hundreds of thousands of farmers. Also, many products are traded on spot markets, so supply chains can change on a daily basis. Such scale and complexity make it difficult for global corporations to trace individual suppliers and root out bad actors from supply chains.

Global Forest Watch (GFW), a WRI-convened partnership that uses satellites and algorithms to track tree cover loss in near-real time, is one example. Any individual with a cell phone and internet connection can now check if an area of forest as small as a soccer penalty box was cleared anywhere in the world since 2001. GFW is already working with companies like Mars, Unilever, Cargill and Mondelēz in order to assess deforestation risks in an area of land the size of Mexico.

Other companies are also employing technological advances to track and reduce deforestation. Walmart, Carrefour and McDonald’s have been working together with their main beef suppliers to map forests around farms in the Amazon in order to identify risks and implement and monitor changes. Banco do Brasil and Rabobank are mapping the locations of their clients with a mobile-based application in order to comply with local legal requirements and corporate commitments. And Trase, a web tool, publicizes companies’ soy-sourcing areas by analyzing enormous publicly available datasets, exposing the deforestation risks in those supply chains…

[C]ompanies need to incorporate the issue into their core business strategies by monitoring deforestation consistently – the same way they would track stock markets.

With those challenges in mind, WRI and a partnership of major traders, retailers, food processors, financial institutions and NGOs are building the go-to global decision-support system for monitoring and managing land-related sustainability performance, with a focus on deforestation commitments. Early partners include Bunge, Cargill, Walmart, Carrefour, Mars, Mondelēz, the Inter-American Investment Corporation, the Nature Conservancy, Rainforest Alliance and more.  Using the platform, a company will be able to plot the location of thousands of mills, farms or municipalities; access alerts and dashboards to track issues such as tree cover loss and fires occurring in those areas; and then take action. Similarly, a bank will be able to map the evolution of deforestation risk across its whole portfolio. This is information that investors are increasingly demanding.

Excerpt from Save the Forests? There’s Now an App for That, World Resources Institute, Jan. 18, 2017

From Subversive to Submissive: the internet

The corridor where WWW was born. CERN, ground floor of building No.1

Free-speech advocates were aghast—and data-privacy campaigners were delighted—when the European Court of Justice (ECJ) embraced the idea of a digital “right to be forgotten” in May 2014. It ruled that search engines such as Google must not display links to “inadequate, irrelevant or no longer relevant” information about people who request that those links be removed, even if the information is correct and was published legally.

The uproar will be even louder should France’s highest administrative court, the Conseil d’État, soon decide against Google. The firm currently removes search results only for users in the European Union. But France’s data-protection authority, CNIL, says this is not enough: it wants Google to delete search links everywhere. Europe’s much-contested right to be forgotten would thus be given global reach. The court… may hand down a verdict by January.

The spread of the right to be forgotten is part of a wider trend towards the fragmentation of the internet. Courts and governments have embarked on what some call a “legal arms race” to impose a maze of national or regional rules, often conflicting, in the digital realm.

The internet has always been something of a subversive undertaking. As a ubiquitous, cross-border commons, it often defies notions of state sovereignty. A country might decide to outlaw a certain kind of service—a porn site or digital currency, say—only to see it continue to operate from other, more tolerant jurisdictions.

As long as cyberspace was a sideshow, governments did not much care. But as it has penetrated every facet of life, they feel compelled to control it. The internet—and even more so cloud computing, ie, the storage of vast amounts of data and the supply of myriad services online—has become the world’s über-infrastructure. It is creating great riches: according to the Boston Consulting Group, the internet economy (e-commerce, online services and data networks, among other things) will make up 5.3% of GDP this year in G20 countries. But it also comes with costs beyond the erosion of sovereignty. These include such evils as copyright infringement, cybercrime, the invasion of privacy, hate speech, espionage—and perhaps cyberwar.

In response, governments are trying to impose their laws across the whole of cyberspace. The virtual and real worlds are not entirely separate. The term “cloud computing” is misleading: at its core are data centres the size of football fields which have to be based somewhere…

New laws often include clauses with extraterritorial reach. The EU’s General Data Protection Regulation will apply from 2018 to all personal information on European citizens, even if the company holding it is based abroad.

In many cases, laws seek to keep data within, or without, national borders. China has pioneered the blocking of internet addresses with its Great Firewall, but the practice has spread to the likes of Iran and Russia. Another approach is “data localisation” requirements, which mandate that certain types of digital information must be stored locally or remain in the country. A new law in Russia, for instance, requires that the personal information of Russian citizens is kept in national databases… Elsewhere, though, data-localisation policies are meant to protect citizens from snooping by foreign powers. Germany has particularly stringent data-protection laws which hamper attempts by the European Commission, the EU’s civil service, to reduce regulatory barriers to the free flow of data between member states.

Fragmentation caused by government action would be less of a concern if other factors were not also pushing in the same direction: new technologies, such as firewalls and a separate “dark web” that is only accessible using a special browser. Commercial interests, too, are a dividing force. Apple, Facebook, Google and other tech giants try to keep users in their own “walled gardens”. Many online firms “geo-block” their services, so that they cannot be used abroad…

Internet experts distinguish between governance “of” the internet (all of the underlying technical rules that make it tick) and regulation “on” the internet (how it is used and by whom). The former has produced a collection of “multi-stakeholder” organisations, the best-known of which are ICANN, which oversees the internet’s address system, and the Internet Engineering Task Force, which comes up with technical standards…

Finding consensus on technical problems, where one solution often is clearly better than another, is easier than on legal and political matters. One useful concept might be “interoperability”: the internet is a network of networks that follow the same communication protocols, even if the structure of each may differ markedly.

Excerpts from Online governance: Lost in the splinternet, Economist, Nov. 5, 2016

America Inc. and its Moat


Warren Buffett, the 21st century’s best-known investor, extols firms that have a “moat” around them—a barrier that offers stability and pricing power. One way American firms have improved their moats in recent times is through creeping consolidation. The Economist has divided the economy into 900-odd sectors covered by America’s five-yearly economic census. Two-thirds of them became more concentrated between 1997 and 2012. The weighted average share of the top four firms in each sector has risen from 26% to 32%…

These data make it possible to distinguish between sectors of the economy that are fragmented, concentrated or oligopolistic, and to look at how revenues have fared in each case. Revenues in fragmented industries—those in which the biggest four firms together control less than a third of the market—dropped from 72% of the total in 1997 to 58% in 2012. Concentrated industries, in which the top four firms control between a third and two-thirds of the market, have seen their share of revenues rise from 24% to 33%. And just under a tenth of the activity takes place in industries in which the top four firms control two-thirds or more of sales. This oligopolistic corner of the economy includes niche concerns—dog food, batteries and coffins—but also telecoms, pharmacies and credit cards.
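The article’s three buckets follow mechanically from the four-firm concentration ratio, so they can be stated precisely. A small sketch using the article’s own thresholds, with made-up revenue figures:

```python
def cr4(firm_revenues: list[float]) -> float:
    """Four-firm concentration ratio: combined market share of the top four firms."""
    top4 = sum(sorted(firm_revenues, reverse=True)[:4])
    return top4 / sum(firm_revenues)

def classify_sector(firm_revenues: list[float]) -> str:
    """Apply the article's cutoffs: <1/3 fragmented, 1/3-2/3 concentrated, else oligopolistic."""
    share = cr4(firm_revenues)
    if share < 1 / 3:
        return "fragmented"
    if share < 2 / 3:
        return "concentrated"
    return "oligopolistic"

# Hypothetical sector where four firms book 40, 25, 10 and 5 of 100 in revenue:
# cr4 = 0.80, so classify_sector(...) returns "oligopolistic".
```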

The ability of big firms to influence and navigate an ever-expanding rule book may explain why the rate of small-company creation in America is close to its lowest mark since the 1970s … Small firms normally lack both the working capital needed to deal with red tape and long court cases, and the lobbying power that would bend rules to their purposes….

Another factor that may have made profits stickier is the growing clout of giant institutional shareholders such as BlackRock, State Street and Capital Group. Together they own 10-20% of most American companies, including ones that compete with each other. Claims that they rig things seem far-fetched, particularly since many of these funds are index trackers; their decisions as to what to buy and sell are made for them. But they may well set the tone, for example by demanding that chief executives remain disciplined about pricing and restraining investment in new capacity. The overall effect could mute competition.

The cable television industry has become more tightly controlled, and many Americans rely on a monopoly provider; prices have risen at twice the rate of inflation over the past five years. Consolidation in one of Mr Buffett’s favourite industries, railroads, has seen freight prices rise by 40% in real terms and returns on capital almost double since 2004. The proposed merger of Dow Chemical and DuPont, announced last December, illustrates the trend to concentration.

Roughly another quarter of abnormal profits comes from the health-care industry, where a cohort of pharmaceutical and medical-equipment firms make aggregate returns on capital of 20-50%. The industry is riddled with special interests and is governed by patent rules that allow firms temporary monopolies on innovative new drugs and inventions. Much of health-care purchasing in America is ultimately controlled by insurance firms. Four of the largest, Anthem, Cigna, Aetna and Humana, are planning to merge into two larger firms.

The rest of the abnormal profits are to be found in the technology sector, where firms such as Google and Facebook enjoy market shares of 40% or more…

But many of these arguments can be spun the other way. Alphabet, Facebook and Amazon are not being valued by investors as if they are high risk, but as if their market shares are sustainable and their network effects and accumulation of data will eventually allow them to reap monopoly-style profits. (Alphabet is now among the biggest lobbyists of any firm, spending $17m last year.)…

Perhaps antitrust regulators will act, forcing profits down. The relevant responsibilities are mostly divided between the Department of Justice (DoJ) and the Federal Trade Commission (FTC), although some… [But] lots of important subjects are beyond their purview. They cannot consider whether the length and security of patents is excessive in an age when intellectual property is so important. They may not dwell deeply on whether the business model of large technology platforms such as Google has a long-term dependence on the monopoly rents that could come from its vast and irreproducible stash of data. They can only touch upon whether outlandishly large institutional shareholders with positions in almost all firms can implicitly guide them not to compete head on; or on why small firms seem to be struggling. Their purpose is to police illegal conduct, not reimagine the world. They lack scope.

Nowhere has the alternative approach been articulated. It would aim to unleash a burst of competition to shake up the comfortable incumbents of America Inc. It would involve a serious effort to remove the red tape and occupational-licensing schemes that strangle small businesses and deter new entrants. It would examine a loosening of the rules that give too much protection to some intellectual-property rights. It would involve more active, albeit cruder, antitrust actions. It would start a more serious conversation about whether it makes sense to have most of the country’s data in the hands of a few very large firms. It would revisit the entire issue of corporate lobbying, which has become a key mechanism by which incumbent firms protect themselves.

Excerpts from Too Much of a Good Thing, Economist, Mar. 26, 2016, at 23

Over-eating…Data

U.S. Marine Corps Sgt. A.C. Wilson uses a retina scanner to positively identify a member of the Baghdaddi city council prior to a meeting with local tribal figureheads, sheiks, community leaders and U.S. service members deployed with Regimental Combat Team-7 in Baghdaddi, Iraq, on Jan. 10, 2007. Photo released by DOD

Despite their huge potential, artificial intelligence and biometrics still very much need human input for accurate identification, according to the director of the Defense Advanced Research Projects Agency. Speaking at an Atlantic Council event, Arati Prabhakar said that while the best facial recognition systems out there are statistically better than most humans at image identification, when they’re wrong, “they are wrong in ways that no human would ever be wrong”…

“You want to embrace the power of these new technologies but be completely clear-eyed about what their limitations are so that they don’t mislead us,” Prabhakar said. That’s a stance humans must take with technology writ large, she said, explaining her hesitance to take for granted what many of her friends in Silicon Valley often assume — that more data is always a good thing. “More data could just mean that you have so much data that whatever hypothesis you have, you can find something that supports it,” Prabhakar said.

See also DARPA Brandeis Project; Facebook’s collection of biometric information

DARPA director cautious over AI, biometrics, Planet Biometrics, May 4, 2016

What’s Your Threat Score?


Among the 38 previously undisclosed companies receiving In-Q-Tel funding, the research focus that stands out is social media mining and surveillance; the portfolio document lists several tech companies pursuing work in this area, including Dataminr, Geofeedia, PATHAR, and TransVoyant… The investments appear to reflect the CIA’s increasing focus on monitoring social media. In September 2015, David Cohen, the CIA’s second-highest ranking official, spoke at length at Cornell University about a litany of challenges stemming from the new media landscape. The Islamic State’s “sophisticated use of Twitter and other social media platforms is a perfect example of the malign use of these technologies,” he said…

The latest round of In-Q-Tel investments comes as the CIA has revamped its outreach to Silicon Valley, establishing a new wing, the Directorate of Digital Innovation…

Dataminr directly licenses a stream of data from Twitter to visualize and quickly spot trends on behalf of law enforcement agencies and hedge funds, among other clients. Geofeedia specializes in collecting geotagged social media messages, from platforms such as Twitter and Instagram, to monitor breaking news events in real time. The company, which counts dozens of local law enforcement agencies as clients, markets its ability to track activist protests on behalf of both corporate interests and police departments. PATHAR mines social media to determine networks of association…

PATHAR’s product, Dunami, is used by the Federal Bureau of Investigation to “mine Twitter, Facebook, Instagram and other social media to determine networks of association, centers of influence and potential signs of radicalization,” according to an investigation by Reveal.

TransVoyant, founded by former Lockheed Martin Vice President Dennis Groseclose, provides a similar service by analyzing multiple data points for so-called decision-makers. The firm touts its ability to monitor Twitter to spot “gang incidents” and threats to journalists. A team from TransVoyant has worked with the U.S. military in Afghanistan to integrate data from satellites, radar, reconnaissance aircraft, and drones…

The recent wave of investments in social media-related companies suggests the CIA has accelerated the drive to make collection of user-generated online data a priority. Alongside its investments in start-ups, In-Q-Tel has also developed a special technology laboratory in Silicon Valley, called Lab41, to provide tools for the intelligence community to connect the dots in large sets of data.  In February, Lab41 published an article exploring the ways in which a Twitter user’s location could be predicted with a degree of certainty through the location of the user’s friends. On Github, an open source website for developers, Lab41 currently has a project to ascertain the “feasibility of using architectures such as Convolutional and Recurrent Neural Networks to classify the positive, negative, or neutral sentiment of Twitter messages towards a specific topic.”
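Lab41’s February write-up is prose rather than code, but the idea it explores is simple to caricature: even when a user never geotags anything, their friends’ locations leak it. A toy illustration, assuming each friend’s self-reported city is known (all names hypothetical):

```python
from collections import Counter

def predict_user_city(friend_cities: list[str]) -> tuple[str, float]:
    """Guess a user's city as the most common city among their friends,
    returning the guess and the fraction of friends supporting it."""
    counts = Counter(friend_cities)
    city, votes = counts.most_common(1)[0]
    return city, votes / len(friend_cities)

# predict_user_city(["Oakland", "Oakland", "Chicago", "Oakland", "SF"])
# -> ("Oakland", 0.6)
```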

Collecting intelligence on foreign adversaries has potential benefits for counterterrorism, but such CIA-supported surveillance technology is also used for domestic law enforcement and by the private sector to spy on activist groups.

Palantir, one of In-Q-Tel’s earliest investments in the social media analytics realm, was exposed in 2011 by the hacker group LulzSec to be in negotiation over a proposal to track labor union activists and other critics of the U.S. Chamber of Commerce, the largest business lobbying group in Washington. The company, now celebrated as a “tech unicorn” …

Geofeedia, for instance, promotes its research into Greenpeace activists, student demonstrations, minimum wage advocates, and other political movements. Police departments in Oakland, Chicago, Detroit, and other major municipalities have contracted with Geofeedia, as have private firms such as the Mall of America and McDonald’s.

Lee Guthman, an executive at Geofeedia, told reporter John Knefel that his company could predict the potential for violence at Black Lives Matter protests just by using the location and sentiment of tweets. Guthman said the technology could gauge sentiment by attaching “positive and negative points” to certain phrases, while measuring “proximity of words to certain words.”
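Geofeedia has not published its scoring scheme, so the following is only a guess at the shape of what Guthman describes: a lexicon attaching positive and negative points to phrases, with proximity to certain words adjusting the weight. The word lists and rule are invented for illustration.

```python
POSITIVE = {"peaceful", "celebrate", "support"}
NEGATIVE = {"riot", "violence", "clash"}
INTENSIFIERS = {"very", "extremely"}

def phrase_score(tokens: list[str]) -> int:
    """Sum +1/-1 points for lexicon hits, doubling a hit that sits
    immediately after an intensifier (a crude proximity rule)."""
    score = 0
    for i, tok in enumerate(tokens):
        point = (tok in POSITIVE) - (tok in NEGATIVE)  # +1, -1 or 0
        if point and i > 0 and tokens[i - 1] in INTENSIFIERS:
            point *= 2
        score += point
    return score

# phrase_score("the crowd stayed very peaceful".split())  ->  2
# phrase_score("clash near the park".split())             -> -1
```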

Privacy advocates, however, have expressed concern about these sorts of automated judgments. “When you have private companies deciding which algorithms get you a so-called threat score, or make you a person of interest, there’s obviously room for targeting people based on viewpoints or even unlawfully targeting people based on race or religion,” said Lee Rowland, a senior staff attorney with the American Civil Liberties Union.

Excerpt from Lee Fang, The CIA Is Investing in Firms That Mine Your Tweets and Instagram Photos, Intercept, Apr. 14, 2016

The Beauty of Platform Capitalism


Hardly a day goes by without some tech company proclaiming that it wants to reinvent itself as a platform. …Some prominent critics even speak of “platform capitalism” – a broader transformation of how goods and services are produced, shared and delivered. Such is the transformation we are witnessing across many sectors of the economy: taxi companies used to transport passengers, but Uber just connects drivers with passengers. Hotels used to offer hospitality services; Airbnb just connects hosts with guests. And this list goes on: even Amazon connects booksellers with buyers of used books…

But Uber’s offer to drivers in Seoul does raise some genuinely interesting questions. What is it that Uber’s platform offers that traditional cabs can’t get elsewhere? It’s mostly three things: payment infrastructure to make transactions smoother; identity infrastructure to screen out any unwanted passengers; and sensor infrastructure, present on our smartphones, which traces the location of the car and the customer in real time. This list has hardly anything to do with transport; these are the kinds of peripheral activities that traditional taxi companies have always ignored.

However, with the transition to a knowledge-based economy, these peripherals are no longer really peripherals – they are at the very centre of service provision. There’s a good reason why so many platforms are based in Silicon Valley: the main peripherals today are data, algorithms and server power. And this explains why so many renowned publishers would team up with Facebook to have their stories published there in a new feature called Instant Articles. Most of them simply do not have the know-how and the infrastructure to be as nimble, resourceful and impressive as Facebook when it comes to presenting the right articles to the right people at the right time – and doing it faster than any other platform.

Few industries could remain unaffected by the platform fever. The unspoken truth, though, is that most of the current big-name platforms are monopolies, riding on the network effects of operating a service that becomes more valuable as more people join it. This is why they can muster so much power; Amazon is in constant power struggles with publishers – but there is no second Amazon they can turn to.

Venture capitalists such as Peter Thiel want us to believe that this monopoly status is a feature, not a bug: if these companies weren’t monopolies, they would never have so much cash to spend on innovation.  This, however, still doesn’t address the question of just how much power we should surrender to these companies.

Making sure that we can move our reputation – as well as our browsing history and a map of our social connections – between platforms would be a good start. Treating other, more technical parts of the emerging platform landscape – from services that can verify our identity to new payment systems to geolocational sensors – as actual infrastructure, and thus ensuring that everybody can access them on the same, nondiscriminatory terms, is also badly needed.

Most platforms are parasitic: feeding off existing social and economic relations. They don’t produce anything on their own – they only rearrange bits and pieces developed by someone else. Given the enormous – and mostly untaxed – profits made by such corporations, the world of “platform capitalism”, for all its heady rhetoric, is not so different from its predecessor. The only thing that’s changed is who pockets the money.

Excerpt from Evgeny Morozov, Where Uber and Amazon rule: welcome to the world of the platform, Guardian, Nov. 15, 2015

MEMEX in Action: Searching the Deep Web

Mechanical spider. Image from Wikipedia

DARPA’s Memex search technologies have garnered much interest due to their initial mainstream application: to uncover human trafficking operations taking place on the “dark web”, the catch-all term for the various internet networks the majority of people never use, such as Tor, Freenet and I2P. And a significant number of law enforcement agencies have inquired about using the technology. But Memex promises to be disruptive across both criminal and business worlds.

Christopher White, who leads the team of Memex partners, which includes members of the Tor Project, a handful of prestigious universities, NASA and research-focused private firms, tells FORBES the project is so ambitious in its scope, it wants to shake up a staid search industry controlled by a handful of companies: Google, Microsoft and Yahoo.

Putting those grandiose ideas into action, DARPA will today open source various components of Memex, allowing others to take the technologies and adapt them for their own use. As is noticeable from the list of technologies below, there’s great possibility for highly personalised search, whether for agents trying to bring down pedophiles or the next Silk Road, or anyone who wants a less generic web experience.

Uncharted Software, University of Southern California and Next Century Corporation
These three have produced the front-end interfaces, called TellFinder and DIG, currently being used by Memex’s law enforcement partners. “They’re very good at making things look slick and shiny. Processing and displaying information is really hard and quite subjective,” says White.

The ArrayFire tech is a software library designed to support accelerated computing, turbo-boosting web searches over GPUs. “A few lines of code in ArrayFire can replace dozens of lines of parallel computing code, saving users valuable time and lowering development costs,” the blurb for the technology reads.

Carnegie Mellon University (CMU) is building various pieces of the Memex puzzle, but its TJBatchExtractor is what’s going open source today. It allows a user to extract data, such as a name, organisation or location, from advertisements. It was put to good use in the anti-human trafficking application already in use by law enforcement agencies.

Diffeo’s Dossier Stack learns what a user wants as they search the internet. “Instead of relying on Google’s ranking to tell you what’s important, you can say, ‘I want the Thomas that’s in the UK not the US, so don’t send me anything that has US-oriented information,’” explains White.

Hyperion Gray’s crawlers are designed to replicate human interaction with websites. “Think of what they do as web crawling on steroids,” says White. Its AutoLogin component takes authentication credentials funnelled into the system to crawl into password-protected areas of websites, whilst Formasaurus does the same but for web forms, determining what happens when fields are filled in. The Frontera, SourcePin and Splash tools make it easy for the average user to organise and view the kind of content they want in their results. Its HG Profiler code looks for matches of data across different pages where there’s no hyperlink making it obvious. Hyperion Gray also built Scrapy-­Dockerhub, which allows easy repackaging of crawlers into Docker containers, allowing for “better and easier web crawling”, notes White.
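Hyperion Gray’s code is not reproduced here, but the pattern AutoLogin and Formasaurus automate (find the login form, classify its fields, submit credentials, then crawl with the authenticated session) is the stock Scrapy idiom. A bare sketch against a hypothetical site, with all credentials and URLs invented:

```python
import scrapy

class AuthedSpider(scrapy.Spider):
    """Sketch of crawling behind a login form; not Hyperion Gray's code."""
    name = "authed"
    start_urls = ["https://example.org/login"]  # hypothetical site

    def parse(self, response):
        # A Formasaurus-style classifier would find the form and its field
        # names automatically; here they are assumed to be known.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Follow every link reachable with the authenticated session.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.after_login)
```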

IST Research and Parse.ly: “These tools [Scrapy Cluster, pykafka and streamparse] are major infrastructure components so that you can build a very scalable, real-time web crawling architecture.”
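As a rough sketch of how such components fit together: pykafka moves crawl output through a Kafka topic, so producers (crawlers) and consumers (indexers, streamparse topologies) can scale independently. The broker address, topic name and handler below are all hypothetical.

```python
from pykafka import KafkaClient

def handle_page(raw: bytes) -> None:
    """Stand-in for real downstream work: parse, extract, index."""
    print(raw[:60])

client = KafkaClient(hosts="127.0.0.1:9092")   # hypothetical broker
topic = client.topics[b"crawled-pages"]        # hypothetical topic

# A crawler process publishes each fetched page...
with topic.get_sync_producer() as producer:
    producer.produce(b'{"url": "https://example.org", "html": "..."}')

# ...and any number of downstream workers consume the stream independently.
consumer = topic.get_simple_consumer(consumer_group=b"indexer")
for message in consumer:
    if message is not None:
        handle_page(message.value)
```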

Jet Propulsion Laboratory (JPL). This NASA-based organisation has crafted a slew of Memex building blocks, four of which – ImageCat, FacetSpace, LegisGATE and ImageSpace – are applications built on top of Apache Software Foundation projects that allow users to analyse and manipulate vast numbers of images and masses of text. JPL also created a video and image analysis system called SMQTK to rank that kind of visual content based on relevance, making it easy for the user to connect files to the topic they care about. Its Memex Explorer brings all those tools together under a common interface.

MIT Lincoln Laboratory.  Three of MIT’s contributions – Text.jl, MITIE, Topic – are natural language processing tools. They allow the user, for example, to search for where two organisations are mentioned in different documents, or to ask for terse descriptions of what a document or a webpage is about.

New York University. NYU, in collaboration with JPL and Continuum Analytics, has created an interface called Topic, which lets the user interact with “focused crawlers”, which consistently update indexes to produce what’s relevant to the user, always “narrowing the thing they’re crawling”, notes White. “We have a few of these different kinds of crawlers as it’s not clear for every domain what the right crawling strategy is.”

Qadium.  This San Francisco firm has submitted a handful of utilities that allow for “data marshalling”, a way to organise data so it can be inspected in different ways.

Sotera Defense Solutions. This government contractor has created the aptly-named DataWake. It collects all links that the user didn’t click on but could, and maybe should, have. This “wake” includes the data behind those links.

SRI International. SRI is working alongside the Tor Project, the US Navy and some of the original creators of Tor, the anonymising browser that encrypts traffic and loops users through a number of servers to protect their identities. SRI has developed a “dark crawler” called the Hidden Service Forum Spider, which grabs content from Hidden Services – those sites that are hosted on Tor nodes and used for especially private services, be they drug markets or human rights forums for those living under repressive regimes. The HSProbe, meanwhile, looks for Hidden Service domains. The Memex team is keen to learn more about the darker corners of the web, partly to help law enforcement clean it of illegal content, but also to get a better understanding of how big the unmapped portions of the internet are.

DARPA is funding the Tor Project, which is one of the most active supporters of privacy in the technological world, and the US Naval Research Laboratory to test the Memex tools. DARPA said Memex wasn’t about destroying the privacy protections offered by Tor, even though it wanted to help uncover criminals’ identities. “None of them [Tor, the Navy, Memex partners] want child exploitation and child pornography to be accessible, especially on Tor. We’re funding those groups for testing,” says White.

DeepDive from Stanford turns text and multimedia into “knowledge bases”, creating connections between relationships of the different people or groups being searched for. “It’s machine learning tech for inferring patterns, working relationships… finding links across a very large amount of documents,” adds White.

Excerpts from Thomas Fox-Brewster, Watch Out Google, DARPA Just Open Sourced All This Swish ‘Dark Web’ Search Tech, Forbes, Apr. 17, 2015

For extensive information see DARPA MEMEX

Online Anonymity Guaranteed by DARPA

International Data Encryption Algorithm. Image from Wikipedia

From the DARPA website—DARPA “BRANDEIS” PROGRAM AIMS TO ENSURE ONLINE PRIVACY

DARPA announced plans on March 11, 2015 to research and develop tools for online privacy, one of the most vexing problems facing the connected world as devices and data proliferate beyond a capacity to be managed responsibly. Named for former Supreme Court Justice Louis Brandeis, who while a student at Harvard Law School co-developed the concept of a “right to privacy”… The goal of DARPA’s newly launched Brandeis program is to enable information systems that would allow individuals, enterprises and U.S. government agencies to keep personal and/or proprietary information private.

Existing methods for protecting private information fall broadly into two categories: filtering the release of data at the source, or trusting the user of the data to provide diligent protection. Filtering data at the source, such as by removing a person’s name or identity from a data set or record, is increasingly inadequate because of improvements in algorithms that can cross-correlate redacted data with public information to re-identify the individual. According to research conducted by Dr. Latanya Sweeney at Carnegie Mellon University, birthdate, zip code and gender are sufficient to identify 87% of Americans by name.
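Sweeney’s point is easy to demonstrate in miniature: count how many records in a dataset are unique on the three quasi-identifiers, since a unique record can be re-identified by joining against any public dataset (such as a voter roll) that also carries them. A sketch with fabricated records:

```python
from collections import Counter

def reidentification_risk(records: list[tuple]) -> float:
    """Fraction of records whose (birthdate, zip, gender) combination is unique
    in the dataset; unique combinations are the re-identifiable ones."""
    counts = Counter(records)
    unique = sum(1 for r in records if counts[r] == 1)
    return unique / len(records)

people = [
    ("1975-03-02", "02138", "F"),
    ("1975-03-02", "02138", "F"),   # shares a combination (k=2): safer
    ("1960-07-19", "94110", "M"),   # unique: re-identifiable
]
print(reidentification_risk(people))  # 0.333...
```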

On the other side of the equation, trusting an aggregator and other data recipients to diligently protect their store of data is also difficult. In the past few months alone, as many as 80 million social security numbers were stolen from a health insurer, terabytes of sensitive corporate data (including personnel records) were exfiltrated from a major movie studio and many personal images were illegitimately downloaded from cloud services.

Currently, we do not have effective mechanisms to protect data ourselves, and the people with whom we share data are often not effective at providing adequate protection.

The vision of the Brandeis program is to break the tension between (a) maintaining privacy and (b) being able to tap into the huge value of data. Rather than having to balance between them, Brandeis aims to build a third option, enabling safe and predictable sharing of data in which privacy is preserved. Specifically, Brandeis will develop tools and techniques that enable us to build systems in which private data may be used only for its intended purpose and no other. The potential for impact is dramatic. Assured data privacy can open the doors to personal medicine (leveraging cross-linked genotype/phenotype data), effective smart cities (where buildings, energy use, and traffic controls are all optimized minute by minute), detailed global data (where every car is gathering data on the environment, weather, emergency situations, etc.), and fine-grained internet awareness (where every company and device shares network and cyber-attack data). Without strong privacy controls, every one of these possibilities would face systematic opposition [it should].

From the DARPA website,

DARPA Brandeis (pdf)